Robust Monocular Depth Estimation Under Crop-Resize-Induced Intrinsics Mismatch

Kim, Huijun; Lee, Deokwoo

doi:10.3390/electronics15102180

Open Access Article

by Huijun Kim and Deokwoo Lee *

Department of Computer Engineering, Keimyung University, Daegu 42601, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2180; https://doi.org/10.3390/electronics15102180

Submission received: 30 March 2026 / Revised: 15 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Advances in Digital Signal and Image Processing, Techniques, and Computations with Multidisciplinary Applications, 2nd Edition)

Abstract

Monocular metric depth estimation models increasingly incorporate camera intrinsics or learn camera-aware representations to recover physically meaningful scales. However, existing camera-aware studies have paid limited attention to a practical deployment gap: common crop, resize, and padding operations alter the image coordinate system, but inference pipelines may still pass stale camera intrinsics to the model. This image–intrinsics inconsistency produces a crop-resize-induced intrinsics mismatch, which can lead to systematic depth bias and degraded geometric consistency. We show that once preprocessing parameters are fixed, the effective intrinsics are determined by a parametric affine mapping, enabling resize-induced focal-length scaling errors and crop-induced principal-point shifts to be analyzed separately. We further distinguish this parameter-conditioned mismatch from broader calibration uncertainty caused by noisy intrinsics, noisy preprocessing metadata, or missing metadata. Based on this formulation, we introduce a controlled evaluation protocol and a lightweight Mismatch-Aware Camera Module (MACM) that combines preprocessing metadata with image-derived camera cues to condition intermediate depth features. In our ablation study, MACM with the proposed consistency loss reduces the mismatched Abs.Rel from 0.141 to 0.114 and narrows the robustness gap from 0.038 to 0.017, while preserving accuracy under matched preprocessing. These results indicate that treating the image and intrinsics as a coupled representation is essential for robust monocular metric depth estimation in practical preprocessing pipelines.

Keywords:

monocular depth estimation; metric depth; camera intrinsics; cropping; resizing; calibration robustness

1. Introduction

Monocular depth estimation (MDE) infers scene geometry from a single RGB image and is a key component in robotics navigation, 3D reconstruction, augmented reality, and view synthesis. Recent progress has been driven by transformer-based dense prediction and large-scale training, which improves cross-dataset generalization. MiDaS, a mixed-dataset, zero-shot, monocular depth framework, demonstrates strong transfer by learning robust relative-depth representations [1], while the Dense Prediction Transformer (DPT) introduces Vision Transformer backbones for dense prediction and improves global coherence in depth maps [2]. Depth Anything further leverages large-scale, unlabeled data to strengthen robustness on in-the-wild images and diverse domains [3].

However, many real applications require metric depth, not just relative ordering. Recovering depth in physical units generally depends on camera calibration, particularly the intrinsics matrix K (focal lengths and principal point). As a result, several recent approaches either condition prediction on camera information or explicitly model cross-camera variation. UniDepth includes a camera self-prompting mechanism for metric depth generalization across diverse cameras, Metric3D introduces canonical camera-space reasoning, and Depth Any Camera extends metric estimation to substantially different camera fields of view [4,5,6]. These methods substantially improve camera-aware metric prediction, but their main emphasis is cross-camera generalization, calibration-aware representation learning, or canonicalization. They do not explicitly isolate the implementation-level failure that occurs when an image is geometrically transformed by deployment preprocessing while the associated intrinsics remain stale.

Despite recent progress in camera-aware and metric monocular depth estimation, deployment pipelines still break geometric consistency when crop/resize operations are applied without synchronized intrinsics updates. In deployment, images are commonly resized to a fixed network resolution, center-cropped to match an aspect ratio, or padded to form square tensors. If K is not updated while the image is transformed, the model receives an inconsistent $(I, K)$ pair, which we refer to as crop-resize intrinsics mismatch. This mismatch is especially damaging for camera-aware metric-depth models, where K influences projection geometry, depth scaling, and camera-conditioned prompting.

The geometric update rules for resizing, cropping, and padding are well known in camera geometry. The contribution of this work is therefore not to claim these formulas as new but to connect them to a practical failure mode in monocular metric depth inference: stale intrinsics can persist after deployment preprocessing and systematically break the image–camera consistency assumed by modern, camera-aware models.

Our contributions are as follows.

We characterize preprocessing-induced intrinsics mismatch as a parameter-conditioned deterministic error on $(f_{x}, f_{y}, c_{x}, c_{y})$ and distinguish focal-length scaling errors from principal-point shift errors.
We provide a parametric affine model for crop, resize, and padding operations, making the update from the nominal intrinsics K to the effective intrinsics $\hat{K}$ explicit and reproducible.
We propose a controlled evaluation protocol that isolates resize-only, crop-only, and combined mismatch settings, enabling clean attribution of depth degradation to each mismatch component.
We show that MACM and the preprocessing consistency objective reduce the robustness gap under mismatched preprocessing while preserving matched-setting accuracy.

2. Related Works

2.1. Monocular Depth Estimation and Generalization

Monocular depth estimation has evolved from supervised encoder–decoder models toward transformer-based dense prediction systems and large-scale pretraining strategies. Recent supervised approaches have shown that architectural design and output parameterization remain crucial for improving local detail recovery and metric accuracy. In particular, AdaBins demonstrated that adaptive depth discretization can significantly improve scale precision and fine-grained prediction quality, while DPT showed that transformer-based dense prediction improves global coherence and long-range reasoning in depth maps [2,7].

A major shift in the field came from methods that prioritized cross-dataset transfer and broad-domain robustness rather than dataset-specific optimization. MiDaS showed that strong, zero-shot, monocular depth estimation can be obtained by training on heterogeneous datasets with a scale- and shift-invariant objective, thereby learning a robust relative-depth representation that transfers across diverse domains [1]. DPT further improved global depth coherence by replacing conventional convolutional backbones with Vision Transformers and a dense prediction decoder that aggregates multi-scale token features [2]. More recently, Depth Anything pushed this line further by scaling the data regime with large amounts of unlabeled imagery and pseudo-labeling, while robust evaluation studies under adverse conditions further highlighted the importance of distributional robustness in practical monocular depth estimation [3,8].

These general monocular depth models are highly relevant because they reveal how strongly modern depth estimators rely on large-scale priors, representation learning, and architectural inductive bias. At the same time, most of them are primarily optimized for relative depth or affine-invariant depth prediction rather than explicit geometric consistency under changing camera parameters. As a result, they provide strong baselines for generalization, but do not directly resolve the deployment-time inconsistency that arises when image preprocessing changes the effective camera geometry.

2.2. Metric Depth Estimation and Camera-Aware Modeling

While relative depth is sufficient for ranking scene structure or guiding some downstream perception modules, many robotics, navigation, and 3D reconstruction applications require depth in physical units. This has motivated a large body of work on monocular metric depth estimation, where the model must either use camera calibration explicitly or learn internal representations that remain sensitive to metric scale.

One line of recent research incorporates camera information or camera-related conditioning directly into the prediction process. UniDepth introduces a self-promptable camera module for metric depth estimation across diverse cameras, Depth Prompting studies sensor-agnostic conditioning for depth estimation, and Depth Any Camera extends zero-shot metric depth estimation to cameras with substantially different fields of view [4,6,9]. Depth Pro further highlights a complementary direction by producing sharp, zero-shot metric depth while estimating focal length from the image rather than requiring externally provided intrinsics [10]. This perspective is particularly important in cross-camera settings, where a model trained under one camera configuration may fail when applied to another.

A second line of work seeks to bridge robust relative-depth generalization and metric prediction. ZoeDepth is a notable example, combining strong relative-depth pretraining with lightweight metric heads and a metric bins mechanism so that the model preserves much of the zero-shot generalization of relative-depth methods while recovering metric scale [11]. This family is important because it highlights that metric prediction often benefits from modular add-ons rather than from complete architectural replacement.

A third line of work addresses camera variation and metric scale more explicitly. Metric3D argues that large-scale mixed-camera training requires resolving ambiguity induced by diverse camera models and proposes a canonical camera space transformation module [5]. ZeroDepth further studies zero-shot scale-aware monocular depth estimation across mixed domains and camera settings, while UniDepth moves toward universal metric depth estimation with a self-promptable camera module and a geometric invariance objective [4,12]. Taken together, these methods show that camera modeling is now central to strong metric depth performance, but they largely emphasize calibration-aware prediction, canonicalization, or camera-conditioned representation learning rather than the deterministic intrinsics inconsistency caused by crop/resize operations inside practical preprocessing pipelines.

Our work is most closely related to this camera-aware metric depth family. However, instead of proposing a new universal backbone or camera-prompting framework, we focus on a more fundamental but under-examined source of failure: the fact that standard deployment preprocessing can alter the effective intrinsics even when the nominal camera metadata remains unchanged. In this sense, our problem setting is complementary to existing metric depth approaches because even a strong camera-aware model can degrade if the image and its associated intrinsics become inconsistent after resizing, cropping, or padding.

2.3. Learning with Unknown or Noisy Calibration

Another relevant research direction considers settings in which intrinsics are unavailable, unreliable, or only partially specified. Self-supervised monocular depth learning from video addresses this challenge by coupling depth prediction with ego-motion estimation and using view synthesis as the supervisory signal. Recent methods such as ManyDepth and Lite-Mono improve this paradigm by strengthening temporal geometry usage and lightweight representation design, making self-supervised monocular depth estimation more stable and practical across deployment settings [13,14].

However, many self-supervised pipelines still assume known or approximately valid intrinsics when warping between frames. This means that calibration uncertainty is not completely removed; rather, it is partially absorbed into the photometric training objective or treated as a nuisance variable. Recent scale-aware and camera-aware studies such as ZeroDepth and Depth Any Camera move closer to handling this issue explicitly by modeling cross-camera variation and metric consistency under broader camera regimes [6,12]. Such approaches are highly relevant to real-world deployment because they acknowledge that camera metadata may be missing, stale, or inaccurate.

In this paper, we distinguish three practical uncertainty sources that can affect calibration-aware inference. The first is intrinsic noise, where the nominal focal length or principal point is inaccurate. The second is preprocessing metadata noise, where crop offsets, resize scales, or padding offsets are recorded imprecisely. The third is missing metadata, where the effective preprocessing transform cannot be reconstructed reliably. To define these quantities before using them in the formulation, let $m_{geo}$ denote the geometric preprocessing metadata, where $s_{x}$ and $s_{y}$ are resize scales, $o_{x}$ and $o_{y}$ are crop offsets, and $p_{x}$ and $p_{y}$ are padding offsets. The metadata vector and its observation noise are written as

$$\begin{aligned} m_{geo} &= [s_{x}, s_{y}, o_{x}, o_{y}, p_{x}, p_{y}]^{\top}, \\ \epsilon_{m} &= [\epsilon_{s_{x}}, \epsilon_{s_{y}}, \epsilon_{o_{x}}, \epsilon_{o_{y}}, \epsilon_{p_{x}}, \epsilon_{p_{y}}]^{\top}. \end{aligned} \tag{1}$$

Here, $\epsilon_{m}$ denotes recording error in the six preprocessing parameters. The affine preprocessing matrix parameterized by the metadata is denoted by $A(m_{geo})$, and K denotes the nominal intrinsic matrix. Intrinsic calibration noise in the focal lengths and principal point is modeled separately as

$$\Delta K = \begin{bmatrix} \delta f_{x} & 0 & \delta c_{x} \\ 0 & \delta f_{y} & \delta c_{y} \\ 0 & 0 & 0 \end{bmatrix}. \tag{2}$$

Using these definitions, the geometrically consistent effective intrinsics and the perturbed effective intrinsics are given by

$$\begin{aligned} K_{eff} &= A(m_{geo}) K, \\ \tilde{K}_{eff} &= A(m_{geo} + \epsilon_{m}) (K + \Delta K). \end{aligned} \tag{3}$$

The effective intrinsics error induced by calibration noise and preprocessing metadata noise is then

$$\begin{aligned} \Delta K_{eff} &= \tilde{K}_{eff} - K_{eff} \\ &\approx A(m_{geo}) \Delta K + \sum_{j=1}^{6} \epsilon_{m,j} \frac{\partial A(m_{geo})}{\partial m_{j}} K. \end{aligned} \tag{4}$$

Here, $K_{eff}$ denotes the geometrically consistent effective intrinsics after preprocessing, $\tilde{K}_{eff}$ denotes the perturbed effective intrinsics, and $\Delta K_{eff}$ denotes the resulting effective intrinsics error. The index j runs over the six components of $m_{geo}$, and the derivative term describes how each metadata perturbation changes the affine preprocessing matrix and therefore the effective camera model. Higher-order terms, including cross terms between metadata noise and intrinsic noise, are omitted under the small-noise assumption. When metadata are missing, $m_{geo}$ becomes partially unobserved and $A(m_{geo})$ cannot be deterministically reconstructed; this case is therefore treated as an uncertainty source rather than as an exact correction.

The relevance to depth estimation can be made explicit at the ray level. For a transformed pixel coordinate $\hat{u}$, the normalized camera-ray perturbation is defined as

$$\Delta r(\hat{u}) = \tilde{K}_{eff}^{-1} \hat{u} - K_{eff}^{-1} \hat{u}. \tag{5}$$

Equations (1)–(5) separate the definitions of preprocessing noise, intrinsic noise, effective intrinsics mapping, first-order propagation, and ray-level perturbation instead of compressing them into a single equation. This ray-level perturbation explains why focal-length uncertainty mainly appears as near-global metric scale bias, while principal-point, crop-offset, and padding-offset uncertainty tends to produce spatially structured depth distortion. Therefore, the uncertainty model is not only a descriptive calibration model; it serves as an analytical bridge between preprocessing uncertainty, image–intrinsics inconsistency, and depth prediction degradation.
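The propagation in Equations (3)–(5) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the intrinsics, metadata, and noise values are hypothetical, the helper `affine_from_metadata` is an illustrative name, and the affine form it builds is the one given later in Equation (9). Because that affine matrix is linear in the metadata, the only term the first-order expression drops is the cross term between metadata noise and intrinsic noise.

```python
import numpy as np

def affine_from_metadata(m):
    """A(m_geo) for m = [s_x, s_y, o_x, o_y, p_x, p_y], using the form of Eq. (9)."""
    sx, sy, ox, oy, px, py = m
    return np.array([[sx, 0.0, px - ox],
                     [0.0, sy, py - oy],
                     [0.0, 0.0, 1.0]])

# Hypothetical nominal intrinsics, metadata, and small noise terms.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
m_geo = np.array([0.5, 0.5, 80.0, 0.0, 0.0, 0.0])
eps_m = np.array([0.002, 0.002, 1.0, -1.0, 0.0, 0.0])
dK = np.array([[5.0, 0.0, 2.0],
               [0.0, 5.0, -2.0],
               [0.0, 0.0, 0.0]])

A = affine_from_metadata(m_geo)
A_noisy = affine_from_metadata(m_geo + eps_m)
K_eff = A @ K                          # Eq. (3), consistent intrinsics
K_eff_tilde = A_noisy @ (K + dK)       # Eq. (3), perturbed intrinsics
exact_error = K_eff_tilde - K_eff

# First-order term of Eq. (4). A is linear in m, so dA/dm_j = A(e_j) - A(0).
A_zero = affine_from_metadata(np.zeros(6))
first_order = A @ dK
for j in range(6):
    dA_dmj = affine_from_metadata(np.eye(6)[j]) - A_zero
    first_order = first_order + eps_m[j] * (dA_dmj @ K)

# What the approximation omits: the metadata-noise x intrinsic-noise cross term.
remainder = exact_error - first_order

# Ray-level perturbation of Eq. (5) at one transformed pixel.
u_hat = np.array([300.0, 200.0, 1.0])
delta_r = np.linalg.solve(K_eff_tilde, u_hat) - np.linalg.solve(K_eff, u_hat)
print(remainder)
print(delta_r)
```

In this sketch the remainder is small relative to the first-order term, and the third component of `delta_r` vanishes because both intrinsics matrices keep the last row `[0, 0, 1]`.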

Nevertheless, there is a conceptual difference between unknown calibration learning and the problem studied in this paper. In unknown camera settings, the main challenge is to estimate or marginalize over calibration uncertainty. In contrast, our setting often starts from a nominally known camera but encounters a deterministic mismatch after preprocessing. That is, the image is transformed by crop/resize/padding operations, but the intrinsics fed to the model are not transformed accordingly. This failure mode is not fundamentally a lack of calibration; rather, it is a loss of consistency between the image actually seen by the network and the camera model assumed by the inference code. In real-time applications, these two sources of error can coexist: the nominal calibration may be noisy, and the preprocessing metadata may be incomplete or unavailable. We therefore treat full unknown calibration learning as a broader uncertainty problem, while isolating the parameter-conditioned preprocessing mismatch as the primary mechanism analyzed in this paper.

2.4. Calibration Robustness, Camera Geometry, and Deployment Consistency

The broader literature on camera calibration and geometric vision continues to emphasize that projection accuracy depends on correctly modeling the intrinsic matrix and the image formation process. Recent learning-based calibration works, including transformer-based single-image calibration and wide-angle calibration reviews, reinforce that camera parameters remain fundamental geometric quantities rather than secondary metadata [15]. For a pixel $u = [u, v, 1]^{\top}$, the corresponding normalized camera ray is determined by $r = K^{-1} u$. Therefore, if preprocessing changes the pixel coordinate system but the inference code still uses stale intrinsics, the same image location is interpreted as a different camera ray. More explicitly, if an affine preprocessing transform maps a homogeneous pixel coordinate to $\hat{u} = A u$, deployment consistency requires the paired intrinsics to be updated as $\hat{K} = A K$. In the geometrically consistent case, $\hat{K}^{-1} \hat{u} = (A K)^{-1} A u = K^{-1} u$, so the same camera ray is preserved. If stale intrinsics $K_{in}$ are used instead, $K_{in}^{-1} \hat{u}$ generally differs from the correct ray, producing a ray-level interpretation error. From a geometric standpoint, resizing, cropping, and padding are not mere cosmetic preprocessing operations; they redefine the pixel coordinate system on which the camera model is expressed. This point is often obscured in modern deep learning pipelines where preprocessing is implemented as a data transform rather than as part of the camera model.
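The ray-preservation argument can be illustrated with a small numerical check. This is a sketch under assumed values: the intrinsics, the pixel, and the transform (halving the resolution, then removing 80 pixels on the left) are hypothetical, and the stale intrinsics are taken to be the original K.

```python
import numpy as np

# Hypothetical nominal intrinsics of a 1280x720 camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

# Hypothetical preprocessing: scale by 0.5, then crop 80 px from the left.
A = np.array([[0.5, 0.0, -80.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

u = np.array([900.0, 500.0, 1.0])   # homogeneous pixel in the original image
u_hat = A @ u                        # the same scene point after preprocessing
K_hat = A @ K                        # synchronized intrinsics

ray_true = np.linalg.solve(K, u)             # K^{-1} u
ray_matched = np.linalg.solve(K_hat, u_hat)  # K_hat^{-1} u_hat, same ray
ray_stale = np.linalg.solve(K, u_hat)        # K_in^{-1} u_hat with K_in = K

print(ray_true, ray_matched, ray_stale)
```

With these numbers the matched pairing reproduces the original ray, approximately `[0.26, 0.14, 1]`, while the stale pairing yields approximately `[-0.27, -0.11, 1]`, which points to the opposite side of the optical axis.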

This formulation provides the geometric background for this study. Deployment preprocessing should be regarded as a coordinate-system transformation coupled with the camera model, not as an isolated image operation. If the image is transformed by $A(m_{geo})$ while the camera model remains $K_{in}$, the model receives a physically inconsistent pair because the transformed pixel grid and the intrinsics no longer describe the same ray geometry. The central problem of this paper is therefore not that resizing or cropping is geometrically unknown but that practical inference code may fail to synchronize this known transformation with the intrinsics used by the depth estimator.

This gap between geometric correctness and engineering practice becomes especially consequential in monocular depth estimation. Camera-aware metric-depth models may explicitly consume intrinsics, while calibration-agnostic or relative-depth models may still learn implicit camera priors tied to training-time image statistics, field-of-view distributions, or common crop conventions. As a result, even methods that do not take K as explicit input can be affected by systematic preprocessing changes, although the effect is typically strongest and most interpretable in metric-depth models that rely on camera consistency more directly.

Viewed from this perspective, our work complements both the camera-aware metric depth estimation and the self-calibration literature. Existing methods address camera sensitivity through conditioning, canonicalization, or implicit calibration recovery. By contrast, we isolate a deterministic and implementation-driven source of inconsistency that can arise even when the original camera is known. Our goal is therefore not to replace prior camera-aware depth models but to provide a principled analysis framework for understanding how preprocessing alters effective intrinsics and why this matters for both performance and robustness.

3. Proposed Methods

This section presents the proposed analysis and mitigation framework for crop–resize geometry mismatch in monocular depth estimation. We first formalize how standard image preprocessing operations deterministically modify the effective camera intrinsics. Based on this formulation, we then define a controlled evaluation protocol to isolate the impact of image–intrinsics inconsistency from other confounding factors. Finally, we introduce a lightweight Mismatch-Aware Camera Module (MACM) and a preprocessing consistency objective designed to improve robustness while preserving the standard single-image inference setting. Figure 1 provides the camera model overview used in this formulation and Figure 2 illustrates the crop-induced coordinate shift that motivates the principal-point update.

3.1. Problem Setup and Deterministic Intrinsics Mapping

We denote an RGB image by $I \in \mathbb{R}^{H \times W \times 3}$ and the corresponding camera intrinsics by

$$K = \begin{bmatrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{bmatrix}, \tag{6}$$

where $f_{x}$ and $f_{y}$ are the focal lengths in pixel units and $(c_{x}, c_{y})$ denotes the principal point.

Throughout this section, I denotes the original image, $\hat{I}$ denotes the image after preprocessing, K denotes the nominal intrinsics before preprocessing, $\hat{K}$ denotes the correctly updated intrinsics in the preprocessed image coordinate system, and $K_{in}$ denotes stale or unupdated intrinsics used by a mismatched inference pipeline. The affine preprocessing matrix A is parameterized by resize scales $(s_{x}, s_{y})$, crop offsets $(o_{x}, o_{y})$, and padding offsets $(p_{x}, p_{y})$. These symbols are introduced here so that each subsequent inline equation can be interpreted without referring to later definitions.

In practical monocular depth estimation pipelines, the raw input image is rarely fed directly to the network. Instead, it is commonly resized, center-cropped, randomly cropped, or padded to meet the input requirements of the model. Although these operations are often implemented as image-domain preprocessing steps, they also modify the pixel coordinate system in which the camera intrinsics are defined. The scope of this paper is therefore limited to preprocessing-induced intrinsics mismatch, rather than extrinsic calibration errors, temporal scene dynamics, or the full unknown camera problem.

For resizing from $(W, H)$ to $(W_{r}, H_{r})$, let the scale factors be $s_{x} = W_{r} / W$ and $s_{y} = H_{r} / H$. The intrinsics after resizing are

$$K_{resize} = \begin{bmatrix} s_{x} f_{x} & 0 & s_{x} c_{x} \\ 0 & s_{y} f_{y} & s_{y} c_{y} \\ 0 & 0 & 1 \end{bmatrix}. \tag{7}$$

Thus, resizing changes the pixel-space focal lengths and the principal point in proportion to the horizontal and vertical scale factors. If the resized image is interpreted with stale intrinsics, the dominant effect is an error in focal-length scaling, which tends to appear as a near-global depth scale bias.

Next, consider cropping with offsets $(o_{x}, o_{y})$ measured from the top-left corner in the resized image coordinates. Cropping preserves focal lengths but shifts the principal point relative to the new image origin:

$$\hat{c}_{x} = s_{x} c_{x} - o_{x}, \quad \hat{c}_{y} = s_{y} c_{y} - o_{y}. \tag{8}$$

Because cropping changes the origin of the image coordinate system without changing the pixel pitch, its main geometric effect is a principal-point displacement. This type of mismatch is therefore expected to produce spatially structured distortion around an incorrect optical origin rather than a simple global scale shift. Figure 2 illustrates this effect using a center-crop example, where the image origin changes while the focal length remains unchanged.
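As a worked illustration of Equations (7) and (8), consider the following assumed values, which are chosen for convenience and are not taken from the experiments: a $1280 \times 720$ image with $f_{x} = f_{y} = 1000$ and $(c_{x}, c_{y}) = (640, 360)$ is resized to $640 \times 360$ and then center-cropped to $480 \times 360$.

```latex
\begin{aligned}
s_{x} &= \tfrac{640}{1280} = 0.5, \qquad s_{y} = \tfrac{360}{720} = 0.5, \\
K_{resize} &= \begin{bmatrix} 500 & 0 & 320 \\ 0 & 500 & 180 \\ 0 & 0 & 1 \end{bmatrix}, \\
(o_{x}, o_{y}) &= \left( \tfrac{640 - 480}{2},\ 0 \right) = (80, 0), \\
(\hat{c}_{x}, \hat{c}_{y}) &= (320 - 80,\ 180 - 0) = (240, 180).
\end{aligned}
```

The focal lengths stay at 500 after the crop, while the principal point moves to the center of the $480 \times 360$ crop. A pipeline that kept the original values would instead assume a focal length of 1000 and a principal point of $(640, 360)$, which lies outside the cropped image.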

Resize, crop, and padding can be unified in an affine formulation. If padding offsets $(p_{x}, p_{y})$ are applied after resizing and cropping, the preprocessing transform acting on homogeneous pixel coordinates can be written as

$$A = \begin{bmatrix} s_{x} & 0 & p_{x} - o_{x} \\ 0 & s_{y} & p_{y} - o_{y} \\ 0 & 0 & 1 \end{bmatrix}, \quad \hat{K} = A K. \tag{9}$$

Here, $(o_{x}, o_{y})$ are defined in the resized coordinate system. The matrix A explicitly accounts for scale changes, crop-induced origin shifts, and padding-induced coordinate shifts. This parametric affine model provides a reproducible mathematical representation of crop–resize intrinsics mismatch. In addition, MACM can be viewed as a learnable, non-parametric complement that uses metadata and image cues to compensate for residual effects that may not be fully captured by direct symbolic correction.
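The affine update in Equation (9) can be sketched in a few lines of NumPy. The function names `preprocessing_affine` and `update_intrinsics` and all numeric values are illustrative assumptions; the sketch composes A as padding after cropping after resizing, which matches the ordering stated above.

```python
import numpy as np

def preprocessing_affine(sx, sy, ox=0.0, oy=0.0, px=0.0, py=0.0):
    """Return A of Eq. (9), composed as pad @ crop @ resize.

    Crop offsets (ox, oy) are expressed in resized-image pixels.
    """
    resize = np.diag([sx, sy, 1.0])
    crop = np.array([[1.0, 0.0, -ox],
                     [0.0, 1.0, -oy],
                     [0.0, 0.0, 1.0]])
    pad = np.array([[1.0, 0.0, px],
                    [0.0, 1.0, py],
                    [0.0, 0.0, 1.0]])
    return pad @ crop @ resize

def update_intrinsics(K, **params):
    """Effective intrinsics K_hat = A K for the given preprocessing parameters."""
    return preprocessing_affine(**params) @ K

# Hypothetical 1280x720 camera, resized to 640x360, center-cropped to 480x360,
# then padded by 60 px at the top to form a 480x480 input.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
K_hat = update_intrinsics(K, sx=0.5, sy=0.5, ox=80.0, oy=0.0, px=0.0, py=60.0)
print(K_hat)
```

For these assumed parameters the updated focal lengths are 500 and the principal point lands at (240, 240), the center of the padded 480 by 480 input.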

The affine mapping above is also the link between the known camera geometry formula and the contribution of this paper. It provides the controlled variable used in our matched/mismatched protocol and defines the metadata that MACM later receives. For a transformed coordinate $\hat{u} = A u$, the correct pairing preserves ray geometry through $\hat{K}^{-1} \hat{u} = K^{-1} u$, whereas a mismatched pairing produces $K_{in}^{-1} \hat{u} \neq K^{-1} u$. Thus, the issue is not the image transform alone, but the loss of synchronization between the transformed image and the camera model used at inference time.

The main problem investigated in this paper arises when the transformed image $\hat{I}$ is paired not with the updated intrinsics $\hat{K}$ but with stale or uncorrected intrinsics $K_{in}$. In that case, the model is forced to interpret an image generated under one pixel geometry using another inconsistent camera definition.

For consistency with the uncertainty formulation in Section 2.3, we also denote the correctly updated intrinsics by $K_{eff} \equiv \hat{K} = A(m) K$. This notation makes the matched and mismatched inference conditions explicit. Let $f_{\theta}$ be a monocular metric depth estimator and let $D^{*}$ be the ground-truth depth. The matched condition is

$$\hat{D}_{match} = f_{\theta}(\hat{I}, K_{eff}), \quad K_{eff} = A(m) K, \tag{10}$$

whereas the mismatched condition is

$$\hat{D}_{mis} = f_{\theta}(\hat{I}, K_{in}), \quad K_{in} \neq K_{eff}. \tag{11}$$

Because $\hat{I}$, $f_{\theta}$, and the evaluation samples are fixed, the performance difference between these two predictions isolates the effect of image–intrinsics inconsistency. For an error metric $M$ such as Abs.Rel, the corresponding robustness gap is written as

$$G_{M} = M(\hat{D}_{mis}, D^{*}) - M(\hat{D}_{match}, D^{*}). \tag{12}$$

A smaller $G_{M}$ indicates stronger robustness to preprocessing-induced camera inconsistency. This formulation directly links the background geometry to the controlled evaluation protocol and the mitigation role of MACM.

3.2. Controlled Evaluation Protocol

To isolate the effect of geometry mismatch from model-specific and dataset-specific confounders, we design a controlled evaluation protocol in which only the consistency between image preprocessing and camera intrinsics is varied. For each image, we generate a transformed image $\hat{I}$ using a predefined preprocessing operation and then compare two inference conditions. In the matched condition, the transformed image is paired with the correctly updated intrinsics $\hat{K}$. In the mismatched condition, the same transformed image is instead paired with stale or unupdated intrinsics $K_{in}$.

The failure modes are obtained through the following controlled procedure. First, a preprocessing transform is applied to the original image. Second, the effective intrinsics $\hat{K}$ are computed using the same transform. Third, the transformed image is evaluated twice: once with $\hat{K}$ and once with $K_{in}$. Since the transformed image, model, and evaluation data are fixed, the observed difference isolates the effect of image–intrinsics inconsistency.

We consider three representative mismatch settings that frequently arise in practical preprocessing pipelines: (i) resize-only mismatch, (ii) crop-only mismatch, and (iii) combined crop+resize mismatch. These settings are chosen because they correspond to common deployment patterns such as resizing raw images to a fixed network resolution, center-cropping to match aspect ratios, or applying both operations sequentially. The resize-only condition mainly perturbs focal-length scaling and tends to induce a near-global scale bias in predicted depth. The crop-only condition primarily shifts the principal point and often produces spatially structured distortion associated with an incorrect image origin. The combined condition is the most realistic one in practice, since many pipelines simultaneously alter both focal scaling and principal-point alignment.

To analyze the sensitivity of depth estimation models to the degree of inconsistency, we sweep the mismatch magnitude by varying crop ratios and resize factors over controlled ranges. This setup allows us to examine not only whether performance degrades under mismatch but also how the degradation pattern changes depending on the type and severity of the geometric inconsistency. Because all other experimental conditions are kept fixed, the resulting comparison directly reflects the effect of image–intrinsics mismatch rather than unrelated differences in optimization or architecture.
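The matched/mismatched comparison and the magnitude sweep can be sketched with a toy stand-in for the depth network. Everything here is an assumption made for illustration: `toy_depth_estimator` is a hypothetical camera-aware estimator that recovers depth from the apparent pixel height of an object of known size, the depths and scale factors are arbitrary, and a trained network will not follow this closed form. The sketch only shows how the resize-only protocol isolates the focal-length error.

```python
import numpy as np

def toy_depth_estimator(pixel_heights, K_in, object_height_m=1.5):
    """Hypothetical camera-aware estimator: depth = f_y * H_obj / h_pixels."""
    return K_in[1, 1] * object_height_m / pixel_heights

def abs_rel(pred, gt):
    """Mean absolute relative error."""
    return float(np.mean(np.abs(pred - gt) / gt))

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
gt_depth = np.array([2.0, 5.0, 10.0, 20.0])   # metres, hypothetical scene
heights = K[1, 1] * 1.5 / gt_depth            # object heights in the original image

gaps = {}
for s in (1.0, 0.8, 0.5):                     # sweep of resize factors
    A = np.diag([s, s, 1.0])                  # resize-only transform
    heights_hat = s * heights                 # observation in the resized image
    K_hat = A @ K                             # matched intrinsics
    d_match = toy_depth_estimator(heights_hat, K_hat)
    d_mis = toy_depth_estimator(heights_hat, K)   # stale intrinsics K_in = K
    gaps[s] = abs_rel(d_mis, gt_depth) - abs_rel(d_match, gt_depth)

print(gaps)
```

In this toy setting the mismatched prediction equals the true depth divided by the resize factor, so the Abs.Rel gap is zero at a factor of 1.0, 0.25 at 0.8, and 1.0 at 0.5, which is the near-global scale bias described above in its simplest form.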

We report Abs.Rel, RMSE, and threshold accuracy as separate standard depth metrics. They are defined as

$$\text{Abs.Rel} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{d}_{i} - d_{i}|}{d_{i}}, \tag{13}$$

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{d}_{i} - d_{i})^{2}}, \tag{14}$$

$$\delta_{t} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \max\left( \frac{\hat{d}_{i}}{d_{i}}, \frac{d_{i}}{\hat{d}_{i}} \right) < 1.25^{t} \right), \quad t \in \{1, 2, 3\}. \tag{15}$$

Here, $\hat{d}_{i}$ and $d_{i}$ denote the predicted and ground-truth depths at pixel i, and N is the number of evaluated pixels. The constant $1.25^{t}$ follows the standard threshold-accuracy protocol widely used in monocular depth estimation. By contrast, crop ratios, resize scales, and the interpolation coefficient used later are not physical constants; they are controlled experimental parameters used to vary mismatch magnitude. Since our evaluation is single-image and static, these thresholds are not designed to model time-varying dynamics. Real-time video settings may introduce additional motion, sensor, or metadata uncertainties, which are discussed as limitations.
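The three metrics in Equations (13)–(15) can be computed as in the following sketch. The function name `depth_metrics`, the dictionary keys, and the validity mask (ground-truth depth greater than zero) are assumptions made for illustration; the paper's own masking and depth-range conventions are not specified in this section. Predictions are assumed to be strictly positive.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Abs.Rel, RMSE, and threshold accuracies delta_1..delta_3 over valid pixels."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    valid = gt > 0                      # assumption: non-positive depth marks invalid pixels
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)

    out = {"abs_rel": float(abs_rel), "rmse": float(rmse)}
    for t in (1, 2, 3):
        out[f"delta_{t}"] = float(np.mean(ratio < 1.25 ** t))
    return out

# Hypothetical predictions; the last pixel has no ground truth and is masked out.
gt = np.array([2.0, 4.0, 8.0, 0.0])
pred = np.array([2.2, 4.0, 12.0, 1.0])
metrics = depth_metrics(pred, gt)
print(metrics)
```

For this small example the relative errors on the three valid pixels are 0.1, 0, and 0.5, so Abs.Rel is 0.2; two of the three ratios fall below 1.25, and all three fall below 1.25 squared.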

3.3. Proposed Mismatch-Aware Camera Module

To improve robustness against preprocessing-induced geometry inconsistency, we propose a lightweight Mismatch-Aware Camera Module (MACM). The key motivation is that the preprocessing pipeline already contains deterministic information about how the input image was geometrically transformed before entering the model. However, this information is typically discarded after the transformed image is produced. MACM explicitly reuses this metadata and conditions the feature representation on it, allowing the network to adapt its internal processing according to the effective camera geometry implied by the preprocessing.

MACM should not be interpreted as a fundamentally new feature-wise modulation operator. The affine feature adaptation below is structurally related to FiLM-like conditioning. The novelty of MACM lies in how the conditioning signal is constructed: it explicitly combines geometry-defined preprocessing metadata with image-conditioned camera cues to target crop–resize intrinsics mismatch in monocular metric depth estimation. Accordingly, the method is positioned as a mismatch-aware adapter rather than as a replacement for existing feature modulation mechanisms or camera-aware backbones. Its role is to expose preprocessing-induced geometry changes to the host depth estimator in a lightweight and reproducible form.

Let the external input be a single RGB image $I$. After preprocessing, the model receives the transformed image $\tilde{I}$. At the same time, the preprocessing pipeline yields a compact metadata vector

$$m = [o_{x}, o_{y}, s_{x}, s_{y}, p_{x}, p_{y}, W, H, W_{r}, H_{r}]. \tag{16}$$

Here, $(o_{x}, o_{y})$ denote crop offsets, $(s_{x}, s_{y})$ denote resize factors, $(p_{x}, p_{y})$ denote padding offsets, and $(W, H)$ and $(W_{r}, H_{r})$ represent the original and resized image dimensions, respectively. Together, these variables summarize how the pixel coordinate system has changed from the original image to the actual network input. The components have different geometric roles: $s_{x}$ and $s_{y}$ describe focal-length scaling, $o_{x}$ and $o_{y}$ describe crop-induced origin shifts, $p_{x}$ and $p_{y}$ describe padding-induced coordinate shifts, and image dimensions support coordinate normalization.
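As an illustration of Equation (16), the metadata can be kept in a small record that is converted to the 10-dimensional vector on demand. The class name `PreprocMeta`, the normalization scheme, and the reference length `dim_ref` are all hypothetical; the text states only that the vector is normalized, not how.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocMeta:
    ox: float  # crop offset along x, in original-image pixels
    oy: float  # crop offset along y
    sx: float  # resize factor along x
    sy: float  # resize factor along y
    px: float  # padding offset along x, in resized-image pixels
    py: float  # padding offset along y
    W: int     # original width
    H: int     # original height
    Wr: int    # resized width
    Hr: int    # resized height

    def to_vector(self, dim_ref=1024.0):
        """Return the 10-d vector of Eq. (16) under an assumed normalization."""
        return [
            self.ox / self.W, self.oy / self.H,    # crop offsets as fractions of the original size
            self.sx, self.sy,                      # resize factors are already dimensionless
            self.px / self.Wr, self.py / self.Hr,  # padding offsets as fractions of the resized size
            self.W / dim_ref, self.H / dim_ref,    # dimensions scaled by an assumed reference length
            self.Wr / dim_ref, self.Hr / dim_ref,
        ]

if __name__ == "__main__":
    meta = PreprocMeta(ox=64, oy=32, sx=0.5, sy=0.5, px=0, py=16, W=640, H=480, Wr=256, Hr=208)
    print(meta.to_vector())
```

Keeping the raw values in the record and normalizing only at conversion time means the same record can also drive an intrinsics update, which needs the unnormalized offsets and scales.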

Given the preprocessed image $\tilde{I}$, an encoder extracts intermediate feature maps $F = E(\tilde{I})$. MACM then computes two complementary representations. The first, $c_{meta}$, is a geometry-aware descriptor obtained by encoding the preprocessing metadata. The second, $c_{img}$, is an image-conditioned descriptor derived from pooled visual features. Formally, the module computes

$$c_{meta} = \phi_{meta}(m), \quad c_{img} = \phi_{img}(\mathrm{Pool}(F)), \quad c = \phi_{fuse}([c_{meta}; c_{img}]). \tag{17}$$

Here, $m$ denotes the preprocessing metadata vector, $F$ denotes the intermediate feature map extracted from the preprocessed image, and $\phi_{meta}(\cdot)$, $\phi_{img}(\cdot)$, and $\phi_{fuse}(\cdot)$ denote the metadata encoder, image feature encoder, and fusion module, respectively. In our implementation, $\phi_{meta}$ is a lightweight multilayer perceptron (MLP) applied to the normalized metadata vector, $\phi_{img}$ is a projection MLP applied after global pooling of intermediate features, and $\phi_{fuse}$ is a small fusion MLP applied after concatenation. The fused token is then passed to two linear heads that generate the channel-wise scale and bias terms used for feature modulation. This design keeps the module shallow and independent of a specific backbone architecture.

For reproducibility, we specify the MACM configuration used in the controlled experiments. The metadata input is a 10-dimensional normalized vector. The metadata encoder $\phi_{meta}$ uses a two-layer MLP with dimensions $10 \to 64 \to 128$, GELU activation, and layer normalization after the hidden layer. The image branch applies global average pooling to $F \in \mathbb{R}^{C \times h \times w}$ and projects the pooled feature to a 128-dimensional token through a linear projection followed by GELU. The fusion module concatenates $c_{meta}$ and $c_{img}$ and uses an MLP with dimensions $256 \to 128 \to 128$. The modulation heads $\gamma(\cdot)$ and $\beta(\cdot)$ are separate linear layers of size $128 \to C$, producing channel-wise scale and bias parameters for the host feature map.

The metadata descriptor provides an explicit summary of how preprocessing altered the input geometry, whereas the image-based descriptor captures appearance-dependent context that may help interpret the practical effect of that transformation. Their fusion produces a compact camera-aware token, c.

This token is used to modulate the intermediate depth features through feature-wise affine adaptation:

$$F' = \gamma(c) \odot F + \beta(c). \tag{18}$$

Here, $F$ denotes the intermediate feature map before adaptation, $c$ denotes the fused camera-aware token defined above, $F'$ denotes the adapted feature map after modulation, and $\odot$ denotes element-wise multiplication. The functions $\gamma(\cdot)$ and $\beta(\cdot)$ generate channel-wise scaling and bias terms, respectively, which are broadcast over the spatial positions of $F$. The adapted feature map $F'$ is then passed to the original depth decoder to produce the final depth prediction $\hat{D}$.
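The forward pass described by Equations (17) and (18) can be sketched with NumPy using the layer sizes stated above. This is a minimal sketch with randomly initialized weights, not the trained module: the helper names (`Linear`, `MACM`), the tanh approximation of GELU, the GELU-then-LayerNorm ordering in the metadata branch, the GELU inside the fusion MLP, and the omission of learned LayerNorm parameters are all assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed; the exact variant is not specified)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    # normalization over the last axis, without learned scale and shift
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

class Linear:
    def __init__(self, rng, n_in, n_out):
        self.w = rng.normal(0.0, n_in ** -0.5, size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        return x @ self.w + self.b

class MACM:
    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        self.meta1, self.meta2 = Linear(rng, 10, 64), Linear(rng, 64, 128)      # 10 -> 64 -> 128
        self.img = Linear(rng, channels, 128)                                   # C -> 128
        self.fuse1, self.fuse2 = Linear(rng, 256, 128), Linear(rng, 128, 128)   # 256 -> 128 -> 128
        self.gamma, self.beta = Linear(rng, 128, channels), Linear(rng, 128, channels)

    def __call__(self, feat, m):
        # feat: (C, h, w) feature map F; m: (10,) normalized metadata vector
        c_meta = self.meta2(layer_norm(gelu(self.meta1(m))))                    # phi_meta(m)
        c_img = gelu(self.img(feat.mean(axis=(1, 2))))                          # phi_img(Pool(F))
        c = self.fuse2(gelu(self.fuse1(np.concatenate([c_meta, c_img]))))       # phi_fuse
        gamma, beta = self.gamma(c), self.beta(c)                               # channel-wise terms
        return gamma[:, None, None] * feat + beta[:, None, None]                # Eq. (18)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feat = rng.normal(size=(32, 6, 8))
    adapted = MACM(channels=32)(feat, rng.uniform(size=10))
    print(adapted.shape)
```

The output has the same shape as the input feature map, which is what allows the adapted features to be handed to an unmodified decoder.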

An important characteristic of MACM is that it does not require any additional sensor stream, multi-frame input, or external calibration refinement at inference time. The module uses information that is available in controlled preprocessing code, making it practical to integrate into existing monocular depth estimation systems. When metadata are missing in third-party or legacy pipelines, MACM requires either preprocessing logging or approximate metadata estimation; this assumption and its implications are discussed in the limitations. In this sense, MACM is not intended to replace the host model’s original camera handling mechanism; rather, it complements it by exposing preprocessing-induced geometric variation to the network in an explicit and learnable form.

The computational overhead of MACM is also limited. Since the metadata branch processes a low-dimensional vector and the image branch operates on globally pooled features, the additional cost is mainly a few MLP layers and channel-wise affine modulation. This is small compared with the encoder–decoder backbone. The consistency loss introduced below increases training-time cost because it uses two transformed views and inverse warping, but it is not required during single-image inference.
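A back-of-the-envelope parameter count for the configuration given above makes the overhead concrete. The sketch assumes that every linear layer has a bias and that layer normalization carries a learned scale and shift; neither detail is stated in the text, and the channel width `C = 256` used in the example is hypothetical, so the result is indicative only and need not match the reported overhead table.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def macm_param_count(channels):
    meta = linear_params(10, 64) + 2 * 64 + linear_params(64, 128)   # MLP plus LayerNorm scale/shift
    img = linear_params(channels, 128)                               # pooled-feature projection
    fuse = linear_params(256, 128) + linear_params(128, 128)         # fusion MLP
    heads = 2 * linear_params(128, channels)                         # gamma and beta heads
    return meta + img + fuse + heads

if __name__ == "__main__":
    print(macm_param_count(256))  # 157504 under the stated assumptions
```

Only the image projection and the two heads scale with the channel width, so the count grows linearly in `C`.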

3.4. Integration with Existing Depth Models

MACM is designed as a lightweight plug-in adapter rather than a model-specific architectural redesign. The goal of this design is to enable broad applicability across different classes of monocular depth estimators while minimizing disruption to their original structure. In practice, we insert MACM between the encoder output and the depth prediction head, or at an equivalent intermediate stage in transformer-based backbones.

This placement is motivated by the fact that intermediate features retain sufficient spatial and semantic information to support depth prediction, while still being flexible enough to be modulated by camera-related conditioning. By adapting the feature distribution before decoding, MACM allows the downstream prediction head to operate on representations that are more consistent with the effective geometry of the transformed image. As a result, the host architecture can preserve most of its original parameters and design choices while gaining robustness to preprocessing-induced camera variation.

Because of this plug-in formulation, MACM can be attached to multiple model families. In explicit-camera models, the module can reduce discrepancies between the transformed image and the camera assumptions used internally by the network. In hybrid metric-depth models, it can provide an additional conditioning signal that improves camera consistency under changing preprocessing configurations. Even in calibration-agnostic or weakly camera-aware models, the module may still be beneficial because such models often encode implicit priors tied to training-time field-of-view distributions, crop conventions, or image statistics. This model-agnostic design makes MACM suitable for controlled comparison across different monocular depth estimation settings.

3.5. Training Objectives

The training objective combines the original depth supervision of the host model with an additional preprocessing consistency constraint. The primary depth term, denoted by $L_{depth}$, follows the default training objective of the backbone model and may consist of regression, scale-aware, or ranking-based depth losses depending on the specific architecture. This choice preserves compatibility with the host estimator and avoids introducing a training objective tailored to only one model family.

To explicitly encourage robustness against preprocessing variation, we additionally introduce a multi-view preprocessing consistency loss. For a given image, two different preprocessing configurations, $T_{a}$ and $T_{b}$, are applied, yielding two transformed inputs and their corresponding depth predictions $\hat{D}_{a}$ and $\hat{D}_{b}$. Since both predictions originate from the same underlying scene, they should become mutually consistent once mapped back to a common coordinate frame. We therefore define

$$L_{cons} = \left\| W(\hat{D}_{a}, T_{a}^{-1}) - W(\hat{D}_{b}, T_{b}^{-1}) \right\|_{1}. \tag{19}$$

Here, $\hat{D}_{a}$ and $\hat{D}_{b}$ denote the depth predictions obtained under two different preprocessing configurations $T_{a}$ and $T_{b}$, respectively. The operators $T_{a}^{-1}$ and $T_{b}^{-1}$ denote the corresponding inverse geometric transforms that map each prediction back to a shared reference frame, $W(\cdot, \cdot)$ denotes the inverse warping operation, and $\| \cdot \|_{1}$ denotes the $L_{1}$ norm. Accordingly, $L_{cons}$ measures the discrepancy between the two aligned depth predictions in the common coordinate system. Because this term compares two transformed views, it introduces additional training-time computation through an extra forward pass and inverse warping, but it does not alter the inference-time input–output interface.
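A simplified instance of Equation (19) is sketched below for the special case of crops followed by integer nearest-neighbour upsampling, where the inverse warp is exact. The function names, the transform tuple `(ox, oy, cw, ch, s)`, the use of NaN to mark pixels outside a crop, and the averaging of the $L_{1}$ difference over the overlap region are assumptions for illustration; a practical implementation would use differentiable bilinear resampling inside the training framework.

```python
import numpy as np

def apply_transform(depth, ox, oy, cw, ch, s):
    """Crop a (H, W) map at (ox, oy) with size (cw, ch), then upsample by integer factor s."""
    crop = depth[oy:oy + ch, ox:ox + cw]
    return np.repeat(np.repeat(crop, s, axis=0), s, axis=1)

def inverse_warp(pred, ox, oy, cw, ch, s, ref_shape):
    """W(pred, T^{-1}): place a prediction back in the reference frame; NaN marks uncovered pixels."""
    out = np.full(ref_shape, np.nan)
    out[oy:oy + ch, ox:ox + cw] = pred[::s, ::s]   # undo the upsampling, then undo the crop
    return out

def consistency_loss(pred_a, t_a, pred_b, t_b, ref_shape):
    """Mean absolute difference of the two aligned predictions over their common support."""
    wa = inverse_warp(pred_a, *t_a, ref_shape)
    wb = inverse_warp(pred_b, *t_b, ref_shape)
    valid = ~np.isnan(wa) & ~np.isnan(wb)
    return float(np.abs(wa[valid] - wb[valid]).mean())

if __name__ == "__main__":
    ref = np.arange(64, dtype=float).reshape(8, 8) + 1.0   # synthetic reference depth
    t_a, t_b = (0, 0, 6, 6, 2), (2, 2, 6, 6, 1)
    pred_a = apply_transform(ref, *t_a)                    # stand-ins for network predictions
    pred_b = apply_transform(ref, *t_b)
    print(consistency_loss(pred_a, t_a, pred_b, t_b, ref.shape))        # consistent views
    print(consistency_loss(pred_a, t_a, 1.1 * pred_b, t_b, ref.shape))  # scale-biased view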

The role of this consistency term is to discourage the network from producing depth estimates that are overly dependent on a particular crop or resize configuration. Instead, it encourages the learned representation to remain stable across multiple preprocessing realizations of the same image. This is particularly important in deployment, where the exact preprocessing chain may differ from the one implicitly assumed during model development. By enforcing agreement across transformed views, the model becomes less brittle to geometry-preserving image transformations and more robust to preprocessing-induced shifts in camera interpretation.

The final objective is defined as

$$L = L_{depth} + \lambda_{cons} L_{cons}. \tag{20}$$

Here, $L$ denotes the final training objective, $L_{depth}$ denotes the original depth supervision term of the host model, $L_{cons}$ denotes the preprocessing consistency loss defined above, and $\lambda_{cons}$ is a scalar weighting coefficient that controls the contribution of the consistency term. The coefficient of $L_{depth}$ is implicitly fixed to one, so the objective may be read as $1 \cdot L_{depth} + \lambda_{cons} L_{cons}$. This is a regularized training objective rather than a convex combination of two equally weighted tasks; therefore, the weights are not required to sum to one. This formulation keeps the training setup simple while directly targeting the consistency problem that motivates this study.

4. Experimental Results

This section presents empirical results corresponding to the analysis and mitigation framework introduced in Section 3. Rather than benchmarking full model performance alone, Sections 4.1, 4.2, and 4.3 are designed as controlled sensitivity analyses that isolate how different forms of crop–resize geometry mismatch affect monocular depth prediction under explicit and structured preprocessing variations. Specifically, we examine the sensitivity of depth estimation to focal-length scaling errors, principal-point shifts, and more realistic combined crop–resize mismatch conditions, including partially corrected intrinsics. These experiments are intended to quantify not only whether mismatch degrades performance but also how the degradation pattern changes depending on the geometric source and severity of the inconsistency. Based on these observations, Section 4.4 evaluates the proposed Mismatch-Aware Camera Module (MACM) as a lightweight plug-in for improving robustness under both matched and mismatched preprocessing conditions. To broaden the evaluation, Section 4.5 extends the study to recent SOTA-style baselines and cross-dataset fixed-protocol validation. These additional experiments are not intended as a full leaderboard retraining study; instead, they test whether the same crop–resize-induced mismatch pattern remains visible across representative model families and image distributions.

Unless otherwise stated, the controlled MACM experiments use a fixed network input resolution of $384 \times 384$, AdamW optimization with learning rate $1 \times 10^{-4}$ and weight decay $1 \times 10^{-4}$, batch size 8, and $\lambda_{cons} = 0.1$. Reported standard deviations are computed over repeated subset-level evaluations using the same preprocessing protocol. These settings are included to make the adapter configuration and parameter choices reproducible, while the primary purpose remains sensitivity analysis rather than exhaustive leaderboard optimization.

4.1. Sensitivity to Focal Length Errors

We first examine the effect of resize-induced mismatch on focal-length scaling, which directly changes the mapping between pixels and camera rays in the controlled evaluation protocol. Because resize operations alter the effective focal lengths of the input image, using stale intrinsics under this condition primarily induces a near-global depth scale bias, and this effect is especially visible in explicit-K models whose predictions depend directly on calibrated ray geometry. Figure 3 shows the qualitative resize-mismatch behavior, and Table 1 reports the corresponding quantitative sensitivity across scale factors.

4.2. Sensitivity to Principal Point Shift

We next isolate the effect of crop-induced mismatch on principal-point alignment in order to analyze errors that arise even when focal-length scaling is unchanged. When the image is cropped but the principal point $(c_{x}, c_{y})$ is not translated accordingly, the resulting inconsistency produces spatially varying depth distortion that is typically organized around an incorrect optical origin rather than a simple global scale shift. Figure 4 visualizes this spatially structured error pattern, and Table 2 summarizes the sensitivity across center-crop ratios.

4.3. Robustness Under Realistic Camera Inconsistency

We next examine a more realistic deployment regime in which crop and resize operations are applied sequentially. This combined crop+resize setting is particularly important in practice because it simultaneously introduces focal length scaling errors and principal-point misalignment, thereby affecting both global depth scale fidelity and local geometric consistency. As a result, it provides a stronger test of model robustness than the isolated resize-only or crop-only settings.

To further analyze this regime, we also consider an intermediate case in which the inference-time intrinsics are neither fully mismatched nor fully corrected, but only partially adjusted toward the geometrically consistent solution. Specifically, we define an interpolated intrinsic matrix as

$$K_{\lambda} = (1 - \lambda) K_{in} + \lambda \hat{K}, \tag{21}$$

where $K_{in}$ denotes the stale intrinsics, $\hat{K}$ denotes the correctly updated intrinsics, and $\lambda \in [0, 1]$ controls the degree of correction. In this formulation, $\lambda = 0$ corresponds to the fully mismatched case, $\lambda = 1$ corresponds to the geometrically matched case, and the intermediate setting in Table 3 uses $\lambda = 0.5$. This controlled interpolation allows us to examine whether performance changes abruptly or progressively as the effective camera definition approaches the correct one.
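Equation (21) is an element-wise blend of two intrinsic matrices, as the short NumPy sketch below shows. The function name `interpolate_intrinsics` and the numeric matrices are hypothetical illustration values, not calibration data from the experiments.

```python
import numpy as np

def interpolate_intrinsics(k_in, k_hat, lam):
    """Blend stale and updated intrinsics as in Eq. (21); lam = 0 is stale, lam = 1 is matched."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    k_in = np.asarray(k_in, dtype=np.float64)
    k_hat = np.asarray(k_hat, dtype=np.float64)
    return (1.0 - lam) * k_in + lam * k_hat

if __name__ == "__main__":
    k_in = [[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]]    # hypothetical stale K
    k_hat = [[300.0, 0.0, 128.0], [0.0, 300.0, 104.0], [0.0, 0.0, 1.0]]   # hypothetical updated K
    print(interpolate_intrinsics(k_in, k_hat, 0.5))
```

Since the bottom row of both matrices is `[0, 0, 1]`, the blend remains a valid intrinsic matrix for every admissible value of the coefficient.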

Taken together, these results provide a more practical view of deployment-time camera inconsistency. Rather than analyzing only fully matched or fully mismatched cases, this subsection shows how depth estimation performance evolves under more realistic conditions in which multiple preprocessing effects coexist and camera correction may be incomplete.

4.4. Mitigation via the Mismatch-Aware Camera Module

Based on the failure patterns observed in the previous subsections, we next evaluate whether the proposed Mismatch-Aware Camera Module (MACM) can directly mitigate preprocessing-induced camera inconsistency. While Section 4.1, Section 4.2 and Section 4.3 establish that crop–resize intrinsics mismatch systematically degrades monocular depth estimation and that performance progressively recovers as camera consistency is restored, the key remaining question is whether such degradation can be reduced by an explicit preprocessing-aware conditioning module. To answer this question, we perform an ablation study under both matched and mismatched preprocessing settings.

Table 4 compares the baseline model, partial variants of the proposed module, the full MACM, and the full model with the additional preprocessing consistency loss. The “Meta only” variant uses only the deterministic preprocessing metadata described in Section 3.3, whereas the “Image only” variant uses only image-conditioned camera cues derived from the intermediate feature representation. The full MACM combines both components, and the final variant further incorporates the consistency objective defined in Section 3.5. This design allows us to isolate the contribution of each component and examine whether their combination improves robustness beyond what can be achieved by either source alone.

The results show that the baseline model exhibits a clear performance drop when the preprocessing and intrinsics are mismatched, resulting in a robustness gap of 0.038 in Abs.Rel. Adding only metadata-aware conditioning reduces this gap to 0.027, indicating that explicit knowledge of crop, resize, and padding parameters already provides useful cues for compensating preprocessing-induced geometry changes. Using only image-aware conditioning also improves robustness, reducing the gap to 0.031, which suggests that appearance-dependent camera cues extracted from the feature representation can partially capture the practical effect of geometric inconsistency. When both components are combined in the full MACM, the gap further decreases to 0.021, while the matched-setting performance is also slightly improved. This result indicates that deterministic preprocessing metadata and image-conditioned cues play complementary roles in stabilizing depth prediction under crop–resize intrinsics mismatch.

Finally, adding the proposed preprocessing consistency loss yields the smallest robustness gap of 0.017 and the strongest overall performance under mismatched settings. This suggests that enforcing agreement across differently transformed views of the same image helps the model learn representations that are less sensitive to a particular preprocessing configuration. Overall, these findings support the main claim of this paper: explicitly incorporating preprocessing-aware camera information into the network is an effective and practical strategy for mitigating deployment-time crop–resize intrinsics mismatch without changing the standard single-image inference pipeline.

Here, the robustness gap is defined as the mismatched Abs.Rel minus the matched Abs.Rel, so that a smaller value indicates stronger robustness to preprocessing-induced inconsistency. Although the absolute gains are moderate, the robustness gap decreases from 0.038 in the baseline to 0.017 with MACM + $L_{cons}$, a reduction of approximately 55%. The mismatched Abs.Rel also improves from 0.141 to 0.114, supporting the claim that preprocessing-aware conditioning mitigates deployment-time intrinsics inconsistency.
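The percentages quoted above follow directly from the reported numbers, as the short check below illustrates. The matched Abs.Rel values are not quoted in this paragraph; they are inferred here by subtracting the gap from the mismatched error, which assumes the gap is the mismatched value minus the matched value.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

gap_reduction = relative_reduction(0.038, 0.017)        # robustness gap, baseline vs. MACM + L_cons
mismatch_reduction = relative_reduction(0.141, 0.114)   # mismatched Abs.Rel

# Matched Abs.Rel implied by the gap definition (inferred, not quoted from a table)
matched_baseline = round(0.141 - 0.038, 3)
matched_full = round(0.114 - 0.017, 3)

if __name__ == "__main__":
    print(f"gap reduction: {gap_reduction:.1%}")
    print(f"mismatched Abs.Rel reduction: {mismatch_reduction:.1%}")
    print(f"implied matched Abs.Rel: {matched_baseline} -> {matched_full}")
```

The implied matched values stay close to each other, which is in line with the statement that accuracy under matched preprocessing is preserved.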

To avoid overstating this gain, we interpret the result together with the repeated subset-level variability reported in the tables. When individual paired subset scores are available, the appropriate statistical validation is a paired comparison between the baseline and MACM-based variants under the same transformed images, using a paired t-test or a non-parametric Wilcoxon signed-rank test. Because the present study is designed as a controlled sensitivity analysis rather than a large repeated-run benchmark, we treat significance testing as supplementary evidence and report the remaining need for broader repeated-run validation in the limitations.

4.5. Extended SOTA and Cross-Dataset Validation

To strengthen the experimental evaluation, we add an extended fixed-protocol validation that compares recent state-of-the-art depth estimators and tests generalization across datasets. The goal is not to claim a new leaderboard ranking but to verify whether the same image–intrinsics inconsistency appears across representative model families when the image transform is held fixed.

The extended comparison includes explicit or canonical camera-aware metric models, image-calibrated metric models, relative-to-metric hybrid models, and strong general monocular depth baselines. UniDepth [4] and Metric3D v2 [16] represent camera-aware metric prediction, Depth Pro [10] and Depth Any Camera [6] represent recent zero-shot metric depth approaches with image- or camera-generalized calibration behavior, ZoeDepth [11] represents a relative-to-metric hybrid design, and Depth Anything V2 [17] and DPT-Large [2] represent strong general monocular depth baselines. All models are evaluated under the same combined crop + resize protocol. For models that explicitly consume intrinsics, the matched condition uses the updated intrinsics $\hat{K} = A K$, whereas the mismatched condition uses stale intrinsics $K_{in}$. For models that do not directly consume $K$, the same transformed images are evaluated under the identical preprocessing protocol, and scale/shift alignment is used only for evaluation so that the comparison remains focused on robustness to preprocessing geometry rather than on absolute scale calibration.

Table 5 supports two observations. First, explicit camera-aware models show larger matched–mismatched gaps because stale intrinsics directly alter the ray geometry used by the model. Second, calibration-agnostic or image-calibrated models show smaller gaps, but the gap does not vanish because crop and resize still change field-of-view statistics, object scale, and spatial layout. MACM is therefore not positioned as a replacement for these SOTA models; rather, it targets the deployment consistency condition that remains necessary when image preprocessing and camera geometry are coupled.

To test whether this behavior is specific to one benchmark, we additionally apply the same combined crop + resize protocol to three validation domains: NYU-Depth v2 for indoor RGB-D scenes [18], KITTI for outdoor driving scenes [19], and ScanNet for indoor video-derived RGB-D scenes [20]. No dataset-specific retraining is performed in this cross-dataset check. Instead, the host baseline and the host model with MACM + $L_{cons}$ are compared on the same transformed images in each dataset. The reported p-values are obtained from paired t-tests over five non-overlapping subset-level Abs.Rel scores, where each pair uses the same images and preprocessing transform. Thus, the statistical test measures whether the adapter reduces mismatched error under the same evaluation samples, not whether one dataset is easier than another.
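The paired test over five subset-level scores can be sketched with the standard library alone. The scores below are placeholders invented purely to make the example runnable; they are not results from this study. The sketch computes only the t statistic and compares it with the two-sided 5% critical value for four degrees of freedom (2.776); if SciPy is available, `scipy.stats.ttest_rel` returns the p-value directly.

```python
import math
import statistics

def paired_t_statistic(baseline, adapted):
    """t statistic and degrees of freedom for paired differences (baseline - adapted)."""
    if len(baseline) != len(adapted):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for b, a in zip(baseline, adapted)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)   # sample std with n - 1 in the denominator
    return statistics.fmean(diffs) / standard_error, n - 1

if __name__ == "__main__":
    # Placeholder subset-level Abs.Rel scores (hypothetical, NOT taken from the paper)
    baseline = [0.140, 0.143, 0.139, 0.142, 0.141]
    adapted = [0.115, 0.113, 0.116, 0.112, 0.114]
    t_stat, dof = paired_t_statistic(baseline, adapted)
    T_CRIT_DF4 = 2.776   # two-sided 5% critical value of Student's t with 4 degrees of freedom
    print(t_stat, dof, abs(t_stat) > T_CRIT_DF4)
```

With only five pairs, an exact two-sided Wilcoxon signed-rank test cannot yield a p-value below 2/32 = 0.0625, so the non-parametric alternative has limited resolution at this sample size.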

The cross-dataset results in Table 6 provide stronger evidence for generalization than the previous single-dataset analysis. The absolute error levels differ across datasets because indoor RGB-D, outdoor driving, and video-derived indoor scenes have different depth ranges, camera configurations, and scene statistics. Nevertheless, the direction of the effect is consistent: the mismatched Abs.Rel decreases after introducing MACM and the consistency objective, and the paired tests yield small p-values under the fixed subset protocol. This supports the claim that crop–resize-induced intrinsics mismatch is not merely a dataset-specific artifact. At the same time, because this experiment uses fixed-protocol evaluation rather than full retraining of every SOTA model on every dataset, we interpret the result as cross-dataset robustness validation rather than as a comprehensive SOTA leaderboard.

4.6. Component-Level and Computational Validation

We further decompose the metadata vector defined in Section 3.3 into semantically meaningful component groups. The scale group $(s_{x}, s_{y})$ mainly describes focal-length scaling, the crop-offset group $(o_{x}, o_{y})$ describes principal-point displacement, the padding group $(p_{x}, p_{y})$ captures coordinate shifts introduced by letterboxing, and the dimension group supports normalization across image sizes. Each variant uses the same host model and the same training protocol, but removes all metadata components outside the selected group. Table 7 reports the resulting component-level ablation.
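One simple way to realize such group-wise variants is to mask the metadata vector so that only the selected group survives. The index layout follows Equation (16); the dictionary `GROUPS`, the function `keep_group`, and the choice of zero as the fill value for removed components are assumptions, since the text does not state how removal is implemented.

```python
# Index layout of m = [o_x, o_y, s_x, s_y, p_x, p_y, W, H, W_r, H_r] from Eq. (16)
GROUPS = {
    "crop": (0, 1),
    "scale": (2, 3),
    "pad": (4, 5),
    "dims": (6, 7, 8, 9),
}

def keep_group(m, group, fill=0.0):
    """Return a copy of m in which every component outside `group` is replaced by `fill`."""
    keep = set(GROUPS[group])
    return [value if index in keep else fill for index, value in enumerate(m)]

if __name__ == "__main__":
    m = [0.1, 0.07, 0.5, 0.5, 0.0, 0.08, 0.62, 0.47, 0.25, 0.2]   # hypothetical normalized vector
    print(keep_group(m, "crop"))
    print(keep_group(m, "scale"))
```

Masking keeps the input dimensionality fixed at ten, so the same metadata encoder can be reused across all ablation variants.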

The component ablation indicates that crop-offset metadata is particularly important under combined mismatch because principal-point displacement creates spatially structured errors that cannot be represented by focal scaling alone. Scale metadata remains useful for reducing global bias, while padding and image-dimension metadata mainly stabilize normalization across aspect ratio changes. The full metadata branch outperforms any single group, and the image-conditioned branch further improves robustness by capturing residual camera cues that are not fully described by explicit preprocessing parameters.

Finally, we report the computational cost of the proposed adapter separately from the training-only consistency loss. This separation is important because $L_{cons}$ requires two transformed views during training, whereas inference uses only a single preprocessed image and its metadata. Table 8 summarizes the estimated computational overhead.

These additional results support the intended design of MACM as a lightweight adapter. The inference overhead is limited to low-dimensional metadata encoding, pooled-feature projection, and channel-wise affine modulation. The larger cost of the consistency objective is confined to training because it requires an additional transformed view and inverse warping. Therefore, the proposed mitigation can improve robustness to deployment-time intrinsics inconsistency without changing the standard single-image inference pipeline.

5. Discussion and Limitations

The proposed formulation addresses a specific but practically important source of error: the loss of consistency between a preprocessed image and the intrinsics used during inference. This setting differs from full unknown-camera learning. In our analysis, a nominal camera matrix is available, but it becomes stale because crop, resize, or padding operations are not propagated to the effective intrinsics. In real systems, this deterministic mismatch can coexist with broader uncertainty sources such as calibration noise, missing metadata, sensor noise, temporal motion, or dataset shift. Modeling these factors jointly would require multi-dimensional uncertainty recognition and quantification beyond the controlled protocol used here.

The availability of preprocessing metadata is another practical assumption. In a controlled deployment pipeline, crop offsets, resize factors, padding offsets, and image dimensions are available directly from preprocessing code. However, third-party libraries, legacy systems, or undocumented image pipelines may not expose these values. In such cases, robust deployment requires explicit metadata logging, synchronized intrinsics updates, or approximate metadata estimation. The image-conditioned branch of MACM may provide complementary cues when metadata are incomplete, but it should not be viewed as a substitute for correct camera geometry bookkeeping.

A practical mitigation strategy is therefore to treat preprocessing as part of the camera pipeline rather than as an isolated image transform. The preprocessing function should return both the transformed image and a metadata record containing crop offsets, resize scales, padding offsets, and output dimensions. The same record can be used to update K, construct the MACM metadata vector, and audit whether inference uses a matched or stale camera definition. When metadata cannot be recovered exactly, approximate metadata estimation should be reported as an uncertainty source rather than treated as an exact correction.
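The bookkeeping described above can be sketched as follows: a metadata record yields an affine pixel-coordinate map, which updates the intrinsics and lets the pipeline audit whether the matrix passed to the model is stale. The sketch assumes a crop, then resize, then pad order, zero skew, and no half-pixel offset from the resize implementation; the function names and numeric values are hypothetical.

```python
import numpy as np

def affine_from_record(ox, oy, sx, sy, px, py):
    """Pixel map for crop -> resize -> pad: u' = sx * (u - ox) + px, v' = sy * (v - oy) + py."""
    return np.array([[sx, 0.0, px - sx * ox],
                     [0.0, sy, py - sy * oy],
                     [0.0, 0.0, 1.0]])

def update_intrinsics(k, record):
    """Effective intrinsics of the transformed image: K_hat = A K."""
    return affine_from_record(**record) @ np.asarray(k, dtype=np.float64)

def is_stale(k_used, k_original, record, tol=1e-6):
    """Audit check: True if the intrinsics given to the model do not match the preprocessing."""
    return not np.allclose(k_used, update_intrinsics(k_original, record), atol=tol)

if __name__ == "__main__":
    k = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])   # hypothetical K
    record = dict(ox=64, oy=32, sx=0.5, sy=0.5, px=0, py=0)                      # hypothetical record
    k_hat = update_intrinsics(k, record)
    print(k_hat)
    print(is_stale(k, k, record), is_stale(k_hat, k, record))
```

Running the audit at the model boundary turns a silent geometric error into an explicit check that can be logged or asserted.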

Finally, the experiments remain controlled sensitivity analyses, but the evaluation includes extended SOTA-style baselines, repeated subset-level standard deviations, paired statistical testing, component-level metadata ablations, and cross-dataset fixed-protocol validation. These additions strengthen the empirical support for the proposed mitigation strategy while preserving the intended scope of the paper. The remaining limitation is that the study does not perform full retraining and hyperparameter tuning for every SOTA model on every dataset, nor does it validate the method inside all possible real deployment pipelines.

6. Conclusions

In this work, we investigated crop-resize intrinsics mismatch as a practical but often overlooked source of error in monocular metric depth estimation. Our analysis showed that when image preprocessing changes the effective image geometry but camera intrinsics are not updated accordingly, depth predictions can degrade substantially even when the underlying model remains unchanged. In particular, focal length inconsistency mainly introduces global scale errors, whereas principal-point inconsistency leads to spatially structured distortions in the predicted depth. This distinction supports the need to characterize mismatch properties rather than treating all preprocessing changes as a single generic perturbation.

To address this issue, we formalized the geometric relationship between affine image preprocessing and effective camera intrinsics and showed that standard transformations such as crop, resize, and padding admit a deterministic update rule for synchronizing image geometry and calibration parameters. Based on this formulation, we designed a controlled evaluation protocol that isolates the effect of preprocessing-induced mismatch from other model- and dataset-specific factors.

We further presented the Mismatch-Aware Camera Module (MACM) as a practical mitigation strategy for cases where perfect preprocessing consistency is difficult to guarantee. By jointly exploiting metadata-aware and image-aware cues, MACM improves robustness under mismatched conditions while remaining compatible with existing camera-aware depth estimation pipelines. The ablation results further show that MACM consistently narrows the robustness gap between matched and mismatched preprocessing conditions and that the additional consistency objective provides the strongest mitigation effect. In particular, the robustness gap is reduced from 0.038 to 0.017 in the reported ablation, while mismatched Abs.Rel improves from 0.141 to 0.114. Overall, our findings emphasize that the image and its intrinsics should be treated as a coupled representation throughout the full preprocessing and inference pipeline. The extended baseline and cross-dataset validation further indicate that this consistency requirement persists across representative model families and image distributions, although broader, real-pipeline validation remains a useful direction for future work.

Author Contributions

Conceptualization, H.K. and D.L.; methodology, H.K.; software, H.K.; validation, H.K. and D.L.; formal analysis, H.K.; investigation, H.K.; resources, D.L.; data curation, H.K.; writing—original draft preparation, H.K.; writing—review and editing, H.K. and D.L.; visualization, H.K.; supervision, D.L.; project administration, D.L.; funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Republic of Korea (grant number 2022R1I1A3069352).

Data Availability Statement

The public benchmark datasets used in this study are available from their respective original sources. Additional information supporting the reported results may be made available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Camera calibration and preprocessing consistency model. The pixel-to-ray mapping depends on the intrinsic matrix $K$; after an affine preprocessing transform $\hat{u} = A(m) u$, the effective intrinsics must be updated to $K_{eff} = A(m) K$ to preserve the same normalized camera ray.
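The update rule in Figure 1 can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes zero skew, a crop-then-resize-then-pad order, and a pixel convention in which coordinates scale directly with the resize factor. The class names `Intrinsics` and `Preproc`, the helper functions, and all numeric values are hypothetical; the field names mirror the metadata symbols $(s_{x}, s_{y}, o_{x}, o_{y}, p_{x}, p_{y})$ used in Table 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    fx: float
    fy: float
    cx: float
    cy: float


@dataclass(frozen=True)
class Preproc:
    """Hypothetical preprocessing record: crop, then resize, then pad."""
    ox: float = 0.0  # pixels cropped from the left
    oy: float = 0.0  # pixels cropped from the top
    sx: float = 1.0  # horizontal resize factor
    sy: float = 1.0  # vertical resize factor
    px: float = 0.0  # left padding added after resizing
    py: float = 0.0  # top padding added after resizing


def transform_pixel(m: Preproc, u: float, v: float) -> tuple:
    """Affine map u_hat = A(m) u, written per axis as s * (u - o) + p."""
    return (m.sx * (u - m.ox) + m.px, m.sy * (v - m.oy) + m.py)


def effective_intrinsics(K: Intrinsics, m: Preproc) -> Intrinsics:
    """K_eff = A(m) K for a zero-skew pinhole camera."""
    return Intrinsics(
        fx=m.sx * K.fx,
        fy=m.sy * K.fy,
        cx=m.sx * (K.cx - m.ox) + m.px,
        cy=m.sy * (K.cy - m.oy) + m.py,
    )


def pixel_to_ray(K: Intrinsics, u: float, v: float) -> tuple:
    """Normalized camera ray (x/z, y/z) for pixel (u, v)."""
    return ((u - K.cx) / K.fx, (v - K.cy) / K.fy)


# Illustrative values only (not taken from the paper).
K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
m = Preproc(ox=40.0, oy=30.0, sx=0.5, sy=0.5)
u_hat, v_hat = transform_pixel(m, 400.0, 300.0)

ray_original = pixel_to_ray(K, 400.0, 300.0)
ray_matched = pixel_to_ray(effective_intrinsics(K, m), u_hat, v_hat)
ray_stale = pixel_to_ray(K, u_hat, v_hat)  # stale K applied to the new image
```

With the updated intrinsics, `ray_matched` coincides with `ray_original`, whereas `ray_stale` points in a different direction, which is the image and intrinsics inconsistency studied here. Resizers that use half-pixel centers would need an extra half-pixel term in the principal-point update.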

Figure 2. Center-crop example illustrating principal-point displacement: (1) original image ($W = 812$, $H = 610$) and (2) center-cropped image (ratio $0.7$, $W = 568$, $H = 427$). Cropping changes the image origin while preserving pixel pitch, so the principal point must be translated to maintain camera-ray consistency.
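As a worked instance of the crop in Figure 2, the principal-point shift follows directly from the stated image sizes. The sketch below assumes an exactly centered crop with offsets $o_{x} = (W - W_{c})/2$ and $o_{y} = (H - H_{c})/2$ and no subsequent resizing; $W_{c}$ and $H_{c}$ are illustrative names for the cropped width and height, and an implementation that rounds offsets to whole pixels would use 91 or 92 instead of 91.5.

```latex
\begin{aligned}
o_x &= \tfrac{1}{2}(812 - 568) = 122, & o_y &= \tfrac{1}{2}(610 - 427) = 91.5,\\
c_x' &= c_x - 122, & c_y' &= c_y - 91.5,\\
f_x' &= f_x, & f_y' &= f_y.
\end{aligned}
```

Under these assumptions, passing the uncropped $(c_{x}, c_{y})$ to the model displaces the assumed optical center by 122 pixels horizontally and 91.5 pixels vertically in the cropped frame, while the focal lengths remain valid.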

Figure 3. Qualitative analysis of resize-induced focal length mismatch. (a) Input image. (b) Depth prediction with geometrically matched effective intrinsics. (c) Depth prediction using stale intrinsics after resizing. (d) Absolute difference between the matched and mismatched predictions. The error pattern primarily reflects a near-global metric scale bias caused by incorrect focal length scaling.

Figure 4. Qualitative analysis of crop-induced principal-point mismatch. (a) Center-cropped input image. (b) Depth prediction with corrected effective intrinsics. (c) Depth prediction with an uncorrected principal point after cropping. (d) Absolute difference map between the two predictions. Unlike focal length mismatch, principal-point mismatch produces spatially structured distortion around an incorrect optical origin.

Table 1. Quantitative sensitivity to resize-induced focal length mismatch under different scale factors. MeanAbs reports the mean absolute difference between matched and mismatched predictions, Std reports repeated subset-level variability, and Positive95Abs reports the 95th-percentile positive absolute difference.

Scale s	MeanAbs	Std	Positive95Abs
0.5	0.6337	0.0412	0.9419
0.75	0.3241	0.0287	0.4820
1.0	0.0000	0.0000	0.0000
1.25	0.3270	0.0304	0.5149

Table 2. Sensitivity to crop-induced principal-point shift under center-crop ratios. MeanAbs summarizes the average mismatch response, Std reports repeated subset-level variability, and Positive95Abs captures high-error regions caused by spatially structured ray displacement.

Crop ratio	MeanAbs	Std	Positive95Abs
0.7	0.2939	0.0261	0.3754
0.8	0.0623	0.0094	0.1143
0.9	0.0243	0.0061	0.0548
1.0	0.0000	0.0000	0.0000

Table 3. Comparison of scale/shift alignment settings across representative monocular depth estimation models. The Mismatch, Interpolated, and Correct rows correspond to $λ = 0$, $λ = 0.5$, and $λ = 1$ in the intrinsic interpolation protocol, respectively.

Model	Intrinsics Setting	Abs.Rel	RMSE	$δ_{1}$ ($δ < 1.25$)
UniDepth	Mismatch	0.142	0.612	0.842
UniDepth	Interpolated	0.112	0.541	0.887
UniDepth	Correct	0.098	0.509	0.904
Depth Anything	Mismatch	0.118	0.576	0.875
Depth Anything	Interpolated	0.113	0.566	0.881
Depth Anything	Correct	0.110	0.560	0.886
DPT-Large	Mismatch	0.126	0.589	0.862
DPT-Large	Interpolated	0.121	0.579	0.868
DPT-Large	Correct	0.118	0.572	0.872
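One way to read the three settings in Table 3 is as points on a blend between stale and corrected intrinsics. The sketch below assumes a simple linear interpolation of each intrinsic parameter, with $λ = 0$ returning the stale values and $λ = 1$ the corrected ones; the exact interpolation used in the evaluation protocol is not restated in this part of the paper, so the linear form, the function name `interpolate_intrinsics`, and the numbers are assumptions for illustration.

```python
def interpolate_intrinsics(stale, correct, lam):
    """Blend (fx, fy, cx, cy) tuples; lam=0 is stale, lam=1 is corrected.

    Assumption: a plain linear blend per parameter.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return tuple((1.0 - lam) * s + lam * c for s, c in zip(stale, correct))


# Illustrative values only (not taken from the paper).
K_stale = (500.0, 500.0, 320.0, 240.0)
K_correct = (250.0, 250.0, 140.0, 105.0)

for lam in (0.0, 0.5, 1.0):
    print(lam, interpolate_intrinsics(K_stale, K_correct, lam))
```

Sweeping `lam` between 0 and 1 gives a controlled way to vary the severity of the mismatch while the input image stays fixed.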

Table 4. Ablation study of MACM under matched and mismatched preprocessing settings. The table reports depth accuracy for both inference conditions and the robustness gap, defined as the Abs.Rel difference between mismatched and matched settings.

Method	Meta	Image	$L_{cons}$	Abs.Rel ↓ (Matched)	Abs.Rel ↓ (Mismatched)	RMSE ↓ (Matched)	RMSE ↓ (Mismatched)	$δ_{1}$ ↑ (Matched)	$δ_{1}$ ↑ (Mismatched)	Gap ↓ (Abs.Rel)
Baseline	-	-	-	0.103	0.141	0.531	0.618	0.901	0.846	0.038
Meta only	✓	-	-	0.101	0.128	0.526	0.584	0.904	0.870	0.027
Image only	-	✓	-	0.100	0.131	0.523	0.592	0.905	0.867	0.031
MACM	✓	✓	-	0.098	0.119	0.515	0.561	0.909	0.883	0.021
MACM + $L_{cons}$	✓	✓	✓	0.097	0.114	0.509	0.548	0.911	0.890	0.017

Notes: ↓ indicates that lower values are better, whereas ↑ indicates that higher values are better. The symbol ✓ indicates that the component is used, and - indicates that it is not used. Bold values denote the most favorable metric value within each evaluated setting and are provided to improve readability.
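For reference, the quantities reported in Table 4 can be computed as sketched below. The snippet uses the standard definitions of Abs.Rel, RMSE, and $δ_{1}$ over valid pixels and takes the robustness gap as the mismatched minus the matched Abs.Rel, as stated in the caption. The function names are illustrative, and the masking and depth-range choices of the actual evaluation are not reproduced here.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error in the units of the depth values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below thresh."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thresh)
    return hits / len(gt)


def robustness_gap(mismatched_abs_rel, matched_abs_rel):
    """Abs.Rel under mismatched minus Abs.Rel under matched preprocessing."""
    return mismatched_abs_rel - matched_abs_rel


# Baseline row of Table 4: mismatched 0.141, matched 0.103.
print(round(robustness_gap(0.141, 0.103), 3))
```

Applied to the baseline row, the gap evaluates to 0.038, and to 0.017 for the MACM + $L_{cons}$ row (0.114 and 0.097).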

Table 5. Extended comparison with recent SOTA-style baselines under the combined crop + resize protocol. Values are reported as mean ± standard deviation over five repeated subset-level evaluations on the same fixed validation subset, and the Gap column isolates the degradation caused by stale intrinsics or preprocessing-induced geometry changes.

Model	Representative Role	Matched Abs.Rel ↓	Mismatched Abs.Rel ↓	Gap ↓	Mismatched $δ_{1}$ ↑
UniDepth	explicit camera-aware metric model	$0.098 \pm 0.004$	$0.142 \pm 0.006$	0.044	0.842
Metric3D v2	canonical camera-space metric model	$0.101 \pm 0.005$	$0.132 \pm 0.006$	0.031	0.861
Depth Pro	image-inferred focal length metric model	$0.105 \pm 0.004$	$0.127 \pm 0.005$	0.022	0.871
Depth Any Camera	camera-generalized metric model	$0.106 \pm 0.005$	$0.124 \pm 0.005$	0.018	0.874
ZoeDepth	relative-to-metric hybrid model	$0.116 \pm 0.006$	$0.129 \pm 0.006$	0.013	0.864
Depth Anything V2	large-scale general depth baseline	$0.108 \pm 0.004$	$0.116 \pm 0.005$	0.008	0.882
DPT-Large	transformer dense-prediction baseline	$0.118 \pm 0.005$	$0.126 \pm 0.006$	0.008	0.862
Host model + MACM + $L_{cons}$	proposed mismatch-aware adapter	$0.097 \pm 0.003$	$0.114 \pm 0.004$	0.017	0.890

Notes: ↓ indicates that lower values are better, whereas ↑ indicates that higher values are better. Bold values denote the most favorable metric value within each evaluated setting and are provided to improve readability.

Table 6. Cross-dataset fixed-protocol validation under combined crop + resize mismatch. The p-value is computed by a paired t-test comparing mismatched Abs.Rel of the host baseline and MACM + $L_{cons}$ over five paired subset-level measurements, using the same images and preprocessing transforms for each pair.

Dataset	Domain	Baseline Mismatched Abs.Rel ↓	MACM + $L_{cons}$ Mismatched Abs.Rel ↓	Relative Reduction	Paired p-Value	MACM Mismatched $δ_{1}$ ↑
NYU-Depth v2	indoor RGB-D	$0.141 \pm 0.005$	$0.114 \pm 0.004$	18.9%	$2.19 \times 10^{-5}$	0.890
KITTI	outdoor driving	$0.128 \pm 0.005$	$0.111 \pm 0.004$	13.3%	$1.14 \times 10^{-5}$	0.902
ScanNet	indoor video RGB-D	$0.153 \pm 0.006$	$0.125 \pm 0.005$	18.4%	$2.52 \times 10^{-5}$	0.873

Notes: ↓ indicates that lower values are better, whereas ↑ indicates that higher values are better.
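The paired test described in the Table 6 caption can be sketched with the standard library alone. The snippet below computes the paired t statistic and, because five pairs give four degrees of freedom, uses the closed-form Student's t distribution for that case to obtain a two-sided p-value. The function names are illustrative, and the input values are placeholders chosen for the example; they are not the subset-level measurements behind Table 6.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic for the mean of the paired differences a[i] - b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def two_sided_p_df4(t):
    """Two-sided p-value of Student's t with exactly 4 degrees of freedom.

    Uses the closed-form CDF for df = 4, so it applies to five pairs only.
    """
    x = abs(t) / math.sqrt(1.0 + t * t / 4.0)
    cdf = 0.5 + 0.375 * x * (1.0 - x * x / 12.0)
    return 2.0 * (1.0 - cdf)


# Placeholder numbers for illustration only; NOT measurements from the paper.
baseline = [0.20, 0.22, 0.19, 0.21, 0.23]
adapted = [0.17, 0.18, 0.17, 0.18, 0.19]

t_stat = paired_t_statistic(baseline, adapted)
p_value = two_sided_p_df4(t_stat)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```

Pairing matters here because each subset is evaluated with the same images and transforms under both methods, so the test operates on per-subset differences rather than on two independent samples.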

Table 7. Component-level metadata ablation under combined crop + resize mismatch. The table separates scale, offset, padding, and dimension cues to show how each metadata group contributes to reducing mismatched error and the robustness gap.

Variant	Metadata Group	Mismatched Abs.Rel ↓	Mismatched RMSE ↓	Gap ↓
Baseline	none	0.141	0.618	0.038
Scale only	$(s_{x}, s_{y})$	0.136	0.604	0.034
Offset only	$(o_{x}, o_{y})$	0.132	0.596	0.030
Padding/dimension only	$(p_{x}, p_{y}, W, H, W_{r}, H_{r})$	0.138	0.609	0.036
Scale + offset	$(s_{x}, s_{y}, o_{x}, o_{y})$	0.129	0.589	0.028
Full metadata branch	all metadata components	0.128	0.584	0.027
Full MACM	metadata + image cue	0.119	0.561	0.021
Full MACM + $L_{cons}$	metadata + image cue + consistency	0.114	0.548	0.017

Notes: ↓ indicates that lower values are better.

Table 8. Estimated computational overhead of MACM relative to the host depth estimator. The table separates the inference-time adapter overhead from the training-only cost of $L_{cons}$, which requires two transformed views but does not change the single-image inference interface.

Configuration	Added Parameters	Added FLOPs	Training-Time Overhead	Inference-Time Overhead
Baseline host model	0	0	1.00×	1.00×
Metadata encoder only	0.05M	0.01G	1.01×	1.01×
Image projection + fusion	0.23M	0.04G	1.02×	1.01×
Full MACM	0.36M	0.07G	1.04×	1.02×
Full MACM + $L_{cons}$	0.36M	0.07G	1.83×	1.02×
