1. Introduction
Monocular depth estimation (MDE) infers scene geometry from a single RGB image and is a key component in robotics navigation, 3D reconstruction, augmented reality, and view synthesis. Recent progress has been driven by transformer-based dense prediction and large-scale training, which improves cross-dataset generalization. MiDaS, a mixed-dataset, zero-shot, monocular depth framework, demonstrates strong transfer by learning robust relative-depth representations [1], while the Dense Prediction Transformer (DPT) introduces Vision Transformer backbones for dense prediction and improves global coherence in depth maps [2]. Depth Anything further leverages large-scale, unlabeled data to strengthen robustness on in-the-wild images and diverse domains [3].
However, many real applications require metric depth, not just relative ordering. Recovering depth in physical units generally depends on camera calibration, particularly the intrinsics matrix K (focal lengths and principal point). As a result, several recent approaches either condition prediction on camera information or explicitly model cross-camera variation. UniDepth includes a camera self-prompting mechanism for metric depth generalization across diverse cameras, Metric3D introduces canonical camera-space reasoning, and Depth Any Camera extends metric estimation to substantially different camera fields of view [4,5,6]. These methods substantially improve camera-aware metric prediction, but their main emphasis is cross-camera generalization, calibration-aware representation learning, or canonicalization. They do not explicitly isolate the implementation-level failure that occurs when an image is geometrically transformed by deployment preprocessing while the associated intrinsics remain stale.
Despite recent progress in camera-aware and metric monocular depth estimation, deployment pipelines still break geometric consistency when crop/resize operations are applied without synchronized intrinsics updates. In deployment, images are commonly resized to a fixed network resolution, center-cropped to match an aspect ratio, or padded to form square tensors. If K is not updated while the image is transformed, the model receives an inconsistent pair, which we refer to as crop-resize intrinsics mismatch. This mismatch is especially damaging for camera-aware metric-depth models, where K influences projection geometry, depth scaling, and camera-conditioned prompting.
The geometric update rules for resizing, cropping, and padding are well known in camera geometry. The contribution of this work is therefore not to claim these formulas as new but to connect them to a practical failure mode in monocular metric depth inference: stale intrinsics can persist after deployment preprocessing and systematically break the image–camera consistency assumed by modern, camera-aware models.
Our contributions are as follows.
We characterize preprocessing-induced intrinsics mismatch as a parameter-conditioned deterministic error on the effective intrinsics and distinguish focal-length scaling errors from principal-point shift errors.
We provide a parametric affine model for crop, resize, and padding operations, making the update from the nominal intrinsics K to the effective intrinsics explicit and reproducible.
We propose a controlled evaluation protocol that isolates resize-only, crop-only, and combined mismatch settings, enabling clean attribution of depth degradation to each mismatch component.
We show that the proposed Mismatch-Aware Camera Module (MACM) and the preprocessing consistency objective reduce the robustness gap under mismatched preprocessing while preserving matched-setting accuracy.
2. Related Works
2.1. Monocular Depth Estimation and Generalization
Monocular depth estimation has evolved from supervised encoder–decoder models toward transformer-based dense prediction systems and large-scale pretraining strategies. Recent supervised approaches have shown that architectural design and output parameterization remain crucial for improving local detail recovery and metric accuracy. In particular, AdaBins demonstrated that adaptive depth discretization can significantly improve scale precision and fine-grained prediction quality, while DPT showed that transformer-based dense prediction improves global coherence and long-range reasoning in depth maps [2,7].
A major shift in the field came from methods that prioritized cross-dataset transfer and broad-domain robustness rather than dataset-specific optimization. MiDaS showed that strong, zero-shot, monocular depth estimation can be obtained by training on heterogeneous datasets with a scale- and shift-invariant objective, thereby learning a robust relative-depth representation that transfers across diverse domains [1]. DPT further improved global depth coherence by replacing conventional convolutional backbones with Vision Transformers and a dense prediction decoder that aggregates multi-scale token features [2]. More recently, Depth Anything pushed this line further by scaling the data regime with large amounts of unlabeled imagery and pseudo-labeling, while robust evaluation studies under adverse conditions further highlighted the importance of distributional robustness in practical monocular depth estimation [3,8].
These general monocular depth models are highly relevant because they reveal how strongly modern depth estimators rely on large-scale priors, representation learning, and architectural inductive bias. At the same time, most of them are primarily optimized for relative depth or affine-invariant depth prediction rather than explicit geometric consistency under changing camera parameters. As a result, they provide strong baselines for generalization, but do not directly resolve the deployment-time inconsistency that arises when image preprocessing changes the effective camera geometry.
2.2. Metric Depth Estimation and Camera-Aware Modeling
While relative depth is sufficient for ranking scene structure or guiding some downstream perception modules, many robotics, navigation, and 3D reconstruction applications require depth in physical units. This has motivated a large body of work on monocular metric depth estimation, where the model must either use camera calibration explicitly or learn internal representations that remain sensitive to metric scale.
One line of recent research incorporates camera information or camera-related conditioning directly into the prediction process. UniDepth introduces a self-promptable camera module for metric depth estimation across diverse cameras, Depth Prompting studies sensor-agnostic conditioning for depth estimation, and Depth Any Camera extends zero-shot metric depth estimation to cameras with substantially different fields of view [4,6,9]. Depth Pro further highlights a complementary direction by producing sharp, zero-shot metric depth while estimating focal length from the image rather than requiring externally provided intrinsics [10]. This perspective is particularly important in cross-camera settings, where a model trained under one camera configuration may fail when applied to another.
A second line of work seeks to bridge robust relative-depth generalization and metric prediction. ZoeDepth is a notable example, combining strong relative-depth pretraining with lightweight metric heads and a metric bins mechanism so that the model preserves much of the zero-shot generalization of relative-depth methods while recovering metric scale [11]. This family is important because it highlights that metric prediction often benefits from modular add-ons rather than from complete architectural replacement.
A third line of work addresses camera variation and metric scale more explicitly. Metric3D argues that large-scale mixed-camera training requires resolving ambiguity induced by diverse camera models and proposes a canonical camera space transformation module [5]. ZeroDepth further studies zero-shot scale-aware monocular depth estimation across mixed domains and camera settings, while UniDepth moves toward universal metric depth estimation with a self-promptable camera module and a geometric invariance objective [4,12]. Taken together, these methods show that camera modeling is now central to strong metric depth performance, but they largely emphasize calibration-aware prediction, canonicalization, or camera-conditioned representation learning rather than the deterministic intrinsics inconsistency caused by crop/resize operations inside practical preprocessing pipelines.
Our work is most closely related to this camera-aware metric depth family. However, instead of proposing a new universal backbone or camera-prompting framework, we focus on a more fundamental but under-examined source of failure: the fact that standard deployment preprocessing can alter the effective intrinsics even when the nominal camera metadata remains unchanged. In this sense, our problem setting is complementary to existing metric depth approaches because even a strong camera-aware model can degrade if the image and its associated intrinsics become inconsistent after resizing, cropping, or padding.
2.3. Learning with Unknown or Noisy Calibration
Another relevant research direction considers settings in which intrinsics are unavailable, unreliable, or only partially specified. Self-supervised monocular depth learning from video addresses this challenge by coupling depth prediction with ego-motion estimation and using view synthesis as the supervisory signal. Recent methods such as ManyDepth and Lite-Mono improve this paradigm by strengthening temporal geometry usage and lightweight representation design, making self-supervised monocular depth estimation more stable and practical across deployment settings [13,14].
However, many self-supervised pipelines still assume known or approximately valid intrinsics when warping between frames. This means that calibration uncertainty is not completely removed; rather, it is partially absorbed into the photometric training objective or treated as a nuisance variable. Recent scale-aware and camera-aware studies such as ZeroDepth and Depth Any Camera move closer to handling this issue explicitly by modeling cross-camera variation and metric consistency under broader camera regimes [6,12]. Such approaches are highly relevant to real-world deployment because they acknowledge that camera metadata may be missing, stale, or inaccurate.
In this paper, we distinguish three practical uncertainty sources that can affect calibration-aware inference. The first is intrinsic noise, where the nominal focal length or principal point is inaccurate. The second is preprocessing metadata noise, where crop offsets, resize scales, or padding offsets are recorded imprecisely. The third is missing metadata, where the effective preprocessing transform cannot be reconstructed reliably. To define these quantities before using them in the formulation, let $m = (s_x, s_y, o_x, o_y, p_x, p_y)^\top$ denote the geometric preprocessing metadata, where $s_x$ and $s_y$ are resize scales, $o_x$ and $o_y$ are crop offsets, and $p_x$ and $p_y$ are padding offsets. The metadata vector and its observation noise are written as

$$\tilde{m} = m + \epsilon_m. \tag{1}$$

Here, $\epsilon_m$ denotes recording error in the six preprocessing parameters. The affine preprocessing matrix parameterized by the metadata is denoted by $A(m)$ and $K$ denotes the nominal intrinsic matrix. Intrinsic calibration noise in the focal lengths and principal point is modeled separately as

$$\tilde{K} = K + \Delta K, \qquad \Delta K = \begin{bmatrix} \delta f_x & 0 & \delta c_x \\ 0 & \delta f_y & \delta c_y \\ 0 & 0 & 0 \end{bmatrix}. \tag{2}$$

Using these definitions, the geometrically consistent effective intrinsics and the perturbed effective intrinsics are given by

$$K_{\mathrm{eff}} = A(m)\,K, \qquad \tilde{K}_{\mathrm{eff}} = A(\tilde{m})\,\tilde{K}. \tag{3}$$

The effective intrinsics error induced by calibration noise and preprocessing metadata noise is then

$$\Delta K_{\mathrm{eff}} = \tilde{K}_{\mathrm{eff}} - K_{\mathrm{eff}} \approx A(m)\,\Delta K + \sum_{j=1}^{6} \frac{\partial A(m)}{\partial m_j}\,\epsilon_{m,j}\,K. \tag{4}$$

Here, $K_{\mathrm{eff}}$ denotes the geometrically consistent effective intrinsics after preprocessing, $\tilde{K}_{\mathrm{eff}}$ denotes the perturbed effective intrinsics, and $\Delta K_{\mathrm{eff}}$ denotes the resulting effective intrinsics error. The index $j$ runs over the six components of $m$, and the derivative term describes how each metadata perturbation changes the affine preprocessing matrix and therefore the effective camera model. Higher-order terms, including cross terms between metadata noise and intrinsic noise, are omitted under the small-noise assumption. When metadata are missing, $A(m)$ becomes partially unobserved and cannot be deterministically reconstructed; this case is therefore treated as an uncertainty source rather than as an exact correction.

The relevance to depth estimation can be made explicit at the ray level. For a transformed pixel coordinate $x' = (u', v', 1)^\top$, the normalized camera-ray perturbation is defined as

$$\Delta r(x') = \tilde{K}_{\mathrm{eff}}^{-1}\,x' - K_{\mathrm{eff}}^{-1}\,x'. \tag{5}$$

Equations (1)–(5) separate the definitions of preprocessing noise, intrinsic noise, effective intrinsics mapping, first-order propagation, and ray-level perturbation instead of compressing them into a single equation. This ray-level perturbation explains why focal-length uncertainty mainly appears as near-global metric scale bias, while principal-point, crop-offset, and padding-offset uncertainty tends to produce spatially structured depth distortion. Therefore, the uncertainty model is not only a descriptive calibration model; it serves as an analytical bridge between preprocessing uncertainty, image–intrinsics inconsistency, and depth prediction degradation.
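The ray-level argument can be checked numerically. The following is a minimal sketch, assuming a zero-skew pinhole model; the names `Intrinsics`, `ray_xy`, `exact_ray_change`, and `first_order_ray_change` are hypothetical helpers introduced only for illustration. Under this assumption, linearizing the back-projected ray gives a per-axis change of roughly -(df/f) r - dc/f, so focal-length noise rescales the transverse ray components by a common relative factor, whereas principal-point noise adds the same offset to every ray.

```python
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Zero-skew pinhole intrinsics in pixel units (an assumption of this sketch)."""
    fx: float
    fy: float
    cx: float
    cy: float


def ray_xy(k: Intrinsics, u: float, v: float) -> Tuple[float, float]:
    """Transverse components of K^{-1} [u, v, 1]^T; the third component is 1."""
    return (u - k.cx) / k.fx, (v - k.cy) / k.fy


def exact_ray_change(k: Intrinsics, dk: Intrinsics, u: float, v: float) -> Tuple[float, float]:
    """Ray change when the intrinsics are perturbed from k to k + dk."""
    noisy = Intrinsics(*(a + b for a, b in zip(k, dk)))
    rx0, ry0 = ray_xy(k, u, v)
    rx1, ry1 = ray_xy(noisy, u, v)
    return rx1 - rx0, ry1 - ry0


def first_order_ray_change(k: Intrinsics, dk: Intrinsics, u: float, v: float) -> Tuple[float, float]:
    """Linearized change per axis: -(df / f) * r - dc / f."""
    rx, ry = ray_xy(k, u, v)
    return (-(dk.fx / k.fx) * rx - dk.cx / k.fx,
            -(dk.fy / k.fy) * ry - dk.cy / k.fy)


if __name__ == "__main__":
    k_eff = Intrinsics(500.0, 500.0, 320.0, 240.0)       # illustrative values
    focal_noise = Intrinsics(5.0, 5.0, 0.0, 0.0)         # 1 % focal-length error
    center_noise = Intrinsics(0.0, 0.0, 4.0, 0.0)        # 4 px principal-point error
    for u, v in [(320.0, 240.0), (620.0, 240.0), (20.0, 440.0)]:
        print((u, v),
              exact_ray_change(k_eff, focal_noise, u, v),
              exact_ray_change(k_eff, center_noise, u, v))
```

In this sketch the focal-length perturbation vanishes at the principal point and grows linearly with distance from it, while the principal-point perturbation is identical at every pixel.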
Nevertheless, there is a conceptual difference between unknown calibration learning and the problem studied in this paper. In unknown camera settings, the main challenge is to estimate or marginalize over calibration uncertainty. In contrast, our setting often starts from a nominally known camera but encounters a deterministic mismatch after preprocessing. That is, the image is transformed by crop/resize/padding operations, but the intrinsics fed to the model are not transformed accordingly. This failure mode is not fundamentally a lack of calibration; rather, it is a loss of consistency between the image actually seen by the network and the camera model assumed by the inference code. In real-time applications, these two sources of error can coexist: the nominal calibration may be noisy, and the preprocessing metadata may be incomplete or unavailable. We therefore treat full unknown calibration learning as a broader uncertainty problem, while isolating the parameter-conditioned preprocessing mismatch as the primary mechanism analyzed in this paper.
2.4. Calibration Robustness, Camera Geometry, and Deployment Consistency
The broader literature on camera calibration and geometric vision continues to emphasize that projection accuracy depends on correctly modeling the intrinsic matrix and the image formation process. Recent learning-based calibration works, including transformer-based single-image calibration and wide-angle calibration reviews, reinforce that camera parameters remain fundamental geometric quantities rather than secondary metadata [15]. For a pixel $x = (u, v, 1)^\top$ in homogeneous coordinates, the corresponding normalized camera ray is determined by $r = K^{-1} x$. Therefore, if preprocessing changes the pixel coordinate system but the inference code still uses stale intrinsics, the same image location is interpreted as a different camera ray. More explicitly, if an affine preprocessing transform $A$ maps a homogeneous pixel coordinate $x$ to $x' = A x$, deployment consistency requires the paired intrinsics to be updated as $K_{\mathrm{eff}} = A K$. In the geometrically consistent case, $K_{\mathrm{eff}}^{-1} x' = (A K)^{-1} A x = K^{-1} x$, so the same camera ray is preserved. If stale intrinsics $K_{\mathrm{stale}} = K$ are used instead, $K_{\mathrm{stale}}^{-1} x' = K^{-1} A x$ generally differs from the correct ray, producing a ray-level interpretation error. From a geometric standpoint, resizing, cropping, and padding are not mere cosmetic preprocessing operations; they redefine the pixel coordinate system on which the camera model is expressed. This point is often obscured in modern deep learning pipelines where preprocessing is implemented as a data transform rather than as part of the camera model.
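The ray-preservation argument can be illustrated with a few lines of arithmetic. The sketch below assumes a zero-skew pinhole camera and an axis-aligned affine pixel map (scale plus translation); `Intrinsics`, `Affine`, `transform_pixel`, `update_intrinsics`, and `ray_xy` are hypothetical names, and the numeric values are illustrative only.

```python
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Zero-skew pinhole intrinsics in pixel units."""
    fx: float
    fy: float
    cx: float
    cy: float


class Affine(NamedTuple):
    """Axis-aligned affine pixel map: u' = sx * u + tx, v' = sy * v + ty."""
    sx: float
    sy: float
    tx: float
    ty: float


def transform_pixel(a: Affine, u: float, v: float) -> Tuple[float, float]:
    """Apply the preprocessing map to a pixel coordinate."""
    return a.sx * u + a.tx, a.sy * v + a.ty


def update_intrinsics(a: Affine, k: Intrinsics) -> Intrinsics:
    """Entries of the matrix product A K for the affine map above."""
    return Intrinsics(a.sx * k.fx, a.sy * k.fy,
                      a.sx * k.cx + a.tx, a.sy * k.cy + a.ty)


def ray_xy(k: Intrinsics, u: float, v: float) -> Tuple[float, float]:
    """Transverse components of K^{-1} [u, v, 1]^T."""
    return (u - k.cx) / k.fx, (v - k.cy) / k.fy


if __name__ == "__main__":
    k = Intrinsics(1000.0, 1000.0, 960.0, 540.0)
    a = Affine(0.5, 0.5, -160.0, 0.0)       # halve the image, then crop 160 px on the left
    u, v = 1200.0, 700.0
    u2, v2 = transform_pixel(a, u, v)
    print("original ray:", ray_xy(k, u, v))
    print("matched ray: ", ray_xy(update_intrinsics(a, k), u2, v2))
    print("stale ray:   ", ray_xy(k, u2, v2))
```

The matched pairing reproduces the original ray, whereas reusing the nominal intrinsics on the transformed pixel yields a different ray for the same scene point.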
This formulation provides the background needed for the purpose of this research. Deployment preprocessing should be regarded as a coordinate-system transformation coupled with the camera model, not as an isolated image operation. If the image is transformed by the affine map $A$ while the camera model remains the nominal $K$, the model receives a physically inconsistent pair because the transformed pixel grid and the intrinsics no longer describe the same ray geometry. The central problem of this paper is therefore not that resizing or cropping is geometrically unknown but that practical inference code may fail to synchronize this known transformation with the intrinsics used by the depth estimator.
This gap between geometric correctness and engineering practice becomes especially consequential in monocular depth estimation. Camera-aware metric-depth models may explicitly consume intrinsics, while calibration-agnostic or relative-depth models may still learn implicit camera priors tied to training-time image statistics, field-of-view distributions, or common crop conventions. As a result, even methods that do not take K as explicit input can be affected by systematic preprocessing changes, although the effect is typically strongest and most interpretable in metric-depth models that rely on camera consistency more directly.
Viewed from this perspective, our work complements both the camera-aware metric depth estimation and the self-calibration literature. Existing methods address camera sensitivity through conditioning, canonicalization, or implicit calibration recovery. By contrast, we isolate a deterministic and implementation-driven source of inconsistency that can arise even when the original camera is known. Our goal is therefore not to replace prior camera-aware depth models but to provide a principled analysis framework for understanding how preprocessing alters effective intrinsics and why this matters for both performance and robustness.
3. Proposed Methods
This section presents the proposed analysis and mitigation framework for crop–resize geometry mismatch in monocular depth estimation. We first formalize how standard image preprocessing operations deterministically modify the effective camera intrinsics. Based on this formulation, we then define a controlled evaluation protocol to isolate the impact of image–intrinsics inconsistency from other confounding factors. Finally, we introduce a lightweight Mismatch-Aware Camera Module (MACM) and a preprocessing consistency objective designed to improve robustness while preserving the standard single-image inference setting.
Figure 1 provides the camera model overview used in this formulation, and Figure 2 illustrates the crop-induced coordinate shift that motivates the principal-point update.
3.1. Problem Setup and Deterministic Intrinsics Mapping
We denote an RGB image by $I \in \mathbb{R}^{H \times W \times 3}$ and the corresponding camera intrinsics by

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$

where $f_x$ and $f_y$ are the focal lengths in pixel units and $(c_x, c_y)$ denotes the principal point.
Throughout this section, $I$ denotes the original image, $I'$ denotes the image after preprocessing, $K$ denotes the nominal intrinsics before preprocessing, $K'$ denotes the correctly updated intrinsics in the preprocessed image coordinate system, and $K_{\mathrm{stale}}$ denotes stale or unupdated intrinsics used by a mismatched inference pipeline. The affine preprocessing matrix $A$ is parameterized by resize scales $(s_x, s_y)$, crop offsets $(o_x, o_y)$, and padding offsets $(p_x, p_y)$. These symbols are introduced here so that each subsequent inline equation can be interpreted without referring to later definitions.
In practical monocular depth estimation pipelines, the raw input image is rarely fed directly to the network. Instead, it is commonly resized, center-cropped, randomly cropped, or padded to meet the input requirements of the model. Although these operations are often implemented as image-domain preprocessing steps, they also modify the pixel coordinate system in which the camera intrinsics are defined. The scope of this paper is therefore limited to preprocessing-induced intrinsics mismatch, rather than extrinsic calibration errors, temporal scene dynamics, or the full unknown camera problem.
For resizing from $(W, H)$ to $(W', H')$, let the scale factors be $s_x = W'/W$ and $s_y = H'/H$. The intrinsics after resizing are

$$K_{\mathrm{r}} = \begin{bmatrix} s_x f_x & 0 & s_x c_x \\ 0 & s_y f_y & s_y c_y \\ 0 & 0 & 1 \end{bmatrix}.$$

Thus, resize changes the pixel-space focal lengths and the principal point in proportion to the horizontal and vertical scale factors. If the resized image is interpreted with stale intrinsics, the dominant effect is an error in focal-length scaling, which tends to appear as a near-global depth scale bias.
Next, consider cropping with offsets $(o_x, o_y)$ measured from the top-left corner in the resized image coordinates. Cropping preserves focal lengths but shifts the principal point relative to the new image origin:

$$K_{\mathrm{rc}} = \begin{bmatrix} s_x f_x & 0 & s_x c_x - o_x \\ 0 & s_y f_y & s_y c_y - o_y \\ 0 & 0 & 1 \end{bmatrix},$$

where $s_x$ and $s_y$ are the resize scale factors (both equal to 1 in the crop-only case).
Because cropping changes the origin of the image coordinate system without changing the pixel pitch, its main geometric effect is a principal-point displacement. This type of mismatch is therefore expected to produce spatially structured distortion around an incorrect optical origin rather than a simple global scale shift.
Figure 2 illustrates this effect using a center-crop example, where the image origin changes while the focal length remains unchanged.
Resize, crop, and padding can be unified in an affine formulation. If padding offsets $(p_x, p_y)$ are applied after resizing by $(s_x, s_y)$ and cropping at $(o_x, o_y)$, the preprocessing transform acting on homogeneous pixel coordinates can be written as

$$A = \begin{bmatrix} s_x & 0 & p_x - o_x \\ 0 & s_y & p_y - o_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad K' = A K.$$

Here, $(o_x, o_y)$ and $(p_x, p_y)$ are defined in the resized coordinate system. The matrix $A$ explicitly accounts for scale changes, crop-induced origin shifts, and padding-induced coordinate shifts. This parametric affine model provides a reproducible mathematical representation of crop–resize intrinsics mismatch. In addition, MACM can be viewed as a learnable, non-parametric complement that uses metadata and image cues to compensate for residual effects that may not be fully captured by direct symbolic correction.
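A worked instance of the affine update may help make the composition order concrete. The sketch below is illustrative only: it treats pixel coordinates as continuous with the origin at the top-left corner (half-pixel conventions are ignored), composes the transform as pad, then crop, then resize acting from the right, and uses hypothetical helper names (`matmul3`, `scale_matrix`, `shift_matrix`, `preprocessing_affine`, `intrinsics_matrix`). The camera values are made up for the example.

```python
from typing import List

Matrix = List[List[float]]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scale_matrix(sx: float, sy: float) -> Matrix:
    """Resize by (sx, sy)."""
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]


def shift_matrix(dx: float, dy: float) -> Matrix:
    """Translate pixel coordinates by (dx, dy)."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float) -> Matrix:
    """Zero-skew pinhole intrinsics."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def preprocessing_affine(sx: float, sy: float, ox: float, oy: float,
                         px: float, py: float) -> Matrix:
    """Resize first, then crop at (ox, oy), then pad by (px, py)."""
    resize = scale_matrix(sx, sy)
    crop = shift_matrix(-ox, -oy)      # cropping moves the origin to (ox, oy)
    pad = shift_matrix(px, py)         # padding pushes content right and down
    return matmul3(pad, matmul3(crop, resize))


if __name__ == "__main__":
    # 1920x1080 image -> resize to 960x540 -> center crop to 540x540 -> pad 2 px per side.
    k = intrinsics_matrix(1400.0, 1400.0, 960.0, 540.0)
    a = preprocessing_affine(0.5, 0.5, 210.0, 0.0, 2.0, 2.0)
    k_updated = matmul3(a, k)
    for row in k_updated:
        print(row)
```

In this example the updated principal point lands at the center of the 544x544 padded tensor, while the focal lengths are halved; reusing the nominal matrix would leave both quantities wrong.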
The affine mapping above is also the link between the known camera geometry formula and the contribution of this paper. It provides the controlled variable used in our matched/mismatched protocol and defines the metadata that MACM later receives. For a transformed coordinate $x' = A x$, the correct pairing with the updated intrinsics $K' = A K$ preserves ray geometry through $K'^{-1} x' = K^{-1} x$, whereas a mismatched pairing with the stale intrinsics $K_{\mathrm{stale}} = K$ produces $K_{\mathrm{stale}}^{-1} x' \neq K^{-1} x$ in general. Thus, the issue is not the image transform alone, but the loss of synchronization between the transformed image and the camera model used at inference time.
The main problem investigated in this paper arises when the transformed image is paired not with the updated intrinsics $K'$ but with stale or uncorrected intrinsics $K_{\mathrm{stale}}$. In that case, the model is forced to interpret an image generated under one pixel geometry using another inconsistent camera definition.
For consistency with the uncertainty formulation in Section 2.3, we also denote the correctly updated intrinsics by $K_{\mathrm{eff}} = K' = A K$. This notation makes the matched and mismatched inference conditions explicit. Let $f_\theta$ be a monocular metric depth estimator and let $D^{*}$ be the ground-truth depth. The matched condition is

$$\hat{D}_{\mathrm{match}} = f_\theta(I', K_{\mathrm{eff}}),$$

whereas the mismatched condition is

$$\hat{D}_{\mathrm{mis}} = f_\theta(I', K_{\mathrm{stale}}).$$

Because $I'$, $f_\theta$, and the evaluation samples are fixed, the performance difference between these two predictions isolates the effect of image–intrinsics inconsistency. For an error metric $E(\cdot, \cdot)$ such as Abs.Rel, the corresponding robustness gap is written as

$$\Delta_E = E(\hat{D}_{\mathrm{mis}}, D^{*}) - E(\hat{D}_{\mathrm{match}}, D^{*}).$$

A smaller $\Delta_E$ indicates stronger robustness to preprocessing-induced camera inconsistency. This formulation directly links the background geometry to the controlled evaluation protocol and the mitigation role of MACM.
3.2. Controlled Evaluation Protocol
To isolate the effect of geometry mismatch from model-specific and dataset-specific confounders, we design a controlled evaluation protocol in which only the consistency between image preprocessing and camera intrinsics is varied. For each image, we generate a transformed image using a predefined preprocessing operation and then compare two inference conditions. In the matched condition, the transformed image is paired with the correctly updated intrinsics $K_{\mathrm{eff}} = A K$. In the mismatched condition, the same transformed image is instead paired with stale or unupdated intrinsics $K_{\mathrm{stale}} = K$.
The failure modes are obtained through the following controlled procedure. First, a preprocessing transform $A$ is applied to the original image. Second, the effective intrinsics $K_{\mathrm{eff}}$ are computed using the same transform. Third, the transformed image is evaluated twice: once with $K_{\mathrm{eff}}$ and once with $K_{\mathrm{stale}}$. Since the transformed image, model, and evaluation data are fixed, the observed difference isolates the effect of image–intrinsics inconsistency.
We consider three representative mismatch settings that frequently arise in practical preprocessing pipelines: (i) resize-only mismatch, (ii) crop-only mismatch, and (iii) combined crop+resize mismatch. These settings are chosen because they correspond to common deployment patterns such as resizing raw images to a fixed network resolution, center-cropping to match aspect ratios, or applying both operations sequentially. The resize-only condition mainly perturbs focal-length scaling and tends to induce a near-global scale bias in predicted depth. The crop-only condition primarily shifts the principal point and often produces spatially structured distortion associated with an incorrect image origin. The combined condition is the most realistic one in practice, since many pipelines simultaneously alter both focal scaling and principal-point alignment.
To analyze the sensitivity of depth estimation models to the degree of inconsistency, we sweep the mismatch magnitude by varying crop ratios and resize factors over controlled ranges. This setup allows us to examine not only whether performance degrades under mismatch but also how the degradation pattern changes depending on the type and severity of the geometric inconsistency. Because all other experimental conditions are kept fixed, the resulting comparison directly reflects the effect of image–intrinsics mismatch rather than unrelated differences in optimization or architecture.
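The shape of such a sweep can be sketched for the resize-only setting. The example below does not use any of the evaluated depth models; it substitutes a deliberately simple, hypothetical stand-in estimator (`toy_metric_depth`) that recovers depth from the apparent pixel width of an object of known metric width, so that the effect of a stale focal length is visible in closed form. The names `SweepRow` and `resize_sweep` and all numeric values are assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence


class SweepRow(NamedTuple):
    scale: float        # resize factor s applied to the image
    matched: float      # depth predicted with the updated focal length s * fx
    mismatched: float   # depth predicted with the stale focal length fx
    scale_bias: float   # mismatched depth divided by the true depth


def toy_metric_depth(pixel_width: float, fx: float, real_width_m: float) -> float:
    """Hypothetical camera-aware estimator: Z = fx * W / w for a known width W."""
    return fx * real_width_m / pixel_width


def resize_sweep(scales: Sequence[float], fx: float, depth_m: float,
                 real_width_m: float) -> List[SweepRow]:
    """Evaluate the same resized observation with matched and stale focal lengths."""
    rows = []
    for s in scales:
        # Apparent object width in the resized image.
        pixel_width = s * fx * real_width_m / depth_m
        matched = toy_metric_depth(pixel_width, s * fx, real_width_m)
        mismatched = toy_metric_depth(pixel_width, fx, real_width_m)
        rows.append(SweepRow(s, matched, mismatched, mismatched / depth_m))
    return rows


if __name__ == "__main__":
    for row in resize_sweep([0.5, 0.75, 1.0, 1.25], fx=1000.0,
                            depth_m=10.0, real_width_m=2.0):
        print(row)
```

For this toy estimator the matched prediction stays at the true depth for every scale, while the mismatched prediction is off by a factor of 1/s, which is the near-global scale bias pattern that the resize-only setting is intended to expose. Learned estimators need not follow this closed form.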
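We report Abs.Rel, RMSE, and threshold accuracy as separate standard depth metrics. They are defined as

$$\mathrm{Abs.Rel} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{d}_i - d_i|}{d_i}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{d}_i - d_i)^2}, \qquad \delta = \frac{1}{N}\left|\left\{ i : \max\!\left(\frac{\hat{d}_i}{d_i}, \frac{d_i}{\hat{d}_i}\right) < 1.25 \right\}\right|,$$

where $\hat{d}_i$ and $d_i$ are the predicted and ground-truth depths at valid pixel $i$ and $N$ is the number of valid pixels.
The constant 1.25 follows the standard threshold-accuracy protocol widely used in monocular depth estimation. By contrast, crop ratios, resize scales, and the interpolation coefficient used later are not physical constants; they are controlled experimental parameters used to vary mismatch magnitude. Since our evaluation is single-image and static, these thresholds are not designed to model time-varying dynamics. Real-time video settings may introduce additional motion, sensor, or metadata uncertainties, which are discussed as limitations.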
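The metrics and the robustness gap can be sketched as follows. This is a minimal illustration under stated assumptions: depths are flat sequences of strictly positive valid values, the threshold defaults to the conventional 1.25, and the function names (`abs_rel`, `rmse`, `threshold_accuracy`, `robustness_gap`) are hypothetical rather than taken from any evaluation toolkit.

```python
import math
from typing import Callable, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative error over valid pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Root-mean-square error in the units of the depth values."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def threshold_accuracy(pred: Sequence[float], gt: Sequence[float],
                       threshold: float = 1.25) -> float:
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


def robustness_gap(metric: Metric, pred_mismatched: Sequence[float],
                   pred_matched: Sequence[float], gt: Sequence[float]) -> float:
    """Metric under stale intrinsics minus the metric under updated intrinsics."""
    return metric(pred_mismatched, gt) - metric(pred_matched, gt)


if __name__ == "__main__":
    gt = [2.0, 4.0, 8.0]                 # made-up depths in metres
    matched = [2.0, 4.4, 8.0]            # small residual error
    mismatched = [3.0, 6.0, 12.0]        # global 1.5x scale bias
    print("Abs.Rel gap:", robustness_gap(abs_rel, mismatched, matched, gt))
    print("RMSE gap:   ", robustness_gap(rmse, mismatched, matched, gt))
    print("delta:      ", threshold_accuracy(matched, gt),
          threshold_accuracy(mismatched, gt))
```

For error metrics such as Abs.Rel and RMSE a positive gap indicates degradation under mismatch; for threshold accuracy, where higher is better, the sign convention is reversed.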
3.3. Proposed Mismatch-Aware Camera Module
To improve robustness against preprocessing-induced geometry inconsistency, we propose a lightweight Mismatch-Aware Camera Module (MACM). The key motivation is that the preprocessing pipeline already contains deterministic information about how the input image was geometrically transformed before entering the model. However, this information is typically discarded after the transformed image is produced. MACM explicitly reuses this metadata and conditions the feature representation on it, allowing the network to adapt its internal processing according to the effective camera geometry implied by the preprocessing.
MACM should not be interpreted as a fundamentally new feature-wise modulation operator. The affine feature adaptation below is structurally related to FiLM-like conditioning. The novelty of MACM lies in how the conditioning signal is constructed: it explicitly combines geometry-defined preprocessing metadata with image-conditioned camera cues to target crop–resize intrinsics mismatch in monocular metric depth estimation. Accordingly, the method is positioned as a mismatch-aware adapter rather than as a replacement for existing feature modulation mechanisms or camera-aware backbones. Its role is to expose preprocessing-induced geometry changes to the host depth estimator in a lightweight and reproducible form.
Let the external input be a single RGB image $I$. After preprocessing, the model receives the transformed image $I'$. At the same time, the preprocessing pipeline yields a compact metadata vector

$$m = [\,o_x,\; o_y,\; s_x,\; s_y,\; p_x,\; p_y,\; H,\; W,\; H',\; W'\,]^\top.$$

Here, $(o_x, o_y)$ denote crop offsets, $(s_x, s_y)$ denote resize factors, $(p_x, p_y)$ denote padding offsets, and $(H, W)$ and $(H', W')$ represent the original and resized image dimensions, respectively. Together, these variables summarize how the pixel coordinate system has changed from the original image to the actual network input. The components have different geometric roles: $s_x$ and $s_y$ describe focal-length scaling, $o_x$ and $o_y$ describe crop-induced origin shifts, $p_x$ and $p_y$ describe padding-induced coordinate shifts, and image dimensions support coordinate normalization.
Given the preprocessed image $I'$, an encoder extracts intermediate feature maps $F$. MACM then computes two complementary representations. The first, $z_m$, is a geometry-aware descriptor obtained by encoding the preprocessing metadata. The second, $z_I$, is an image-conditioned descriptor derived from pooled visual features. Formally, the module computes

$$z_m = \phi_m(m), \qquad z_I = \phi_I(\mathrm{GAP}(F)), \qquad c = \phi_f([z_m; z_I]).$$

Here, $m$ denotes the preprocessing metadata vector, $F$ denotes the intermediate feature map extracted from the preprocessed image, $\mathrm{GAP}$ denotes global average pooling, and $\phi_m$, $\phi_I$, and $\phi_f$ denote the metadata encoder, image feature encoder, and fusion module, respectively. In our implementation, $\phi_m$ is a lightweight multilayer perceptron (MLP) applied to the normalized metadata vector, $\phi_I$ is a projection MLP applied after global pooling of intermediate features, and $\phi_f$ is a small fusion MLP applied after concatenation. The fused token $c$ is then passed to two linear heads that generate the channel-wise scale and bias terms used for feature modulation. This design keeps the module shallow and independent of a specific backbone architecture.
For reproducibility, we specify the MACM configuration used in the controlled experiments. The metadata input is a 10-dimensional normalized vector. The metadata encoder uses a two-layer MLP with dimensions , GELU activation, and layer normalization after the hidden layer. The image branch applies global average pooling to the intermediate feature map F and projects the pooled feature to a 128-dimensional token through a linear projection followed by GELU. The fusion module concatenates the metadata descriptor and the image descriptor and uses an MLP with dimensions . The modulation heads for the scale and bias terms are separate linear layers of size , producing channel-wise scale and bias parameters for the host feature map.
The metadata descriptor provides an explicit summary of how preprocessing altered the input geometry, whereas the image-based descriptor captures appearance-dependent context that may help interpret the practical effect of that transformation. Their fusion produces a compact camera-aware token, c.
This token is used to modulate the intermediate depth features through feature-wise affine adaptation:

$$F' = \gamma(c) \odot F + \beta(c).$$

Here, $F$ denotes the intermediate feature map before adaptation, $c$ denotes the fused camera-aware token defined above, $F'$ denotes the adapted feature map after modulation, and $\odot$ denotes element-wise multiplication. The functions $\gamma(\cdot)$ and $\beta(\cdot)$ generate channel-wise scaling and bias terms, respectively. The adapted feature map $F'$ is then passed to the original depth decoder to produce the final depth prediction $\hat{D}$.
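The data flow through the module can be sketched in plain Python. This is an illustrative toy, not the configuration used in the experiments: the class name `MACMSketch`, the helper functions, the layer widths (`hidden_dim=16`, `token_dim=8`), the random initialization, and the placement of layer normalization directly after the GELU of the hidden layer are all assumptions, and a practical implementation would use a tensor library. Feature maps are nested lists of shape channels x height x width.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]          # (weight rows, bias)
FeatureMap = List[List[List[float]]]         # channels x height x width


def init_linear(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Randomly initialized dense layer with zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(x: Vector, layer: Layer) -> Vector:
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x: Vector) -> Vector:
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class MACMSketch:
    """Toy metadata-conditioned feature modulation; sizes are placeholders."""

    def __init__(self, channels: int, meta_dim: int = 10, hidden_dim: int = 16,
                 token_dim: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.meta_in = init_linear(rng, meta_dim, hidden_dim)
        self.meta_out = init_linear(rng, hidden_dim, token_dim)
        self.image_proj = init_linear(rng, channels, token_dim)
        self.fuse_in = init_linear(rng, 2 * token_dim, hidden_dim)
        self.fuse_out = init_linear(rng, hidden_dim, token_dim)
        self.gamma_head = init_linear(rng, token_dim, channels)
        self.beta_head = init_linear(rng, token_dim, channels)

    def fused_token(self, features: FeatureMap, metadata: Vector) -> Vector:
        # Metadata branch: two-layer MLP with GELU and layer normalization.
        z_meta = linear(layer_norm(gelu(linear(metadata, self.meta_in))), self.meta_out)
        # Image branch: global average pooling, then a linear projection and GELU.
        pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
        z_image = gelu(linear(pooled, self.image_proj))
        # Fusion: concatenate both descriptors and pass them through a small MLP.
        return linear(gelu(linear(z_meta + z_image, self.fuse_in)), self.fuse_out)

    def __call__(self, features: FeatureMap, metadata: Vector) -> FeatureMap:
        c = self.fused_token(features, metadata)
        gamma = linear(c, self.gamma_head)   # one scale per channel
        beta = linear(c, self.beta_head)     # one bias per channel
        return [[[g * v + b for v in row] for row in ch]
                for ch, g, b in zip(features, gamma, beta)]
```

A call such as `MACMSketch(channels=3)(features, metadata)` returns a feature map with the same shape as its input, which is what allows the adapter to sit between an existing encoder and decoder without changing either interface.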
An important characteristic of MACM is that it does not require any additional sensor stream, multi-frame input, or external calibration refinement at inference time. The module uses information that is available in controlled preprocessing code, making it practical to integrate into existing monocular depth estimation systems. When metadata are missing in third-party or legacy pipelines, MACM requires either preprocessing logging or approximate metadata estimation; this assumption and its implications are discussed in the limitations. In this sense, MACM is not intended to replace the host model’s original camera handling mechanism; rather, it complements it by exposing preprocessing-induced geometric variation to the network in an explicit and learnable form.
The computational overhead of MACM is also limited. Since the metadata branch processes a low-dimensional vector and the image branch operates on globally pooled features, the additional cost is mainly a few MLP layers and channel-wise affine modulation. This is small compared with the encoder–decoder backbone. The consistency loss introduced below increases training-time cost because it uses two transformed views and inverse warping, but it is not required during single-image inference.
3.4. Integration with Existing Depth Models
MACM is designed as a lightweight plug-in adapter rather than a model-specific architectural redesign. The goal of this design is to enable broad applicability across different classes of monocular depth estimators while minimizing disruption to their original structure. In practice, we insert MACM between the encoder output and the depth prediction head, or at an equivalent intermediate stage in transformer-based backbones.
This placement is motivated by the fact that intermediate features retain sufficient spatial and semantic information to support depth prediction, while still being flexible enough to be modulated by camera-related conditioning. By adapting the feature distribution before decoding, MACM allows the downstream prediction head to operate on representations that are more consistent with the effective geometry of the transformed image. As a result, the host architecture can preserve most of its original parameters and design choices while gaining robustness to preprocessing-induced camera variation.
Because of this plug-in formulation, MACM can be attached to multiple model families. In explicit-camera models, the module can reduce discrepancies between the transformed image and the camera assumptions used internally by the network. In hybrid metric-depth models, it can provide an additional conditioning signal that improves camera consistency under changing preprocessing configurations. Even in calibration-agnostic or weakly camera-aware models, the module may still be beneficial because such models often encode implicit priors tied to training-time field-of-view distributions, crop conventions, or image statistics. This model-agnostic design makes MACM suitable for controlled comparison across different monocular depth estimation settings.
3.5. Training Objectives
The training objective combines the original depth supervision of the host model with an additional preprocessing consistency constraint. The primary depth term, denoted by $\mathcal{L}_{\text{depth}}$, follows the default training objective of the backbone model and may consist of regression, scale-aware, or ranking-based depth losses depending on the specific architecture. This choice preserves compatibility with the host estimator and avoids introducing a training objective tailored to only one model family.
To explicitly encourage robustness against preprocessing variation, we additionally introduce a multi-view preprocessing consistency loss. For a given image, two different preprocessing configurations, $T_a$ and $T_b$, are applied, yielding two transformed inputs and their corresponding depth predictions $\hat{D}_a$ and $\hat{D}_b$. Since both predictions originate from the same underlying scene, they should become mutually consistent once mapped back to a common coordinate frame. We therefore define
$$\mathcal{L}_{\text{cons}} = \left\lVert \mathcal{W}\!\left(\hat{D}_a, T_a^{-1}\right) - \mathcal{W}\!\left(\hat{D}_b, T_b^{-1}\right) \right\rVert$$
Here, $\hat{D}_a$ and $\hat{D}_b$ denote the depth predictions obtained under two different preprocessing configurations $T_a$ and $T_b$, respectively. The operators $T_a^{-1}$ and $T_b^{-1}$ denote the corresponding inverse geometric transforms that map each prediction back to a shared reference frame, $\mathcal{W}(\cdot)$ denotes the inverse warping operation, and $\lVert \cdot \rVert$ denotes the norm. Accordingly, $\mathcal{L}_{\text{cons}}$ measures the discrepancy between the two aligned depth predictions in the common coordinate system. Because this term compares two transformed views, it introduces additional training-time computation through an extra forward pass and inverse warping, but it does not alter the inference-time input-output interface.
The role of this consistency term is to discourage the network from producing depth estimates that are overly dependent on a particular crop or resize configuration. Instead, it encourages the learned representation to remain stable across multiple preprocessing realizations of the same image. This is particularly important in deployment, where the exact preprocessing chain may differ from the one implicitly assumed during model development. By enforcing agreement across transformed views, the model becomes less brittle to geometry-preserving image transformations and more robust to preprocessing-induced shifts in camera interpretation.
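A simplified sketch of the consistency term may help make the inverse-warping step concrete. The version below is an illustration under strong assumptions that are not taken from the paper: both preprocessing configurations are pure crops (so inverse warping reduces to placing each prediction back at its crop offset), the discrepancy is a mean absolute difference over the region covered by both views, and the helper names `unwarp_crop` and `consistency_loss` are hypothetical. Resizing would additionally require interpolation when mapping back to the reference frame.

```python
import numpy as np

def unwarp_crop(pred, top, left, ref_shape):
    """Place a crop-frame depth prediction back into the reference frame.

    Returns the reference-frame depth and a boolean mask of valid pixels.
    """
    depth = np.zeros(ref_shape)
    valid = np.zeros(ref_shape, dtype=bool)
    h, w = pred.shape
    depth[top:top + h, left:left + w] = pred
    valid[top:top + h, left:left + w] = True
    return depth, valid

def consistency_loss(pred_a, cfg_a, pred_b, cfg_b, ref_shape):
    """Mean absolute disagreement of two predictions on their shared region.

    cfg_a and cfg_b are (top, left) crop offsets in the reference frame.
    """
    depth_a, valid_a = unwarp_crop(pred_a, *cfg_a, ref_shape)
    depth_b, valid_b = unwarp_crop(pred_b, *cfg_b, ref_shape)
    overlap = valid_a & valid_b
    if not overlap.any():
        return 0.0  # no shared pixels, so nothing to compare
    return float(np.abs(depth_a - depth_b)[overlap].mean())

# Toy scene: a "perfect" predictor returns the true depth of each crop.
true_depth = np.arange(48, dtype=float).reshape(6, 8)
cfg_a, cfg_b = (0, 0), (1, 2)
pred_a = true_depth[0:5, 0:6]
pred_b = true_depth[1:6, 2:8]

loss_consistent = consistency_loss(pred_a, cfg_a, pred_b, cfg_b, true_depth.shape)
# A view-dependent bias of 0.5 in the second prediction is penalised.
loss_biased = consistency_loss(pred_a, cfg_a, pred_b + 0.5, cfg_b, true_depth.shape)
```

Restricting the comparison to the overlap mask matters: pixels visible in only one view carry no consistency signal and would otherwise be compared against padding zeros.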
The final objective is defined as
$$\mathcal{L} = \mathcal{L}_{\text{depth}} + \lambda \, \mathcal{L}_{\text{cons}}$$
Here, $\mathcal{L}$ denotes the final training objective, $\mathcal{L}_{\text{depth}}$ denotes the original depth supervision term of the host model, $\mathcal{L}_{\text{cons}}$ denotes the preprocessing consistency loss defined above, and $\lambda$ is a scalar weighting coefficient that controls the contribution of the consistency term. The coefficient of $\mathcal{L}_{\text{depth}}$ is implicitly fixed to one, so the objective may be read as $1 \cdot \mathcal{L}_{\text{depth}} + \lambda \, \mathcal{L}_{\text{cons}}$. This is a regularized training objective rather than a convex combination of two equally weighted tasks; therefore, the weights are not required to sum to one. This formulation keeps the training setup simple while directly targeting the consistency problem that motivates this study.
4. Experimental Results
This section presents empirical results corresponding to the analysis and mitigation framework introduced in Section 3. Rather than benchmarking full model performance alone, Section 4.1, Section 4.2 and Section 4.3 are designed as controlled sensitivity analyses that isolate how different forms of crop–resize geometry mismatch affect monocular depth prediction under explicit and structured preprocessing variations. Specifically, we examine the sensitivity of depth estimation to focal-length scaling errors, principal-point shifts, and more realistic combined crop–resize mismatch conditions, including partially corrected intrinsics. These experiments are intended to quantify not only whether mismatch degrades performance but also how the degradation pattern changes depending on the geometric source and severity of the inconsistency. Based on these observations, Section 4.4 evaluates the proposed Mismatch-Aware Camera Module (MACM) as a lightweight plug-in for improving robustness under both matched and mismatched preprocessing conditions. To broaden the evaluation, Section 4.5 further extends the study to recent SOTA-style baselines and cross-dataset fixed-protocol validation. These additional experiments are not intended as a full leaderboard retraining study; instead, they test whether the same crop–resize-induced mismatch pattern remains visible across representative model families and image distributions.
Unless otherwise stated, the controlled MACM experiments use a fixed network input resolution of , AdamW optimization with learning rate and weight decay , batch size 8, and . Reported standard deviations are computed over repeated subset-level evaluations using the same preprocessing protocol. These settings are included to make the adapter configuration and parameter choices reproducible, while the primary purpose remains sensitivity analysis rather than exhaustive leaderboard optimization.
4.1. Sensitivity to Focal Length Errors
We first examine the effect of resize-induced mismatch on focal-length scaling, which directly changes the mapping between pixels and camera rays in the controlled evaluation protocol. Because resize operations alter the effective focal lengths of the input image, using stale intrinsics under this condition primarily induces a near-global depth scale bias, and this effect is especially visible in explicit-K models whose predictions depend directly on calibrated ray geometry. Figure 3 shows the qualitative resize-mismatch behavior, and Table 1 reports the corresponding quantitative sensitivity across scale factors.
4.2. Sensitivity to Principal Point Shift
We next isolate the effect of crop-induced mismatch on principal-point alignment in order to analyze errors that arise even when focal-length scaling is unchanged. When the image is cropped but the principal point $(c_x, c_y)$ is not translated accordingly, the resulting inconsistency produces spatially varying depth distortion that is typically organized around an incorrect optical origin rather than a simple global scale shift. Figure 4 visualizes this spatially structured error pattern, and Table 2 summarizes the sensitivity across center-crop ratios.
4.3. Robustness Under Realistic Camera Inconsistency
We next examine a more realistic deployment regime in which crop and resize operations are applied sequentially. This combined crop+resize setting is particularly important in practice because it simultaneously introduces focal length scaling errors and principal-point misalignment, thereby affecting both global depth scale fidelity and local geometric consistency. As a result, it provides a stronger test of model robustness than the isolated resize-only or crop-only settings.
To further analyze this regime, we also consider an intermediate case in which the inference-time intrinsics are neither fully mismatched nor fully corrected, but only partially adjusted toward the geometrically consistent solution. Specifically, we define an interpolated intrinsic matrix as
$$K_{\alpha} = (1 - \alpha) \, K_{\text{stale}} + \alpha \, K_{\text{correct}}$$
where $K_{\text{stale}}$ denotes the stale intrinsics, $K_{\text{correct}}$ denotes the correctly updated intrinsics, and $\alpha \in [0, 1]$ controls the degree of correction. In this formulation, $\alpha = 0$ corresponds to the fully mismatched case, $\alpha = 1$ corresponds to the geometrically matched case, and the intermediate setting in Table 3 uses a value of $\alpha$ between these two extremes. This controlled interpolation allows us to examine whether performance changes abruptly or progressively as the effective camera definition approaches the correct one.
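The partial-correction setting can be illustrated with a short sketch. It assumes a linear, element-wise blend between the stale and corrected intrinsics, represented here as `(fx, fy, cx, cy)` tuples. The function name, the numeric intrinsics, and the choice of 0.5 as the intermediate value are all hypothetical and are not taken from the paper or from Table 3.

```python
def interpolate_intrinsics(k_stale, k_correct, alpha):
    """Blend two (fx, fy, cx, cy) tuples.

    alpha = 0 returns the stale intrinsics, alpha = 1 the corrected ones.
    Blending these four parameters equals blending the full 3x3 matrices,
    because the remaining entries (0, 0, 1) are identical in both.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple((1.0 - alpha) * s + alpha * c for s, c in zip(k_stale, k_correct))

# Hypothetical camera: a horizontal crop offset of 60 px followed by a 0.8x resize.
k_stale = (500.0, 500.0, 320.0, 240.0)    # intrinsics of the original image
k_correct = (400.0, 400.0, 208.0, 192.0)  # intrinsics after crop + resize

k_mismatched = interpolate_intrinsics(k_stale, k_correct, 0.0)
k_half = interpolate_intrinsics(k_stale, k_correct, 0.5)
k_matched = interpolate_intrinsics(k_stale, k_correct, 1.0)
```

Sweeping `alpha` over several values in such a sketch would trace the path between the two camera definitions along which the progressive or abrupt change in error can be measured.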
Taken together, these results provide a more practical view of deployment-time camera inconsistency. Rather than analyzing only fully matched or fully mismatched cases, this subsection shows how depth estimation performance evolves under more realistic conditions in which multiple preprocessing effects coexist and camera correction may be incomplete.
4.4. Mitigation via the Mismatch-Aware Camera Module
Based on the failure patterns observed in the previous subsections, we next evaluate whether the proposed Mismatch-Aware Camera Module (MACM) can directly mitigate preprocessing-induced camera inconsistency. While Section 4.1, Section 4.2 and Section 4.3 establish that crop–resize intrinsics mismatch systematically degrades monocular depth estimation and that performance progressively recovers as camera consistency is restored, the key remaining question is whether such degradation can be reduced by an explicit preprocessing-aware conditioning module. To answer this question, we perform an ablation study under both matched and mismatched preprocessing settings.
Table 4 compares the baseline model, partial variants of the proposed module, the full MACM, and the full model with the additional preprocessing consistency loss. The “Meta only” variant uses only the deterministic preprocessing metadata described in Section 3.3, whereas the “Image only” variant uses only image-conditioned camera cues derived from the intermediate feature representation. The full MACM combines both components, and the final variant further incorporates the consistency objective defined in Section 3.5. This design allows us to isolate the contribution of each component and examine whether their combination improves robustness beyond what can be achieved by either source alone.
The results show that the baseline model exhibits a clear performance drop when the preprocessing and intrinsics are mismatched, resulting in a robustness gap of 0.038 in Abs.Rel. Adding only metadata-aware conditioning reduces this gap to 0.027, indicating that explicit knowledge of crop, resize, and padding parameters already provides useful cues for compensating preprocessing-induced geometry changes. Using only image-aware conditioning also improves robustness, reducing the gap to 0.031, which suggests that appearance-dependent camera cues extracted from the feature representation can partially capture the practical effect of geometric inconsistency. When both components are combined in the full MACM, the gap further decreases to 0.021, while the matched-setting performance is also slightly improved. This result indicates that deterministic preprocessing metadata and image-conditioned cues play complementary roles in stabilizing depth prediction under crop–resize intrinsics mismatch.
Finally, adding the proposed preprocessing consistency loss yields the smallest robustness gap of 0.017 and the strongest overall performance under mismatched settings. This suggests that enforcing agreement across differently transformed views of the same image helps the model learn representations that are less sensitive to a particular preprocessing configuration. Overall, these findings support the main claim of this paper: explicitly incorporating preprocessing-aware camera information into the network is an effective and practical strategy for mitigating deployment time crop–resize intrinsics mismatch without changing the standard single-image inference pipeline.
Here, the robustness gap is defined as the mismatched Abs.Rel minus the matched Abs.Rel, so that a smaller value indicates stronger robustness to preprocessing-induced inconsistency. Although the absolute gains are moderate, the robustness gap decreases from 0.038 in the baseline to 0.017 with MACM + $\mathcal{L}_{\text{cons}}$, corresponding to an approximate 55% reduction. The mismatched Abs.Rel also improves from 0.141 to 0.114, supporting the claim that preprocessing-aware conditioning mitigates deployment-time intrinsics inconsistency.
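The arithmetic behind the reported reduction can be checked directly from the numbers quoted in the text. The short sketch below uses only those four values; the matched-setting scores it derives are implied by the gap definition (mismatched minus matched) rather than quoted from Table 4, and the variable names are illustrative.

```python
# Values quoted in the text (Abs.Rel, lower is better).
baseline_gap, macm_gap = 0.038, 0.017
baseline_mismatched, macm_mismatched = 0.141, 0.114

# Relative reduction of the robustness gap.
gap_reduction = (baseline_gap - macm_gap) / baseline_gap

# Matched-setting errors implied by gap = mismatched - matched.
baseline_matched = baseline_mismatched - baseline_gap
macm_matched = macm_mismatched - macm_gap

print(f"relative gap reduction: {gap_reduction:.1%}")
print(f"implied matched Abs.Rel: {baseline_matched:.3f} -> {macm_matched:.3f}")
```

The reduction evaluates to roughly 55%, in line with the figure stated above, and the implied matched-setting errors show that the gain is concentrated in the mismatched condition rather than coming from a uniform shift of both settings.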
To avoid overstating this gain, we interpret the result together with the repeated subset-level variability reported in the tables. When individual paired subset scores are available, the appropriate statistical validation is a paired comparison between the baseline and MACM-based variants under the same transformed images, using a paired t-test or a non-parametric Wilcoxon signed-rank test. Because the present study is designed as a controlled sensitivity analysis rather than a large repeated-run benchmark, we treat significance testing as supplementary evidence and report the remaining need for broader repeated-run validation in the limitations.
4.5. Extended SOTA and Cross-Dataset Validation
To strengthen the experimental evaluation, we add an extended fixed-protocol validation that compares recent state-of-the-art depth estimators and tests generalization across datasets. The goal is not to claim a new leaderboard ranking but to verify whether the same image–intrinsics inconsistency appears across representative model families when the image transform is held fixed.
The extended comparison includes explicit or canonical camera-aware metric models, image-calibrated metric models, relative-to-metric hybrid models, and strong general monocular depth baselines. UniDepth [4] and Metric3D v2 [16] represent camera-aware metric prediction, Depth Pro [10] and Depth Any Camera [6] represent recent zero-shot metric depth approaches with image- or camera-generalized calibration behavior, ZoeDepth [11] represents a relative-to-metric hybrid design, and Depth Anything V2 [17] and DPT-Large [2] represent strong general monocular depth baselines. All models are evaluated under the same combined crop + resize protocol. For models that explicitly consume intrinsics, the matched condition uses the updated intrinsics, whereas the mismatched condition uses the stale intrinsics. For models that do not directly consume K, the same transformed images are evaluated under the identical preprocessing protocol, and scale/shift alignment is used only for evaluation so that the comparison remains focused on robustness to preprocessing geometry rather than on absolute scale calibration.
Table 5 supports two observations. First, explicit camera-aware models show larger matched–mismatched gaps because stale intrinsics directly alter the ray geometry used by the model. Second, calibration-agnostic or image-calibrated models show smaller gaps, but the gap does not vanish because crop and resize still change field-of-view statistics, object scale, and spatial layout. MACM is therefore not positioned as a replacement for these SOTA models; rather, it targets the deployment consistency condition that remains necessary when image preprocessing and camera geometry are coupled.
To test whether this behavior is specific to one benchmark, we additionally apply the same combined crop + resize protocol to three validation domains: NYU-Depth v2 for indoor RGB-D scenes [18], KITTI for outdoor driving scenes [19], and ScanNet for indoor video-derived RGB-D scenes [20]. No dataset-specific retraining is performed in this cross-dataset check. Instead, the host baseline and the host model with MACM + $\mathcal{L}_{\text{cons}}$ are compared on the same transformed images in each dataset. The reported p-values are obtained from paired t-tests over five non-overlapping subset-level Abs.Rel scores, where each pair uses the same images and preprocessing transform. Thus, the statistical test measures whether the adapter reduces mismatched error under the same evaluation samples, not whether one dataset is easier than another.
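The paired comparison over five subset-level scores can be sketched with the standard library alone. The scores below are hypothetical placeholders, not values from the paper, and the function name is illustrative. Instead of computing a p-value, the sketch compares the statistic against the two-sided 5% critical value of Student's t distribution with four degrees of freedom (2.776).

```python
import math
import statistics

def paired_t_statistic(baseline, adapted):
    """Return the paired t statistic and degrees of freedom.

    Each pair must come from the same subset and preprocessing transform.
    """
    if len(baseline) != len(adapted) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(baseline, adapted)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1

# Hypothetical mismatched Abs.Rel scores on five non-overlapping subsets.
baseline_scores = [0.142, 0.139, 0.144, 0.140, 0.141]
adapted_scores = [0.115, 0.113, 0.116, 0.112, 0.114]

t_stat, dof = paired_t_statistic(baseline_scores, adapted_scores)
T_CRIT_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF4
```

Pairing removes between-subset variation from the comparison, which is why the test can be sensitive even with only five pairs; with so few samples, a Wilcoxon signed-rank test is a natural non-parametric cross-check.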
The cross-dataset results in Table 6 provide stronger evidence for generalization than the previous single-dataset analysis. The absolute error levels differ across datasets because indoor RGB-D, outdoor driving, and video-derived indoor scenes have different depth ranges, camera configurations, and scene statistics. Nevertheless, the direction of the effect is consistent: the mismatched Abs.Rel decreases after introducing MACM and the consistency objective, and the paired tests yield small p-values under the fixed subset protocol. This supports the claim that crop–resize-induced intrinsics mismatch is not merely a dataset-specific artifact. At the same time, because this experiment uses fixed-protocol evaluation rather than full retraining of every SOTA model on every dataset, we interpret the result as cross-dataset robustness validation rather than as a comprehensive SOTA leaderboard.
4.6. Component-Level and Computational Validation
We further decompose the metadata vector defined in Section 3.3 into semantically meaningful component groups. The scale group mainly describes focal-length scaling, the crop-offset group describes principal-point displacement, the padding group captures coordinate shifts introduced by letterboxing, and the dimension group supports normalization across image sizes. Each variant uses the same host model and the same training protocol, but removes all metadata components outside the selected group. Table 7 reports the resulting component-level ablation.
The component ablation indicates that crop-offset metadata is particularly important under combined mismatch because principal-point displacement creates spatially structured errors that cannot be represented by focal scaling alone. Scale metadata remains useful for reducing global bias, while padding and image-dimension metadata mainly stabilize normalization across aspect ratio changes. The full metadata branch outperforms any single group, and the image-conditioned branch further improves robustness by capturing residual camera cues that are not fully described by explicit preprocessing parameters.
Finally, we report the computational cost of the proposed adapter separately from the training-only consistency loss. This separation is important because the consistency loss $\mathcal{L}_{\text{cons}}$ requires two transformed views during training, whereas inference uses only a single preprocessed image and its metadata. Table 8 summarizes the estimated computational overhead.
These additional results support the intended design of MACM as a lightweight adapter. The inference overhead is limited to low-dimensional metadata encoding, pooled-feature projection, and channel-wise affine modulation. The larger cost of the consistency objective is confined to training because it requires an additional transformed view and inverse warping. Therefore, the proposed mitigation can improve robustness to deployment time intrinsics inconsistency without changing the standard single-image inference pipeline.
5. Discussion and Limitations
The proposed formulation addresses a specific but practically important source of error: the loss of consistency between a preprocessed image and the intrinsics used during inference. This setting differs from full unknown-camera learning. In our analysis, a nominal camera matrix is available, but it becomes stale because crop, resize, or padding operations are not propagated to the effective intrinsics. In real systems, this deterministic mismatch can coexist with broader uncertainty sources such as calibration noise, missing metadata, sensor noise, temporal motion, or dataset shift. Modeling these factors jointly would require multi-dimensional uncertainty recognition and quantification beyond the controlled protocol used here.
The availability of preprocessing metadata is another practical assumption. In a controlled deployment pipeline, crop offsets, resize factors, padding offsets, and image dimensions are available directly from preprocessing code. However, third-party libraries, legacy systems, or undocumented image pipelines may not expose these values. In such cases, robust deployment requires explicit metadata logging, synchronized intrinsics updates, or approximate metadata estimation. The image-conditioned branch of MACM may provide complementary cues when metadata are incomplete, but it should not be viewed as a substitute for correct camera geometry bookkeeping.
A practical mitigation strategy is therefore to treat preprocessing as part of the camera pipeline rather than as an isolated image transform. The preprocessing function should return both the transformed image and a metadata record containing crop offsets, resize scales, padding offsets, and output dimensions. The same record can be used to update K, construct the MACM metadata vector, and audit whether inference uses a matched or stale camera definition. When metadata cannot be recovered exactly, approximate metadata estimation should be reported as an uncertainty source rather than treated as an exact correction.
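The bookkeeping strategy described above can be sketched as follows. This is a geometry-only illustration under stated assumptions: the pipeline order is crop, then resize, then symmetric padding; the principal point is scaled directly by the resize factor (a convention that ignores half-pixel offsets); and all class and function names (`Intrinsics`, `PreprocMeta`, `plan_preprocessing`, `update_intrinsics`, `is_stale`) are hypothetical. In a real pipeline the same function would also return the transformed image.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Intrinsics:
    fx: float
    fy: float
    cx: float
    cy: float

@dataclass(frozen=True)
class PreprocMeta:
    crop_left: int
    crop_top: int
    scale_x: float
    scale_y: float
    pad_left: int
    pad_top: int
    out_width: int
    out_height: int

def plan_preprocessing(crop_box, resize_to, pad_to):
    """Build the metadata record for crop -> resize -> centered padding.

    crop_box is (left, top, width, height); resize_to and pad_to are (w, h).
    """
    left, top, crop_w, crop_h = crop_box
    new_w, new_h = resize_to
    out_w, out_h = pad_to
    return PreprocMeta(
        crop_left=left, crop_top=top,
        scale_x=new_w / crop_w, scale_y=new_h / crop_h,
        pad_left=(out_w - new_w) // 2, pad_top=(out_h - new_h) // 2,
        out_width=out_w, out_height=out_h,
    )

def update_intrinsics(k, meta):
    """Propagate crop, resize, and padding to the intrinsics."""
    return Intrinsics(
        fx=k.fx * meta.scale_x,
        fy=k.fy * meta.scale_y,
        cx=(k.cx - meta.crop_left) * meta.scale_x + meta.pad_left,
        cy=(k.cy - meta.crop_top) * meta.scale_y + meta.pad_top,
    )

def is_stale(k_used, k_original, meta, tol=1e-6):
    """Audit check: does k_used differ from the intrinsics implied by meta?"""
    expected = update_intrinsics(k_original, meta)
    return any(abs(a - b) > tol for a, b in zip(astuple(k_used), astuple(expected)))

# Hypothetical 640x480 camera, center-cropped to 480x480, resized to 384x384,
# then padded horizontally to a 416x384 tensor.
k_orig = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
meta = plan_preprocessing((80, 0, 480, 480), (384, 384), (416, 384))
k_new = update_intrinsics(k_orig, meta)
```

Because the metadata record is the single source for both the intrinsics update and the audit check, a pipeline built this way cannot silently transform the image without leaving a trace that `is_stale` can detect.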
Finally, the experiments remain controlled sensitivity analyses, although the evaluation includes extended SOTA-style baselines, repeated subset-level standard deviations, paired statistical testing, component-level metadata ablations, and cross-dataset fixed-protocol validation. These additions strengthen the empirical support for the proposed mitigation strategy while preserving the intended scope of the paper. The remaining limitation is that the study does not perform full retraining and hyperparameter tuning for every SOTA model on every dataset, nor does it validate the method inside all possible real deployment pipelines.
6. Conclusions
In this work, we investigated crop-resize intrinsics mismatch as a practical but often overlooked source of error in monocular metric depth estimation. Our analysis showed that when image preprocessing changes the effective image geometry but camera intrinsics are not updated accordingly, depth predictions can degrade substantially even when the underlying model remains unchanged. In particular, focal length inconsistency mainly introduces global scale errors, whereas principal-point inconsistency leads to spatially structured distortions in the predicted depth. This distinction supports the need to characterize mismatch properties rather than treating all preprocessing changes as a single generic perturbation.
To address this issue, we formalized the geometric relationship between affine image preprocessing and effective camera intrinsics and showed that standard transformations such as crop, resize, and padding admit a deterministic update rule for synchronizing image geometry and calibration parameters. Based on this formulation, we designed a controlled evaluation protocol that isolates the effect of preprocessing-induced mismatch from other model- and dataset-specific factors.
We further presented the Mismatch-Aware Camera Module (MACM) as a practical mitigation strategy for cases where perfect preprocessing consistency is difficult to guarantee. By jointly exploiting metadata-aware and image-aware cues, MACM improves robustness under mismatched conditions while remaining compatible with existing camera-aware depth estimation pipelines. The ablation results further show that MACM consistently narrows the robustness gap between matched and mismatched preprocessing conditions and that the additional consistency objective provides the strongest mitigation effect. In particular, the robustness gap is reduced from 0.038 to 0.017 in the reported ablation, while mismatched Abs.Rel improves from 0.141 to 0.114. Overall, our findings emphasize that the image and its intrinsics should be treated as a coupled representation throughout the full preprocessing and inference pipeline. The extended baseline and cross-dataset validation further indicate that this consistency requirement persists across representative model families and image distributions, although broader, real-pipeline validation remains a useful direction for future work.