Article

MDF-YOLO: A Hölder-Based Regularity-Guided Multi-Domain Fusion Detection Model for Indoor Objects

School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Fractal Fract. 2025, 9(10), 673; https://doi.org/10.3390/fractalfract9100673
Submission received: 2 September 2025 / Revised: 3 October 2025 / Accepted: 14 October 2025 / Published: 18 October 2025

Abstract

With the rise of embodied agents and indoor service robots, object detection has become a critical component supporting semantic mapping, path planning, and human–robot interaction. However, indoor scenes are often characterized by severe occlusion, large-scale variations, small and densely packed objects, and complex textures, so existing methods struggle to remain both robust and accurate. This paper proposes MDF-YOLO, a multi-domain fusion detection framework guided by Hölder regularity. In the backbone, neck, and feature recovery stages, the framework introduces the CrossGrid Memory Block (CGMB), the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation (HG-HCA) module, and the Frequency-Guided Residual Block (FGRB), achieving complementary feature modeling across the state space, spatial domain, and frequency domain. In particular, the HG-HCA module uses a Hölder regularity map as a guiding signal to dynamically balance the macro and micro paths, thus achieving adaptive coordination between global consistency and local discriminability. Experimental results show that MDF-YOLO significantly outperforms mainstream detectors on mAP@0.5, mAP@0.75, and mAP@0.5:0.95, reaching 0.7158, 0.5814, and 0.6117, respectively, while maintaining near real-time inference efficiency in terms of FPS and latency. Ablation studies further validate the independent and synergistic contributions of CGMB, HG-HCA, and FGRB in improving small-object detection, occlusion handling, and cross-scale robustness. This study demonstrates the potential of Hölder regularity and multi-domain fusion modeling in object detection, offering new insights for efficient visual modeling in complex indoor environments.

1. Introduction

With the rapid progress of artificial intelligence and robotics, embodied agents, that is, intelligent systems capable of autonomously perceiving their surroundings, making decisions, and executing actions, are increasingly integrated into daily life [1,2]. In indoor environments, devices such as service robots and cleaning robots are widely deployed, relying heavily on accurate visual perception to support tasks such as autonomous navigation, human–robot interaction, and scene understanding [3,4]. Within this framework, object detection plays a central role by identifying and localizing objects, thereby providing critical spatial and semantic information for downstream modules including semantic map construction [5,6], path planning [7,8], and task-level decision-making [9]. Thus, developing indoor object detection algorithms that are both highly accurate and robust is crucial for improving the overall reliability and effectiveness of embodied agents.
Nevertheless, the inherent complexity of indoor environments poses substantial challenges for object detection [10]. First, objects are often subject to severe occlusion and clutter, fragmenting their features and demanding strong global reasoning ability from models [11]. Second, the dramatic scale variation from large pieces of furniture to small tabletop items requires effective multi-scale representation and fusion [12]. Third, the recognition of small objects depends heavily on high-frequency details such as edges and textures, which are frequently degraded or lost during the down-sampling process of deep convolutional networks, leading to missed or false detections [13]. Consequently, addressing global contextual reasoning, scale adaptation, and fine-detail preservation simultaneously remains the central difficulty in indoor object detection.
To address the challenges of general-purpose object detection, detectors centered around Convolutional Neural Networks (CNNs) have undergone continuous evolution over the past decade [14]. Among these, single-stage detectors have become dominant in real-time applications due to their streamlined and efficient design, and their development has consistently sought a balance between accuracy and efficiency. This evolution spans from YOLO, which unified detection as a regression problem [15], and RetinaNet, which introduced Focal Loss to effectively resolve the extreme imbalance between positive and negative samples [16], to the rise of anchor-free methods like FCOS [17], with the performance and design paradigms of single-stage detectors continuing to advance. Concurrently, driven by the demand for mobile and edge computing, lightweight backbones such as MobileNet and ShuffleNet have been extensively researched and applied [18], giving rise to detectors like EfficientDet, which exhibit outstanding performance on resource-constrained platforms [19]. Despite their widespread success in object detection, CNNs inherently operate within limited receptive fields, introducing a local inductive bias [20]. This means that CNNs are highly effective at capturing fine-grained local features such as edges and textures, but they struggle to model long-range dependencies between distant regions in an image; for example, recognizing the association between a table surface and its legs when they appear far apart. While this bias improves efficiency in learning local patterns, it hampers the ability of CNN-based detectors to perform holistic reasoning in complex indoor scenes, thereby reducing their robustness to severe occlusion and scale variation.
To overcome the locality bottleneck of CNNs, vision Transformer architectures, epitomized by ViT, have emerged. Their core self-attention mechanism is highly effective at capturing global context. Building upon this, the DETR framework reformulated object detection as an end-to-end set prediction problem [21]. However, the powerful global modeling capability of Transformers comes at a significant computational cost. The complexity of their self-attention mechanism grows quadratically with input resolution, leading to prohibitive training and inference overheads at high resolutions [22], making it difficult to meet the stringent real-time requirements of indoor applications.
In response to the aforementioned challenges, current research trajectories primarily encompass several key directions. On one hand, researchers are dedicated to exploring more efficient mechanisms for modeling long-range dependencies. These range from various novel attention modules to the recently prominent State Space Models (SSMs) in the vision domain [23,24], all aiming to achieve global contextual awareness while maintaining linear computational complexity. On the other hand, the academic community is investigating more sophisticated multi-scale feature interaction and fusion modules. This includes enhancing key feature representations through techniques like attention mechanisms [25] or incorporating multi-modal information, such as 3D point clouds, to aid in understanding complex spatial relationships [26]. Furthermore, approaching the problem from novel perspectives, such as the frequency domain, by explicitly compensating for features to recover details lost during network propagation, has also become an effective strategy for improving the performance of small object detection [27].
Based on the preceding analysis, this paper proposes MDF-YOLO, a multi-domain fusion object detector tailored for indoor scenes. Our objective is to systematically address the core challenges of indoor detection through a synergistically optimized solution that achieves performance breakthroughs in global context modeling, multi-scale fusion, and detail recovery, all while preserving real-time efficiency. The primary contributions of this paper are as follows:
  • We design a novel backbone network module named the CrossGrid Memory Block (CGMB). This module integrates state-space modeling and local convolutional pathways in parallel, capturing global context and long-range spatial dependencies with linear complexity via an orthogonal grid memory mechanism, thereby effectively addressing issues of severe occlusion.
  • We propose the Hölder-Based Regularity Guidance-Hierarchical Context Aggregation (HG-HCA) module. This module integrates macro- and micro-context pathways under the guidance of a Hölder-based regularity prior, where a differentiable regularity map is computed to characterize local smoothness and roughness. Through a lightweight calibration mechanism, the regularity map is transformed into task-oriented per-pixel guidance signals. This design enables the network to dynamically balance global structural consistency and local discriminative detail, thereby improving robustness against large-scale variations, complex textures, and dense occlusion in indoor scenes.
  • We introduce an up-sampling module named the Frequency-Guided Residual Block (FGRB). This module augments the spatial up-sampling path with a parallel frequency-domain compensation path. It enhances high-frequency components using a learnable frequency weight matrix and restores image details in a residual manner, thereby improving detection performance for small objects.
Extensive experimental results on a public indoor scene dataset demonstrate that MDF-YOLO surpasses existing mainstream methods in both accuracy and efficiency, validating the effectiveness and novelty of the proposed multi-domain fusion-based object detection network.

2. Related Work

2.1. Traditional Rule-Based and Handcrafted Feature Methods

Prior to the widespread adoption of deep learning, research in indoor object detection was predominantly characterized by sensor-driven geometric rule-based methods and computer vision approaches based on handcrafted features. Sensor-based methods utilized geometric information acquired from ultrasound, infrared, and laser rangefinders to ensure motion safety through local avoidance and path correction. The ultrasonic obstacle avoidance framework by Borenstein and Koren achieved stable avoidance under low computational constraints through continuous rangefinding and wall-following strategies, suppressing oscillatory behaviors in “trap regions” via vector field synthesis [28]. The occupancy grid map proposed by Moravec and Elfes introduced a grid-based environmental representation with a Bayesian update mechanism, enabling recursive estimation of spatial occupancy probabilities under noisy rangefinding conditions and providing a unified representation for indoor mapping and obstacle perception [29]. The probabilistic robotics framework by Thrun et al. integrated localization, mapping, and decision-making into a unified Bayesian inference framework, maintaining global consistency amidst multi-sensor noise and dynamic occlusions, thereby supporting early indoor mobile robots in reasoning about spatial accessibility and safety [30]. This line of research offered the advantages of high real-time performance and implementation simplicity at the geometric level, but it lacked the capability to characterize object categories, semantic attributes, and fine-grained boundaries, thus failing to meet the demands of subsequent indoor detection tasks.
As image sensors and computational platforms matured, the research focus shifted towards a detection paradigm composed of manually designed visual features and shallow discriminative classifiers. Lowe’s SIFT performed keypoint detection in scale-space and represented appearance using orientation-aligned local gradient histograms, demonstrating robustness to viewpoint and illumination changes and establishing the foundation for recognition based on local invariant features [31]. Bay et al.’s SURF approximated Hessian responses using integral images and box filters, substantially reducing descriptor computation costs and significantly improving real-time performance while maintaining robustness [32]. The cascaded Boosting framework by Viola and Jones utilized Haar-like features and a cascade of weak classifiers to achieve high-speed sliding-window detection, attaining near real-time performance on rigid objects like faces and propelling early embedded vision applications [33]. Dalal and Triggs’ HOG eliminated illumination effects through gradient statistics normalized within cells and blocks, enhancing pedestrian detection recall and precision via a linear SVM and becoming a standard configuration for sliding-window detection [34]. The Deformable Part Model by Felzenszwalb et al. performed discriminative training under geometric deformation constraints between a root template and part templates, significantly enhancing detection performance for non-rigid objects and under occlusion, and showing strong adaptability to intra-class variations in complex indoor layouts [35]. In subsequent work, Dalal, Triggs, and Schmid jointly modeled appearance gradients and optical flow orientation histograms, which markedly improved the robustness of human detection in video sequences under dynamic scenes, showcasing the complementary benefits of appearance and motion cues [36]. This approach could provide interpretable mid-level representations and robust discriminative boundaries on indoor images, but it still faced significant bottlenecks in cross-domain generalization, complex occlusion, and extreme scale variations.
The popularization of RGB-D sensors led to the widespread adoption of 3D geometric features in indoor object recognition and localization, with traditional methods representing and matching targets using local surface geometry or multi-modal templates. The Spin Image by Johnson and Hebert formed a 2D projection histogram from surface normals and projection distances, achieving high-recall matching of 3D parts in cluttered and partially occluded scenes and providing a reusable descriptor for rigid instance recognition in indoor environments [37]. Rusu et al.’s FPFH computed fast normal-related statistics within the k-neighborhood of a point cloud and reduced coupling term overhead, achieving an excellent efficiency-accuracy trade-off in large-scale point cloud feature matching and pose estimation tasks, well-suited for the real-time requirements of indoor recognition and registration [38]. Hinterstoisser et al.’s LINEMOD employed multi-modal template matching based on gradient orientation and surface normals to achieve rapid detection and pose estimation of texture-less objects. It maintained a stable detection rate in indoor scenes with strong occlusion, high reflectance, and cluttered backgrounds, demonstrating practical engineering applicability [39]. These 3D and multi-modal traditional methods possess unique advantages in structure recovery and pose estimation, but they still rely on meticulous design and extensive parameter tuning for cross-view generalization, template scale management, and inter-class separability.
Traditional methods based on rules and handcrafted features established a technological trajectory for indoor object detection, progressing from geometric obstacle avoidance to semantic discrimination and 3D matching, and could achieve reliable performance under controlled conditions and for specific targets. However, these methods, relying on human priors and fixed-structure representations, are prone to performance degradation when confronted with occlusion, multi-scale challenges, and appearance diversity in indoor environments, which prompted a research shift towards detection paradigms that are automatically learnable and end-to-end optimizable.

2.2. General-Purpose Deep Learning Models

Deep learning has catalyzed a paradigm shift in object detection from handcrafted features and shallow classifiers to end-to-end learnable representations. R-CNN combined region proposal generation with CNN feature learning, significantly elevating detection accuracy on general-purpose datasets and establishing a baseline for deep feature-centric detection [40]. SPP-Net extracted regional features from a shared feature map via spatial pyramid pooling, circumventing the expense of redundant forward passes over the entire image; although its training pipeline was not fully end-to-end, its shared-feature design laid the groundwork for subsequent unified detection frameworks [41]. Fast R-CNN introduced RoI Pooling and a multi-task loss to jointly optimize bounding box classification and regression on a shared feature map, realizing efficient end-to-end training (with the exception of region proposal generation) and striking a better balance between accuracy and efficiency [42]. Faster R-CNN, through the synergy of a Region Proposal Network and a detection network, integrated region proposal generation into the main network, thus unifying the two-stage detection paradigm and enabling joint training, which solidified its mainstream position for high-accuracy detection [43]. Mask R-CNN augmented the two-stage architecture with an additional branch for simultaneous instance segmentation, improving spatial localization precision with an aligned RoI operation while maintaining excellent extensibility [44].
In the pursuit of real-time performance, single-stage detectors replaced the region proposal mechanism with dense prediction. SSD performed direct classification and regression on multi-scale feature maps, balancing speed and accuracy to become a common baseline for mobile deployments [45]. The YOLO series formulated detection as a unified regression problem and, based on this framework, significantly improved stability and accuracy through enhancements such as anchor boxes and batch normalization [46]. It strengthened its coverage of differently sized objects via multi-scale prediction and deeper backbone networks [47], with subsequent work further refining feature aggregation and training strategies to enhance robustness and usability [48]. The issue of extreme positive–negative sample imbalance in single-stage frameworks was mitigated by the Focal Loss, which improved detection performance on small and hard examples while maintaining high efficiency [16]. In terms of model scalability, EfficientDet employed a compound scaling strategy coupled with a high-efficiency bidirectional feature pyramid, achieving an integrated design scalable from resource-constrained devices to server-side applications and demonstrating outstanding performance in the accuracy–efficiency trade-off [49].
General-purpose detectors have achieved a significant leap in performance over traditional methods on public datasets, yet their design philosophy is often geared towards generic scenarios. In indoor environments, the local inductive bias of convolution limits the ability for global reasoning on severely occluded and cluttered objects. Down-sampling and high-level abstraction lead to the attenuation of high-frequency edges and textures during propagation, thereby impairing the discriminability of small objects. Although multi-scale feature fusion can alleviate the semantic gap caused by scale variations, it remains insufficient for the extreme scale distributions found indoors, from large furniture to minute items. These issues have become recognized challenges in indoor detection, driving research to undertake targeted explorations in the directions of global context modeling, cross-scale fusion, and detail recovery [12,13].

2.3. Explorations Targeting Challenges in Indoor Scenes

While general-purpose detectors have established a strong technical baseline, their bottlenecks in indoor applications have become increasingly apparent. Three issues dominate in this setting—severe occlusion, extreme scale variation, and the loss of fine details—which have in turn stimulated extensive research along three directions: global context modeling, cross-scale semantic alignment, and detail restoration. These lines of work offer diverse technical paths for tackling the key difficulties of indoor scenes, yet they also reveal several common limitations in terms of efficiency, stability, and deployability.
In global context modeling under occlusion and crowding, relational reasoning and structural priors are widely used to compensate for the locality of convolutions. Relation Networks reweight the relationships among candidate regions to alleviate semantic loss and improve localization consistency, but region-level relation graphs incur non-negligible computational and memory overhead in dense scenes and are sensitive to the quality of candidate generation [50]. Occlusion-aware R-CNN combines aggregation losses with part visibility modeling to reduce missed detections under group occlusions; however, part-level annotations and visibility labels increase data production costs, and robustness remains limited under cross-category transfer [51]. Context-Aware CompositionalNets enhance interpretability under occlusion through part–whole compositional representations, yet they rely on relatively stable structural priors and adapt less effectively to non-rigid objects or categories with large appearance variation [52]. At the post-processing level, Soft-NMS replaces hard suppression with continuous score decay to improve recall and precision in crowded scenes, but it cannot fundamentally remedy the insufficient characterization of long-range dependencies at the feature level [53]. Overall, this line of work has direct value for complex indoor stacking, but introducing efficient long-range dependencies while preserving 2D topology remains a central challenge.
In cross-scale variation and semantic alignment, feature pyramids and scale-aware training continue to evolve to cover the span from large furniture to tiny devices. ASFF learns spatially adaptive fusion weights to ease spatial misalignment across feature levels and can enhance scale robustness with low additional cost, yet it still tends to favor particular levels under extremely long-tailed scale distributions [54]. NAS-FPN discovers efficient topologies in the space of cross-level connections via neural architecture search, unifying top-down and bottom-up information flows; however, search costs and hardware dependencies are substantial, and practical deployment requires rebalancing complexity and gains across platforms [55]. TridentNet employs multi-branch convolutional paths with different receptive fields and scale-aware training to improve stability at extreme scales; although branches can be pruned at inference to control cost, multi-branch training still brings significant memory and optimization burdens [56]. At the proposal stage, Guided Anchoring reduces invalid candidates with learnable priors and position guidance and is more friendly to objects with diverse aspect ratios, but dependence on anchor assumptions and data distributions persists [57]. In summary, efficient and stable cross-level semantic alignment remains challenging on indoor datasets with long-tailed scales and diverse shapes, and the balance between complexity and benefit is not consistent across resolutions and hardware.
In detail preservation and small-object visibility, upsampling operators and frequency-domain attention are used to restore high-frequency textures and edges, and they complement geometric modalities. CARAFE adopts content-aware reassembly kernels that are more favorable to boundaries and textures than fixed interpolation or transposed convolution, but the extra operators and buffering required for kernel prediction must be carefully evaluated for real-time scenarios [58]. FcaNet generalizes channel attention compression to the frequency domain and introduces multi-spectrum attention to enhance high-frequency responses for detail-sensitive categories; however, alignment between frequency enhancement and task features still relies on heuristic design, and controlling edge artifacts and ringing effects requires additional regularization and implementation care [59]. Along the multimodal route, VoteNet and its extensions perform 3D detection via spatial voting in point clouds and hybrid geometric primitives; ImVoteNet aligns image-space votes with camera parameters to improve detection of occluded and small objects; H3DNet and 3DETR demonstrate competitiveness under 3D priors and end-to-end Transformer designs [60,61,62,63]; Deep Sliding Shapes earlier established a feasible paradigm for amodal 3D object detection in RGB-D scenes through voxelized 3D convolutions and a 3D RPN [64]. This route shows clear advantages under indoor conditions such as low light, reflective materials, and near-field occlusion, yet in practice it imposes higher requirements on depth-sensor availability, cross-modal calibration, and inference latency, limiting deployment on some platforms.
Beyond the above threads, explicit modeling of local regularity has attracted growing attention. Differentiable measures of local regularity can provide a complementary signal to saliency or attention mechanisms by distinguishing texture-smooth regions from edge/detail-rich regions, and by offering pixel-wise lightweight references for global consistency constraints and the strengthening of local discrimination. Compared with data-driven attention learning, such regularity priors supply interpretable spatial guidance at low additional cost, and show potential in indoor scenes where occlusion, scale variation, and competition for high-frequency details coexist [65]. Although existing work offers effective paths for global context, cross-scale fusion, and multi-domain detail restoration, several challenges remain prevalent: first, introducing long-range dependencies often conflicts with preserving 2D topology, and computational and memory costs grow rapidly with scene density; second, cross-level semantic alignment lacks efficiency and stability under long-tailed scales and diverse shapes, and the complexity–benefit trade-off is hard to standardize across platforms; third, systematic approaches to aligning high-frequency detail compensation with task features are still lacking, and frequency enhancement or improved upsampling is sensitive to artifact control and implementation details; fourth, multimodal and 3D solutions face higher barriers in device availability and deployment latency. Based on these shared gaps, subsequent work needs to coordinate global modeling, cross-scale alignment, and detail restoration within a unified and efficient framework, and introduce lightweight and learnable local-regularity priors as spatial guidance, so as to improve robustness and generalization in indoor scenes while maintaining real-time performance.

3. Methodology

3.1. MDF-YOLO Network Architecture

Indoor object detection tasks, particularly under severe occlusion, large-scale variations, and dense distributions of small objects, impose higher demands on the multi-level feature modeling capacity of detectors. In recent years, the YOLO family of detectors has become the standard for real-time detection owing to its single-stage design, end-to-end regression paradigm, and well-recognized balance between speed and accuracy. Among them, YOLOv8 features a mature modular design of backbone, neck, and detection head, achieving a strong trade-off between accuracy and inference efficiency on generic benchmarks, while also offering clear advantages in engineering deployment and lightweight extensibility. Consequently, it has been widely adopted as the baseline architecture for subsequent improvements. Despite its competitive performance on large-scale datasets, YOLOv8 still relies heavily on convolutional operations in the backbone, limiting its ability to capture long-range dependencies; its neck fusion strategy lacks a dynamic balance between macro- and micro-contexts; and its up-sampling operators often incur a loss of high-frequency details during feature recovery. To address these interrelated challenges, we propose Multi-Domain Fusion YOLO (MDF-YOLO), which introduces the CGMB, HG-HCA module, and FGRB into the backbone, neck, and high-resolution recovery stages, respectively. Together, these modules establish complementary feature modeling across the state-space, spatial, and frequency domains. This design enhances global–local collaborative reasoning and ensures consistency between structural layouts and fine-grained details, outperforming alternative approaches that rely on a single paradigm by achieving robust and accurate detection under real-time constraints. The overall architecture is illustrated in Figure 1.
At the Backbone stage, the network is embedded with the CrossGrid Memory Block (CGMB). This module unfolds the input feature map into one-dimensional sequences along orthogonal directions and employs the Mamba state-space model to achieve long-range dependency modeling, while concurrently retaining a local convolutional path to extract fine-grained textures. The Orthogonal Grid Memory Modeling (OGMM) mechanism can capture complementary pixel relationships in both horizontal and vertical directions and explicitly incorporates spatial information through a position enhancement layer, thereby maintaining semantic consistency and strong spatial awareness at both local and global levels.
In the Neck stage, we introduce the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation (HG-HCA) Block. Unlike traditional, purely data-driven fusion methods, the HG-HCA Block is dynamically modulated on a per-pixel basis by a multifractal prior in both its macro-context aggregation path and its micro-context refinement path. Its core lies in the Hölder prior head (HP-head), which performs a lightweight, differentiable approximation of the Hölder exponent on the input features to generate a local regularity map. This map characterizes the smoothness and roughness of different regions and is transformed into a task-relevant guidance signal via a learnable calibration layer. The macro-path leverages this guidance to strengthen long-range dependency modeling in regular regions, maintaining the consistency of the overall scene layout. Conversely, the micro-path, through directionally sensitive convolution and a dual-attention mechanism, enhances the expression of details in texturally complex regions under the modulation of the fractal prior. Finally, the outputs of both paths undergo deep feature reorganization and residual fusion via a C2f module, achieving a dynamic balance between global and local features.
In the up-sampling and high-resolution feature generation stage, the network employs the Frequency-Guided Residual Block (FGRB). This module comprises a spatial up-sampling path and a frequency-domain compensation path. The main path preserves the global structure through bilinear interpolation and convolution. The frequency path maps features to the frequency domain, enhances high-frequency components using a learnable frequency weight matrix, and restores details via transposed convolution, before being finally fused with the main path in a residual manner. This design effectively compensates for the high-frequency information loss caused by interpolation during up-sampling, providing a richer feature representation for small objects and texturally complex regions.
Through synergistic optimization across the state-space domain (CGMB), the Hölder-based prior-guided spatial domain (HG-HCA), and the frequency domain (FGRB), MDF-YOLO establishes a complementary mechanism for feature modeling and fusion. This enables the network to simultaneously maintain strong robustness and high detection accuracy in complex indoor environments.

3.2. CrossGrid Memory Block

In indoor object detection, targets are often subject to severe occlusion and complex stacking, where local textures and geometric boundaries are easily damaged or weakened during feature extraction. This requires detectors to simultaneously capture both local details and long-range dependencies. Traditional convolutional neural networks rely on limited receptive fields and thus face inherent limitations in modeling global spatial relationships, often leading to missed or false detections under occlusion and incomplete observations. Existing State Space Models (SSMs) in vision tasks typically flatten the input and map a two-dimensional image into a one-dimensional sequence for processing. However, this operation destroys the original spatial topology, making it easy to lose spatial consistency in complex indoor layouts, thereby limiting their applicability in detection tasks.
Based on these considerations, we have designed the CrossGrid Memory Block (CGMB), which aims to preserve both the spatial inductive bias of local convolutions and the long-range dependency awareness of state-space modeling, without sacrificing computational efficiency. At its core is the Orthogonal Grid Memory Modeling (OGMM) mechanism, which explicitly maintains the two-dimensional topological structure by unfolding the input features along complementary horizontal and vertical directions. Cosine positional encodings are introduced into these sequences to enhance positional information, thereby preserving pixel adjacency and directional consistency on a global scale. Concurrently, a local convolutional path supplements fine-grained features such as textures and edges. This allows the overall representation to encompass both local details and global semantics, providing a more robust feature foundation for subsequent feature fusion and detection. The operational schematic of the CrossGrid Memory Block is depicted in Figure 2.
The entire module receives an input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width of the feature map, respectively. The input is first dispatched into two complementary sub-paths:
In the local modeling path (Local Path), the input feature is processed by a lightweight C2f module to extract fine-grained details such as edges and textures:
$$F_{local} = \mathrm{C2f}(F_{in})$$
In the global modeling path (OGMM Path), we construct a spatial topology-enhanced branch based on state space modeling [23]. The input feature map is projected along the horizontal and vertical directions into two one-dimensional sequences, $S_h$ and $S_v$, both of which preserve pixel adjacency in the two-dimensional grid. To avoid the loss of spatial information during unfolding, cosine positional encodings [66], denoted as $PE_h$ and $PE_v$, are added to each sequence to explicitly embed spatial position information. These sequences are then processed by independent Mamba Blocks to perform selective state space modeling and capture long-range contextual dependencies in their respective directions. Finally, the outputs are remapped back to 2D space and fused via element-wise summation:
$$F_{global} = \mathrm{Unflatten}\big(\mathrm{Mamba}_h(S_h + PE_h)\big) + \mathrm{Unflatten}\big(\mathrm{Mamba}_v(S_v + PE_v)\big)$$
Compared with the traditional “flatten-and-remodel” approach, this orthogonal fusion strategy preserves linear complexity while avoiding the destruction of two-dimensional topology.
After the two aforementioned paths complete their respective modeling, the feature fusion stage integrates the local features $F_{local}$ and global features $F_{global}$ through element-wise operations. These are then passed through a residual connection along with the original input $F_{in}$ to further ensure gradient propagation and information integrity. The fused result is subsequently fed into a standard convolutional module for channel control and non-linear transformation. The final output feature representation of the entire module is given by
$$F_{out} = \mathrm{Conv}(F_{in} + F_{local} + F_{global})$$
Compared with traditional SSMs that flatten and remodel sequences, CGMB maintains two-dimensional topology. Compared with single-path convolutional or state space modeling, its dual-path complementarity and residual fusion enhance feature completeness and robustness. Consequently, CGMB demonstrates higher interpretability and adaptability in indoor scenes with severe occlusion and complex stacking, while maintaining strong plug-and-play flexibility and generalizability.
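To make the dual-path data flow concrete, the following PyTorch sketch mirrors the CGMB computation under simplifying assumptions: a plain convolution block stands in for the C2f module, and a generic recurrent layer (nn.GRU) stands in for the horizontal and vertical Mamba blocks, whose internal configuration is not detailed here. The class name CGMBSketch and all hyperparameters are illustrative, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def cosine_pos_encoding(length, dim, device):
    # Sinusoidal/cosine positional encoding of shape (length, dim); assumes an even dim.
    pos = torch.arange(length, device=device).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, device=device) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim, device=device)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class CGMBSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Sequential(                       # stand-in for the C2f local path
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.SiLU())
        # Stand-ins for the horizontal and vertical Mamba blocks of the OGMM path.
        self.mix_h = nn.GRU(channels, channels, batch_first=True)
        self.mix_v = nn.GRU(channels, channels, batch_first=True)
        self.out = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                 nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x):                                 # x: (B, C, H, W)
        B, C, H, W = x.shape
        f_local = self.local(x)
        # Horizontal unfolding S_h: every row becomes a sequence of length W.
        s_h = x.permute(0, 2, 3, 1).reshape(B * H, W, C) + cosine_pos_encoding(W, C, x.device)
        g_h, _ = self.mix_h(s_h)
        g_h = g_h.reshape(B, H, W, C).permute(0, 3, 1, 2)
        # Vertical unfolding S_v: every column becomes a sequence of length H.
        s_v = x.permute(0, 3, 2, 1).reshape(B * W, H, C) + cosine_pos_encoding(H, C, x.device)
        g_v, _ = self.mix_v(s_v)
        g_v = g_v.reshape(B, W, H, C).permute(0, 3, 2, 1)
        f_global = g_h + g_v                              # unflatten and fuse the two directions
        return self.out(x + f_local + f_global)           # residual fusion + convolutional output
```

Replacing the GRU stand-ins with actual selective state-space (Mamba) blocks preserves the same interface, as both operate on (batch, sequence, channel) tensors.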

3.3. Hölder-Based Regularity Guidance–Hierarchical Context Aggregation Block

Indoor object detection often encounters challenges such as large-scale variations, densely distributed targets, frequent occlusions, and irregular texture and boundary distributions. This requires the detector, at the feature fusion stage, to not only maintain the consistency of global structural information but also emphasize fine-grained features critical for recognizing small objects. Conventional feature pyramid structures can provide cross-scale alignment, but they lack a dynamic balancing mechanism between global and local contexts. To address this, we propose the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation Block (HG-HCA), which introduces a local regularity map to adaptively modulate the contributions of macro- and micro-context pathways, thereby achieving a self-adjusting balance between global structural modeling and local discriminative enhancement. The overall architecture is illustrated in Figure 3.
The core of HG-HCA is the Hölder regularity prior. Given an input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$, a lightweight Hölder prior head (HP-head) generates a pixel-wise local regularity map $\tilde{\alpha}(x)$, which characterizes the smoothness and roughness of local regions. Specifically, three sets of depthwise separable convolutions with different receptive fields are used to simulate multi-scale responses. These responses undergo local energy aggregation and logarithmic transformation to approximate wavelet energy spectra, thereby yielding regularity measurements. A 1 × 1 convolution is then applied to linearly combine the multi-scale responses and fit the relationship between scale and energy, producing a regularity map correlated with the Hölder exponent. For stability, the map is further processed with smoothing convolution and instance normalization, followed by a learnable calibration layer to generate a task-related guidance signal:
$$\alpha^{*}(x) = \sigma\big(\gamma\,\tilde{\alpha}(x) - \tau\big)$$
where $\gamma$ and $\tau$ are learnable parameters, and $\sigma$ denotes the sigmoid activation function. This design allows the network to dynamically learn how the smoothness or roughness of local regions should influence the balance between global and local features in accordance with the task objectives.
To enable dynamic adjustment, the calibrated regularity map $\alpha^{*}(x)$ is further mapped into mutually exclusive weights for the macro and micro pathways. We employ a softmax function to generate the two weights as follows:
$$w_{macro}(x) = \frac{\exp\big(\gamma_m\,\alpha^{*}(x) - \tau_m\big)}{\exp\big(\gamma_m\,\alpha^{*}(x) - \tau_m\big) + \exp\big(\gamma_\mu\,(1 - \alpha^{*}(x)) - \tau_\mu\big)}$$

$$w_{micro}(x) = \frac{\exp\big(\gamma_\mu\,(1 - \alpha^{*}(x)) - \tau_\mu\big)}{\exp\big(\gamma_m\,\alpha^{*}(x) - \tau_m\big) + \exp\big(\gamma_\mu\,(1 - \alpha^{*}(x)) - \tau_\mu\big)}$$
where $(\gamma_m, \tau_m)$ and $(\gamma_\mu, \tau_\mu)$ are two independent sets of learnable parameters. By definition, $w_{macro}(x) + w_{micro}(x) = 1$, ensuring mutual exclusivity of the two weights at the pixel level. As a result, pixels in highly regular and smooth regions strengthen the macro-pathway contribution, while those in irregular and texture-rich regions rely more on the micro-pathway, thus achieving a dynamic balance across different spatial areas.
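The regularity prior and the pathway weights can be sketched as follows; the choice of 3 × 3, 5 × 5, and 7 × 7 depthwise kernels, the log-energy approximation, and all parameter names are assumptions made for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HPHeadSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Depthwise convolutions with growing receptive fields emulate multi-scale responses.
        self.dw = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (3, 5, 7)])
        self.combine = nn.Conv2d(3, 1, 1)        # fits the scale-energy relationship
        self.smooth = nn.Conv2d(1, 1, 3, padding=1)
        self.norm = nn.InstanceNorm2d(1, affine=True)
        # Calibration (gamma, tau) and pathway-specific (gamma_m/tau_m, gamma_mu/tau_mu) parameters.
        self.gamma, self.tau = nn.Parameter(torch.ones(1)), nn.Parameter(torch.zeros(1))
        self.gamma_m, self.tau_m = nn.Parameter(torch.ones(1)), nn.Parameter(torch.zeros(1))
        self.gamma_u, self.tau_u = nn.Parameter(torch.ones(1)), nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Local log-energies at each scale approximate a wavelet energy spectrum.
        logs = [torch.log((d(x) ** 2).mean(dim=1, keepdim=True) + 1e-6) for d in self.dw]
        alpha_raw = self.norm(self.smooth(self.combine(torch.cat(logs, dim=1))))
        alpha = torch.sigmoid(self.gamma * alpha_raw - self.tau)       # calibrated alpha*(x)
        # Mutually exclusive macro/micro weights via a pixel-wise two-way softmax.
        logit_macro = self.gamma_m * alpha - self.tau_m
        logit_micro = self.gamma_u * (1.0 - alpha) - self.tau_u
        w = torch.softmax(torch.cat([logit_macro, logit_micro], dim=1), dim=1)
        return alpha, w[:, :1], w[:, 1:]                               # alpha*, w_macro, w_micro
```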
In the macro-context pathway, the main objective is to efficiently model long-range dependencies across pixels. To this end, we introduce a reversible reordering operator $R^{(k)}$, which groups and rearranges features along the height or width dimension so that distant elements are mapped to adjacent positions, allowing standard convolutions to capture long-range interactions directly. Specifically, given an input $x \in \mathbb{R}^{C \times H \times W}$, the operator $R_w^{(k)}$ rearranges the width dimension to obtain $x' \in \mathbb{R}^{Ck \times H \times \frac{W}{k}}$. A 3 × 3 convolution is applied, followed by $R_w^{(k)\,-1}$ to restore the original spatial shape in $\mathbb{R}^{C \times H \times W}$:
$$\hat{F}_w^{(k)} = R_w^{(k)\,-1}\Big(\mathrm{Conv}_{3\times3}\big(R_w^{(k)}(F_{in})\big)\Big)$$
The resulting feature map $\hat{F}_w^{(k)}$ maintains the same dimensions as the input. A similar procedure is applied in the height dimension using $R_h^{(k)}$ and $R_h^{(k)\,-1}$, yielding
$$\hat{F}_{wh}^{(k)} = R_h^{(k)\,-1}\Big(\mathrm{Conv}_{3\times3}\big(R_h^{(k)}(\hat{F}_w^{(k)})\big)\Big)$$
In this way, distant pixels along the vertical direction are also mapped to local neighborhoods, enabling efficient long-range dependency modeling while preserving resolution. In this work, we set k = 2, 4 to capture both mid-range and long-range relationships in horizontal and vertical directions. Compared with directly applying large convolution kernels, this reordering–convolution–inverse reordering mechanism achieves more efficient long-range modeling without altering the resolution. Combined with the macro-pathway weights, the final macro-pathway output is
$$F_{macro} = F_{in} + w_{macro} \otimes \mathrm{Conv}_{1\times1}\big(\big[\hat{F}_{wh}^{(2)};\ \hat{F}_{wh}^{(4)}\big]\big)$$
where $[\,\cdot\,;\,\cdot\,]$ denotes channel concatenation and $\otimes$ denotes element-wise multiplication. This mechanism enhances macro-context contributions in regular regions to preserve global consistency, while suppressing them in texture-rich regions to avoid unnecessary mixing of distant features.
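One plausible realization of the reorder–convolve–inverse-reorder step along the width axis is sketched below (the height case is symmetric). The specific grouping, which folds k width segments into the channel dimension so that columns $W/k$ apart share a spatial position, is an assumed reading of the operator $R_w^{(k)}$ consistent with the description above, not necessarily the authors' exact rearrangement.

```python
import torch
import torch.nn as nn

class WidthReorderConv(nn.Module):
    def __init__(self, channels, k):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(channels * k, channels * k, 3, padding=1)

    def forward(self, x):                      # x: (B, C, H, W); W assumed divisible by k
        B, C, H, W = x.shape
        k = self.k
        # R_w^(k): split the width into k contiguous segments of length W/k and fold them
        # into the channel dimension, so columns W/k apart end up at the same position.
        x = x.view(B, C, H, k, W // k).permute(0, 1, 3, 2, 4).reshape(B, C * k, H, W // k)
        x = self.conv(x)                       # a 3x3 conv now mixes distant columns directly
        # R_w^(k,-1): invert the rearrangement to restore the original (B, C, H, W) shape.
        return x.view(B, C, k, H, W // k).permute(0, 1, 3, 2, 4).reshape(B, C, H, W)
```

Stacking such blocks along the width and height axes with k = 2 and k = 4, weighting the concatenated outputs by $w_{macro}$, and adding the residual reproduces the macro-pathway equation above.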
Complementary to the macro-path, the Micro-Context Refinement Path aims to enhance high-frequency details and local salient patterns within the image, enabling the model to more accurately identify object edges, textures, and small targets. This path first models local structures using directionally sensitive asymmetric convolutions on the input feature, formulated as
$$F_{local} = \mathrm{Conv}_{3\times7}(F_{macro}) + \mathrm{Conv}_{7\times3}(F_{macro})$$
The combination of convolutional kernels with different orientations allows the network to simultaneously capture horizontal and vertical texture patterns, providing a more discriminative base representation for the subsequent attention mechanisms.
In the channel attention stage, a weighted pooling strategy is employed, using $w_{micro}$ to form channel descriptors:
$$z_c = \frac{\sum_{i,j} w_{micro}(i,j)\, F_{local}(c,i,j)}{\sum_{i,j} w_{micro}(i,j) + \epsilon}$$
where $c$ is the channel index, $(i,j)$ denotes the spatial location, and $\epsilon$ is a stability term. Weighted pooling ensures that texturally complex regions have a higher contribution to the channel statistics, thereby elevating the importance of relevant channels in the subsequent attention allocation. The channel descriptor vector then undergoes feature compression and restoration via dimensionality-reducing and dimensionality-increasing convolutions:
$$l_c = W_2\,\delta(W_1 z_c)$$
where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ with reduction ratio $r$, and $\delta$ is the ReLU activation function.
Building on this, we introduce the concept of logit-space affine modulation, applying the micro-weights directly to the pre-activation stage of the attention to achieve more stable gradient propagation. The modulated channel attention logits are represented as
$$l_c' = l_c + \beta_c\, \bar{w}_{micro}$$
where $\beta_c$ is a learnable scalar and $\bar{w}_{micro}$ is the global average of the micro-weights. Finally, the channel attention weights are
$$W_c = \sigma(l_c')$$
Subsequently, the channel attention weights are multiplied channel-wise with the features to obtain the channel-enhanced features:
$$F_c = F_{local} \otimes W_c$$
This design allows high-frequency features in less regular regions to be more prominently selected in the channel dimension, while more regular regions maintain a relatively stable response.
In the spatial attention stage, the goal is to identify which locations on the image plane require further emphasis. To this end, we first perform channel-wise average pooling and max pooling on the channel-enhanced feature map $F_c$ to obtain two single-channel saliency maps. These are concatenated with the micro-weights $w_{micro}$ and passed through a 7 × 7 convolution to generate the pre-activation features for spatial attention:
$$l_s = \mathrm{Conv}_{7\times7}\big(\big[\mathrm{AvgPool}_C(F_c);\ \mathrm{MaxPool}_C(F_c);\ w_{micro}\big]\big)$$
Affine modulation is then used to further strengthen the representation of rough regions in the attention map:
$$l_s' = l_s + \beta_s\, w_{micro}$$
where $\beta_s$ is a learnable modulation coefficient. After obtaining the spatial attention weights, they are applied to the channel-enhanced features to produce the output of the micro-path:
$$F_{micro} = F_c \otimes \sigma(l_s')$$
where $\sigma$ denotes the sigmoid activation function. Finally, the micro-pathway output $F_{micro}$ is passed through a C2f module for deeper feature recombination, then combined with the original input via a residual connection and adjusted by a 1 × 1 convolution:
$$F_{out} = \mathrm{Conv}_{1\times1}\big(F_{in} + \mathrm{C2f}(F_{micro})\big)$$
Through this design, the HG-HCA module leverages Hölder-based regularity maps as pixel-wise priors to dynamically balance global and local contributions. The macro pathway ensures efficient modeling of structural consistency, while the micro pathway emphasizes fine-grained discriminative details. Their integration via residual and weight modulation enhances robustness in indoor scenes with large-scale variations, dense occlusions, and complex textures, while also ensuring good interpretability and extensibility.
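A compact sketch of the micro-context refinement path is given below; the reduction ratio r, the omission of the final C2f recombination and residual projection, and all module names are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MicroPathSketch(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (3, 7), padding=(1, 3))
        self.conv_v = nn.Conv2d(channels, channels, (7, 3), padding=(3, 1))
        self.fc1 = nn.Conv2d(channels, channels // r, 1)   # W1: channel reduction
        self.fc2 = nn.Conv2d(channels // r, channels, 1)   # W2: channel restoration
        self.beta_c = nn.Parameter(torch.zeros(1))         # logit-space modulation scalars
        self.beta_s = nn.Parameter(torch.zeros(1))
        self.spatial = nn.Conv2d(3, 1, 7, padding=3)

    def forward(self, f_macro, w_micro):                   # w_micro: (B, 1, H, W)
        # Directionally sensitive asymmetric convolutions.
        f_local = self.conv_h(f_macro) + self.conv_v(f_macro)
        # Regularity-weighted channel descriptor z_c and channel-attention logits l_c.
        z = (w_micro * f_local).sum(dim=(2, 3), keepdim=True) / \
            (w_micro.sum(dim=(2, 3), keepdim=True) + 1e-6)
        l_c = self.fc2(torch.relu(self.fc1(z)))
        l_c = l_c + self.beta_c * w_micro.mean(dim=(2, 3), keepdim=True)
        f_c = f_local * torch.sigmoid(l_c)                 # channel-enhanced features F_c
        # Spatial attention from pooled saliency maps concatenated with w_micro.
        l_s = self.spatial(torch.cat([f_c.mean(dim=1, keepdim=True),
                                      f_c.amax(dim=1, keepdim=True), w_micro], dim=1))
        l_s = l_s + self.beta_s * w_micro
        return f_c * torch.sigmoid(l_s)                    # micro-pathway output F_micro
```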

3.4. Frequency-Guided Residual Block

In the task of indoor object detection, models must often strike a balance between deep semantic features and shallow spatial details. High-frequency information, such as edges and textures, is particularly crucial for maintaining discriminability, especially in the recognition of small objects and the differentiation of occluded ones. However, existing detection frameworks commonly rely on up-sampling in the Neck stage to restore spatial resolution. While computationally efficient, prevalent methods like nearest-neighbor or bilinear interpolation inevitably introduce a smoothing effect when enlarging feature maps, thereby attenuating the expression of high-frequency details. This phenomenon often results in the blurring of small objects and intricate details after up-sampling, further exacerbating the already prominent issue of detail loss in indoor scenes.
To mitigate this limitation, this paper proposes the Frequency-Guided Residual Block (FGRB). The design objective of this module is to selectively recover and enhance high-frequency details while maintaining the stability of the global structure, thereby improving the representation of small objects and local textures. The core concept of FGRB is to model the conventional spatial up-sampling path in parallel with a frequency-domain compensation path: the main path is responsible for providing smooth and stable low-frequency structural information, while the auxiliary path recovers edge and texture details through frequency analysis and compensation. The two are ultimately fused in the residual domain. This design transforms the up-sampling process from mere spatial expansion into a synergistic optimization that considers both global structure and local detail, providing more robust feature support for small objects and complex scenes in indoor detection. Its structure is illustrated in Figure 4.
The FGRB module takes a low-resolution feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ from the deeper layers of the Neck as input. Internally, the module consists of two parallel branches: a spatial up-sampling path and a frequency-domain compensation path, responsible for preserving global structure and recovering local details, respectively. The spatial up-sampling path first performs bilinear interpolation on the input feature, doubling its spatial resolution to obtain $U_{\times 2}(F_{in}) \in \mathbb{R}^{C \times 2H \times 2W}$. To perform linear recombination in the channel dimension and suppress redundant information, this path then introduces a 1 × 1 convolutional operation, transforming the interpolated feature into $F_{main} \in \mathbb{R}^{C \times 2H \times 2W}$. This process enhances the representation of principal low-frequency components while preserving global structural information. The frequency-domain compensation path focuses on detail recovery. First, a 2D Fast Fourier Transform (FFT) is independently applied to each channel of the input feature $F_{in}$:
$$S = \mathrm{FFT}_{2D}(F_{in}) \in \mathbb{C}^{C \times H \times W}$$
This transforms the features from the spatial domain to a frequency-domain representation, where coefficients in the central region correspond to low-frequency components and those in the surrounding regions correspond to high-frequency components. To selectively enhance detail information in the frequency domain, a learnable frequency weight matrix $W_{freq} \in \mathbb{C}^{C \times H \times W}$ is introduced. This matrix gradually learns to selectively preserve or suppress frequency components at different locations during training. The enhanced frequency-domain features are obtained through an element-wise Hadamard product:
$$S_{enh} = S \odot W_{freq}$$
This process allows the network to flexibly adjust the weights of details such as edges and textures in the frequency domain, achieving adaptive high-frequency compensation. Subsequently, the enhanced frequency-domain features are transformed back to the spatial domain via a 2D inverse FFT:
$$F_{comp} = \mathrm{FFT}_{2D}^{-1}(S_{enh})$$
This yields a low-resolution detail-compensated feature map. To match the spatial resolution of the main path, a transposed convolution $\mathrm{TConv}_{k,2}$ with a stride of 2 is used to up-sample $F_{comp}$, generating $F_{comp}' \in \mathbb{R}^{C \times 2H \times 2W}$. The transposed convolution not only serves to spatially enlarge the feature map but also meticulously “paints” the recovered textures and edges during the up-sampling process through its learnable kernel, enabling a more precise alignment of the compensated features with the spatial grid.
The outputs of the two paths are fused in the residual domain:
$$F_{fuse} = F_{main} + F_{comp}'$$
This fusion method effectively introduces the recovered detail signals from the compensation path while preserving the original structural information. The residual form also contributes to the stability of the optimization process. Following the fusion, a 1 × 1 convolutional layer is further introduced, accompanied by batch normalization and a SiLU activation function:
$$F_{out} = \mathrm{SiLU}\big(\mathrm{BN}(\mathrm{Conv}_{1\times1}(F_{fuse}))\big)$$
This is used to perform a weighted combination and recalibration of the fused features along the channel dimension, thereby suppressing redundant features and enhancing key ones. This step not only improves the discriminative capability of the fused result but also allows for flexible control over the number of output channels, ensuring seamless integration with subsequent detection head structures.
The final high-resolution output feature map $F_{out}$ possesses both a stable global structural representation and rich detail textures, providing a significant performance boost in tasks sensitive to small objects and details. This design, which operates through the synergy of frequency-domain compensation and spatial up-sampling, enables the module to more comprehensively preserve multi-scale information, thereby providing high-quality feature support for subsequent detection.
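The following PyTorch sketch illustrates the FGRB structure, assuming a fixed input resolution for the learnable frequency weights, orthonormal FFT scaling, and a 2 × 2 transposed convolution for the ×2 up-sampling; these are illustrative choices rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FGRBSketch(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, 1)          # 1x1 conv on the main path
        # Learnable complex-valued frequency weights, one per channel and frequency bin.
        self.w_freq = nn.Parameter(torch.ones(channels, height, width, dtype=torch.cfloat))
        self.tconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)
        self.out = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                 nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x):                                       # x: (B, C, H, W)
        # Spatial path: bilinear x2 up-sampling followed by channel recombination.
        f_main = self.reduce(F.interpolate(x, scale_factor=2, mode="bilinear",
                                           align_corners=False))
        # Frequency path: per-channel 2D FFT, learnable re-weighting, inverse FFT.
        s_enh = torch.fft.fft2(x, norm="ortho") * self.w_freq
        f_comp = self.tconv(torch.fft.ifft2(s_enh, norm="ortho").real)
        return self.out(f_main + f_comp)                        # residual fusion + recalibration
```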

4. Experiments

4.1. Experimental Setup

To validate the effectiveness and applicability of the proposed method, this study established a multi-dimensional experimental verification platform. The hardware configuration comprised a high-performance computing platform equipped with an NVIDIA RTX 4090 GPU, an AMD R9 9900X CPU, and 64GB of DDR5 6000 MHz memory, capable of supporting large-scale parallel data processing and model training requirements. The software environment was based on the PyTorch 2.2.1 deep learning framework, in conjunction with CUDA 12.7, to enable high-efficiency acceleration of GPU operators and ensure hardware driver compatibility and operational stability. To ensure the stability and convergence efficiency of the training process, this study employed the SGD optimizer and incorporated a warm-up mechanism to mitigate gradient oscillations during the initial stages of training.
This study utilized the Furniture Detection v20 [67] dataset as the core data foundation for the experiments. During the preprocessing stage, the original data underwent automatic rotation correction and size normalization. Data augmentation techniques, including rotation and cropping, were also applied to enhance the diversity and robustness of the samples. However, the original dataset exhibited certain deficiencies in class distribution and annotation format; some classes were sparsely populated, and some mask annotations were unsuitable for the detection task. To address these issues, this study systematically reprocessed the dataset. This included merging semantic categories, converting masks to bounding boxes, and unifying label formats, thereby constructing a high-quality version more suitable for the object detection task. Sample images from the dataset are shown in Figure 5.
The dataset contains a total of 6956 images, which were partitioned into training, validation, and test sets according to a 7:2:1 ratio. All images were resized to 416 × 416, maintaining a consistent aspect ratio. The bounding box areas exhibit a wide span, covering both large furniture that occupies a significant portion of the visual field and smaller objects or partial structures. In terms of composition, the dataset comprises 11 categories, encompassing typical indoor objects such as bedding, storage furniture, lighting fixtures, and decorative components. The class distribution displays a significant imbalance: the largest class, ‘Sofa’, accounts for 26.97% of instances, whereas the smallest class, ‘Nightstand’, constitutes only 2.89%, representing a nearly 9.3-fold difference in the number of instances. This long-tail distribution is consistent with the patterns observed in real-world indoor environments but also places higher demands on the model’s ability to learn from few-shot categories. The specific class distribution details are provided in Table 1.
In the model performance evaluation phase, we adopt core metrics covering the two dimensions of detection accuracy and detection efficiency to comprehensively measure the performance of different methods on the indoor object detection task. In terms of accuracy, we use Mean Average Precision (mAP) as the primary evaluation metric. mAP is calculated based on Intersection over Union (IoU), which measures the degree of overlap between the model's predicted bounding box and the ground-truth bounding box, defined as the ratio of the intersection to the union of the two boxes. For a single class $c$ and IoU threshold $\tau$, the Average Precision $AP_c^{\tau}$ is obtained by integrating the precision–recall (P–R) curve, where precision is expressed as a function of recall, $P_c(R)$, at that threshold:
$$AP_c^{\tau} = \int_0^1 P_c(R)\, \mathrm{d}R$$
Here, $P = \frac{TP}{TP + FP}$ and $R = \frac{TP}{TP + FN}$, where TP, FP, and FN represent the number of true positive, false positive, and false negative predictions, respectively. The final mAP metric is obtained by calculating the mean of the AP values across all classes:
$$mAP^{\tau} = \frac{1}{N_{classes}} \sum_{c=1}^{N_{classes}} AP_c^{\tau}$$
Common evaluation metrics under different thresholds include:
  • mAP@0.5: The mean Average Precision at an IoU threshold of 0.5, representing performance under a looser matching condition between predicted and ground-truth boxes.
  • mAP@0.75: The mean Average Precision at an IoU threshold of 0.75, which more strictly reflects the localization accuracy of the bounding boxes.
  • mAP@0.5:0.95: The result averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05, constituting a comprehensive measurement of detection performance.
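For reference, the sketch below computes the IoU of a box pair and the per-class AP by integrating the precision–recall curve as defined above; matching predictions to ground truth per image at each IoU threshold, which a complete evaluator also performs, is omitted for brevity.

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); IoU = intersection area / union area.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(scores, is_tp, num_gt):
    # scores: confidences of the detections for one class; is_tp: 1 if a detection is
    # matched to a ground-truth box at IoU >= tau, else 0; num_gt: ground-truth count.
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    envelope = np.maximum.accumulate(precision[::-1])[::-1]   # monotone precision envelope
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):                        # integrate P(R) over recall
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# mAP at a threshold tau is the mean of average_precision over all classes;
# mAP@0.5:0.95 additionally averages over tau in {0.50, 0.55, ..., 0.95}.
```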
In terms of detection efficiency, we use two metrics to describe efficiency: Frames Per Second (FPS) and per-image latency (ms/img). The per-image latency can be broken down into three components: preprocessing (pre), inference (infer), and post-processing (post), with the total being
$$\mathrm{Latency} = T_{pre} + T_{infer} + T_{post}$$
In the actual operation of the model, $T_{pre}$ denotes the time for preprocessing steps such as image reading and normalization, $T_{infer}$ is the time consumed by the deep neural network's forward pass, and $T_{post}$ includes steps such as Non-Maximum Suppression (NMS) and result decoding. FPS is calculated as the ratio of the total number of images in the test set, $N$, to the total inference time, $T_{total}$; equivalently, it is the reciprocal of the average per-image latency and represents the number of images the model can process per unit of time:
$$\mathrm{FPS} = \frac{N}{T_{\mathrm{total}}} = \frac{1}{\mathrm{Latency}}$$
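The latency breakdown and FPS above can be measured with a simple timing loop. The sketch below assumes a PyTorch model on a CUDA device and generic preprocess/postprocess callables; all names are illustrative and do not reproduce the exact benchmarking code used in this work.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, images, preprocess, postprocess, device="cuda"):
    """Average per-image latency split into preprocessing, inference, and post-processing."""
    model.eval().to(device)
    t_pre = t_infer = t_post = 0.0
    for img in images:
        t0 = time.perf_counter()
        x = preprocess(img).unsqueeze(0).to(device)   # e.g., resize, normalize, convert to tensor
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        y = model(x)                                  # forward pass
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        _ = postprocess(y)                            # e.g., box decoding and NMS
        t3 = time.perf_counter()
        t_pre, t_infer, t_post = t_pre + (t1 - t0), t_infer + (t2 - t1), t_post + (t3 - t2)
    n = len(images)
    latency_s = (t_pre + t_infer + t_post) / n        # average seconds per image
    return {"pre_ms": 1e3 * t_pre / n, "infer_ms": 1e3 * t_infer / n,
            "post_ms": 1e3 * t_post / n, "latency_ms": 1e3 * latency_s, "fps": 1.0 / latency_s}
```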
By combining these two dimensions of accuracy and efficiency, we can comprehensively evaluate the model's ability to detect small objects, handle complex occlusions, and maintain real-time performance, providing a solid foundation for the subsequent comparative experiments and the validation of the proposed improvements.

4.2. Experimental Results and Comparison

To comprehensively validate the effectiveness of the proposed method, this paper conducted a comparative analysis against several current mainstream object detectors on the Furniture Detection v20 dataset. The experimental results are presented in Table 2 and Figure 6. In the table, the best-performing results are marked in bold, and the second-best results are underlined. As can be observed from the table, MDF-YOLO achieves the best overall accuracy, ranking first or near-first on the accuracy-related evaluation metrics. The overall results indicate that the proposed multi-domain fusion mechanism can significantly enhance the accuracy and robustness of detection in complex indoor scenes.
Specifically, on the mAP@0.5 metric, MDF-YOLO achieved 0.7158 on the validation set and 0.6803 on the test set, the highest values among all compared methods and a clear improvement over the other detectors. On the more stringent mAP@0.75 metric, MDF-YOLO reached 0.6117 and 0.6266, obtaining the best test-set result among the compared methods. This result highlights the advantages of the proposed multi-scale context fusion design, which enables the detector to better capture semantic correlations in structurally complex indoor scenes and thereby improves its discriminative capability for furniture objects with ambiguous boundaries and large scale variations. Furthermore, on the comprehensive mAP@0.5:0.95 metric, MDF-YOLO reached 0.5814 and 0.5615, the best validation result among the compared methods, maintaining stable accuracy across different IoU thresholds. Through hierarchical attention allocation, the model effectively models long-range dependencies and avoids losing semantic information when detecting objects in large scenes, allowing MDF-YOLO to maintain stable recognition of large objects while improving its detection of small objects and fine-grained decorative items.
In terms of inference efficiency, YOLOv10 and YOLOv11 demonstrated higher frame rates and lower latency. In contrast, MDF-YOLO achieved a frame rate of 354.6 FPS with a latency of 2.82 ms, slightly slower than these two methods. However, when considering both accuracy and efficiency, MDF-YOLO still delivers the optimal detection accuracy while operating at a high inference speed. This balance between accuracy and speed indicates that our proposed model avoids introducing redundant computations while enhancing features for small objects, allowing it to maintain high detection performance within a limited computational budget.
To further validate the model’s effectiveness intuitively, Figure 6 displays the detection results of different methods in typical indoor scenes. The results show that MDF-YOLO exhibits superior robustness in multi-class object recognition. For the detection of large objects such as sofas, beds, and closets, MDF-YOLO generates bounding boxes that more accurately conform to the actual object boundaries. For small objects like nightstands and wall lamps, the detection results of MDF-YOLO are nearly identical to the ground-truth annotations, whereas other methods commonly suffer from missed detections or bounding box offsets. Moreover, MDF-YOLO maintains stable detection performance even in scenes with significant occlusion, which further validates the effectiveness of the multi-domain fusion strategy in complex environments.
When benchmarked on the Furniture Detection v20 dataset, MDF-YOLO surpasses existing methods in accuracy while maintaining a practical inference speed. Through the synergistic action of the CGMB, HG-HCA, and FGRB modules, the model achieves comprehensive improvements in feature representation, context modeling, and small-object enhancement, thereby demonstrating the best overall performance on the indoor furniture detection task.

4.3. Ablation Study

To systematically analyze the individual efficacy of each core component of the MDF-YOLO architecture and to verify their synergistic effects, we conducted a series of comprehensive ablation studies using YOLOv8 as the baseline model. All experiments were performed under identical training strategies and data configurations. We progressively integrated the CrossGrid Memory Block (CGMB), the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation Block (HG-HCA Block), and the Frequency-Guided Residual Block (FGRB) into the baseline architecture, either individually or in combination. The quantitative results are summarized in Table 3, where the best results are marked in bold and the second-best are underlined. The values in parentheses in Table 3 denote the difference between the current configuration and the baseline model.
The experimental results reveal the contributions of each module. In the validation of single-module gains, individually introducing the CGM Block increased the mAP@0.5 on the validation and test sets from 0.6904/0.6674 to 0.7021/0.6728. This improvement validates the effectiveness of its orthogonal grid memory modeling mechanism, which models long-range spatial dependencies via a state-space model, thereby compensating for the baseline network’s deficiency in global relation reasoning. Similarly, when the HG-HCA Block was integrated alone, the mAP@0.5 reached 0.6996/0.6715. This indicates that its hierarchical macro-micro context aggregation strategy, along with the path weight adjustment method based on the multifractal prior, can effectively align semantic information between high- and low-level features and enhance the model’s discriminative ability in complex backgrounds. The independent introduction of the FGR Block also yielded a stable performance increase to 0.6957/0.6698. Its parallel frequency-domain compensation path successfully mitigated the information decay of small object features during deep network propagation by explicitly recovering high-frequency components.
In the module combination tests, the complementary effects between components became increasingly apparent. The combination of the CGM Block and HG-HCA Block elevated the mAP@0.5 to 0.7134/0.6784, making it the second-best configuration after the final model. This demonstrates that the integration of global structural awareness and hierarchical semantic alignment produces a strong synergy. Ultimately, when all three modules were integrated to form the complete MDF-YOLO architecture, the model's performance reached its peak, with the mAP@0.5 metric rising to 0.7158/0.6803. In terms of efficiency, although the introduction of the new modules reduced the FPS from 378.1 to 354.6 and slightly increased the latency to 2.82 ms, the performance remains well within the scope of real-time applications. Overall, MDF-YOLO achieves a significant improvement in accuracy at a minor cost to efficiency, validating the novelty and practicality of our multi-domain fusion strategy.
To intuitively reveal the operational mechanisms of each module at the feature level, we conducted a further visual analysis of the model’s attention distribution via heatmaps before and after module integration. The heatmap comparison in Figure 7 clearly illustrates the role of the CGM Block in enhancing global structural perception. Before its introduction, the baseline model’s response to large-sized objects was relatively scattered, with activation areas appearing fragmented. After its introduction, thanks to the CGM Block’s capture of long-range spatial dependencies via its orthogonal grid memory modeling mechanism, previously discontinuous activation areas on the target were effectively integrated, forming a continuous and uniform high-response region that covers the object’s complete contour. This visually corroborates the ability of CGMB to improve global consistency and localization stability.
Figure 8 showcases the role of the HG-HCA Block in context aggregation and detail preservation. The ablation results indicate that without this module, the model’s attention distribution is relatively diffuse and susceptible to interference from complex background textures, causing the edges of small objects and local details to be obscured. After incorporating the HG-HCA Block, the attention distribution improves markedly. This change is attributable not only to the macro-path’s enhanced modeling of long-range dependencies in smooth regions, guided by the fractal prior, but also to the saliency bias gained by the micro-path in less regular regions, which allows it to more effectively highlight texture and edge features. As seen in the heatmaps, after modulation by the HG-HCA, the model’s attention response on the main body of the target becomes more concentrated, and continuous response bands are formed between different semantic functional areas. This suggests that the local regularity information provided by the fractal prior effectively guides the resource allocation between the macro- and micro-paths, enabling the network to suppress background interference while preserving and enhancing task-relevant fine-grained features. Furthermore, irrelevant high-frequency responses in background areas are weakened, and the attention energy is concentrated within the target and its semantic neighborhood, thereby improving the detection’s discriminability and stability.
Figure 9 intuitively confirms the critical function of the FGR Block in recovering detail information. The baseline model, having lost high-frequency information in the deeper layers of the network, generally exhibited weak responses to small objects and object edges. The core of the FGR Block is its parallel frequency-domain compensation path, which adaptively enhances high-frequency components via a learnable frequency weight matrix. After this module was introduced, the heatmaps show distinct, high-intensity activation bands in detail-dense areas such as window frames and chair legs, as well as along object edges. This phenomenon provides strong evidence that the FGRB successfully recovered the detailed features that had been smoothed out.
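For intuition, the snippet below sketches a generic frequency-domain compensation path of the kind described above: the spectrum of a feature map is reweighted by a learnable per-frequency matrix and the result is fused residually with the spatial path. This is a simplified illustration under these assumptions, not the exact FGRB implementation.

```python
import torch
import torch.nn as nn

class FrequencyCompensation(nn.Module):
    """Reweight spectral components with a learnable map, return to the spatial domain,
    and add the compensation residually to the input feature map."""
    def __init__(self, channels, height, width):
        super().__init__()
        # one learnable weight per channel and rFFT frequency bin, initialized to identity
        self.freq_weight = nn.Parameter(torch.ones(channels, height, width // 2 + 1))

    def forward(self, x):                                 # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")           # complex spectrum (B, C, H, W//2+1)
        spec = spec * self.freq_weight                    # emphasize or attenuate frequency bands
        comp = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return x + comp                                   # residual fusion with the spatial path
```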
Finally, Figure 10 presents a comparison between the baseline model and the complete MDF-YOLO architecture, with heatmaps that are a culmination of all the aforementioned advantages. The response distribution of MDF-YOLO exhibits an ideal state of “stable global structure” and “rich local details”: it forms complete and uniform coverage over large objects, while displaying clear, sharp activations on small objects and complex textures. This eloquently demonstrates that the global consistency from CGMB, the semantic alignment capability of HG-HCA, and the detail recovery ability of FGRB, through their synergistic effects, ultimately achieve a feature representation that most closely approximates the ground-truth annotations, corresponding to the optimal quantitative metrics in Table 3.

5. Discussion

This study focuses on the significantly challenging task of object detection in indoor scenes. In such environments, objects commonly exhibit characteristics like large-scale variations, frequent occlusions, complex textures, and imbalanced layouts, making it difficult for conventional detectors based on convolution or a single paradigm to maintain high detection accuracy while ensuring real-time performance. To address this, this paper proposes a multi-domain fusion detection framework, MDF-YOLO. It introduces the CrossGrid Memory Block (CGMB), the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation (HG-HCA) module, and the Frequency-Guided Residual Block (FGRB) into the network architecture to perform feature enhancement in the state-space domain, spatial domain, and frequency domain, respectively. This establishes a more effective dynamic balance between global structure modeling and the preservation of local discriminability. Through this design, we not only verified the model's significant accuracy improvement on a public indoor object detection dataset but also maintained a real-time performance level comparable to the original YOLO framework in terms of inference speed and latency.
Experimental results show that MDF-YOLO achieves superior performance compared to existing mainstream detectors on mAP@0.5, mAP@0.75, and the comprehensive metric mAP@0.5:0.95. This performance validates the effectiveness of CGMB in modeling long-range dependencies while preserving 2D topological structure, the success of HG-HCA in adaptively fusing macro and micro contexts under the guidance of Hölder regularity priors, and the capability of FGRB to compensate for fine-grained structural details during high-frequency information recovery. It is noteworthy that the model’s results on FPS and Latency metrics indicate that its overall detection efficiency did not significantly decrease due to structural improvements. This suggests that the proposed method achieves stronger adaptability to complex indoor environments while maintaining a low computational cost.
Nevertheless, this study has certain limitations. First, the HG-HCA module relies on an approximate calculation of the Hölder regularity prior, and its generalization capability across different data distributions requires further validation. Second, although the proposed method maintains a relatively lightweight structure, its measured performance on embedded platforms or low-power devices needs further exploration.
Overall, the significance of this research lies in proposing a low-cost approach to enhance the performance of object detection in indoor scenes. By combining three complementary dimensions—state-space, spatial regularity, and frequency-domain compensation—it implements an effective framework for multi-domain fusion feature modeling. This approach not only provides a new perspective for improving detection accuracy but also offers a potential solution for practical applications under resource-constrained conditions. Future work could be directed towards cross-domain transfer, weakly supervised or self-supervised learning to enhance the model’s applicability and scalability in real-world complex environments.

6. Conclusions

This paper proposes the MDF-YOLO framework to address issues in indoor object detection, such as large-scale variations, frequent occlusions, and detail loss. In its design, the CrossGrid Memory Block is introduced into the backbone network to model long-range dependencies while preserving the 2D topological structure. In the neck stage, the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation module is designed to use a Hölder regularity map as a guide to adaptively balance macro-context and micro-discriminability. In the high-resolution restoration stage, the Frequency-Guided Residual Block is employed to compensate for high-frequency details lost during convolutional up-sampling. The synergy of these three components enables the model to form a complementary mechanism between global structure preservation and local feature enhancement.
Experimental results show that MDF-YOLO surpasses mainstream detectors on multiple accuracy metrics and achieves stronger robustness against small objects and complex occlusions while maintaining near real-time efficiency. This demonstrates that the proposed multi-domain fusion and Hölder regularity guidance mechanisms can effectively improve detection performance at a low cost. Despite these achievements, the study has limitations, such as HG-HCA's reliance on an approximate Hölder regularity computation and the limited evaluation of cross-domain transfer and of deployment on embedded devices. Future work could further integrate depth information, multi-modal inputs, and lightweight optimization techniques to expand the model's applicability under resource-constrained conditions.
In summary, the introduction of MDF-YOLO not only demonstrates the potential of combining mathematical priors with deep learning from an academic perspective but also provides a scalable solution for service robots, intelligent security, and indoor perception systems in practical applications.

Author Contributions

Conceptualization, F.L., J.Y. and H.Z.; Methodology, F.L. and J.Y.; Software, F.L. and J.Y.; Validation, F.L., J.Y. and H.Z.; Formal analysis, F.L. and H.Z.; Investigation, F.L. and H.Z.; Resources, J.Y.; Data curation, J.Y. and H.Z.; Writing—original draft, F.L.; Writing—review & editing, F.L., J.Y. and H.Z.; Visualization, J.Y. and H.Z.; Supervision, H.Z.; Project administration, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data derived from public domain resources.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Duan, J.; Yu, S.; Tan, H.L.; Zhu, H.; Tan, C. A survey of embodied AI: From simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 230–244. [Google Scholar] [CrossRef]
  2. Ye, Y.; Ma, X.; Zhou, X.; Bao, G.; Wan, W.; Cai, S. Dynamic and real-time object detection based on deep learning for home service robots. Sensors 2023, 23, 9482. [Google Scholar] [CrossRef] [PubMed]
  3. Alqobali, R.; Alshmrani, M.; Alnasser, R.; Rashidi, A.; Alhmiedat, T.; Alia, O.M.D. A survey on robot semantic navigation systems for indoor environments. Appl. Sci. 2023, 14, 89. [Google Scholar] [CrossRef]
  4. Chen, W.; Chi, W.; Ji, S.; Ye, H.; Liu, J.; Jia, Y.; Yu, J.; Cheng, J. A survey of autonomous robots and multi-robot navigation: Perception, planning and collaboration. Biomim. Intell. Robot. 2025, 5, 100203. [Google Scholar] [CrossRef]
  5. Raychaudhuri, S.; Chang, A.X. Semantic mapping in indoor embodied AI—A comprehensive survey and future directions. arXiv 2025, arXiv:2501.05750. [Google Scholar]
  6. Sünderhauf, N.; Dayoub, F.; McMahon, S.; Talbot, B.; Schulz, R.; Corke, P.; Wyeth, G.; Upcroft, B.; Milford, M. Place categorization and semantic mapping on a mobile robot. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 5729–5736. [Google Scholar]
  7. Sepulveda, G.; Niebles, J.C.; Soto, A. A deep learning based behavioral approach to indoor autonomous navigation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4646–4653. [Google Scholar]
  8. Hoseinnezhad, R. A comprehensive review of deep learning techniques in mobile robot path planning: Categorization and analysis. Appl. Sci. 2025, 15, 2179. [Google Scholar] [CrossRef]
  9. Mirjalili, R.; Krawez, M.; Walter, F.; Burgard, W. VLM-Vac: Enhancing smart vacuums through VLM knowledge distillation and language-guided experience replay. arXiv 2024, arXiv:2409.14096. [Google Scholar]
  10. Huang, B.; Chaki, D.; Bouguettaya, A.; Lam, K.Y. A survey on conflict detection in IoT-based smart homes. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  11. Örnek, E.P.; Krishnan, A.K.; Gayaka, S.; Kuo, C.H.; Sen, A.; Navab, N.; Tombari, F. SupeRGB-D: Zero-shot instance segmentation in cluttered indoor environments. IEEE Robot. Autom. Lett. 2023, 8, 3709–3716. [Google Scholar] [CrossRef]
  12. Naseer, M.; Khan, S.; Porikli, F. Indoor scene understanding in 2.5/3D for autonomous agents: A survey. IEEE Access 2018, 7, 1859–1887. [Google Scholar] [CrossRef]
  13. Singh, K.J.; Kapoor, D.S.; Thakur, K.; Sharma, A. Computer-vision based object detection and recognition for service robot in indoor environment. Comput. Mater. Contin. 2022, 72, 197–213. [Google Scholar] [CrossRef]
  14. Wu, X.; Sahoo, D.; Hoi, S.C. Recent advances in deep learning for object detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  16. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988. [Google Scholar]
  17. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9627–9636. [Google Scholar]
  18. Mittal, P. A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif. Intell. Rev. 2024, 57, 242. [Google Scholar] [CrossRef]
  19. Afif, M.; Ayachi, R.; Said, Y.; Atri, M. An evaluation of EfficientDet for object detection used for indoor robots assistance navigation. J. Real-Time Image Process. 2022, 19, 651–661. [Google Scholar] [CrossRef]
  20. Azulay, A.; Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? J. Mach. Learn. Res. 2019, 20, 1–25. [Google Scholar]
  21. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  22. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  23. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  24. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  26. Rukhovich, D.; Vorontsova, A.; Konushin, A. FCAf3D: Fully convolutional anchor-free 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 477–493. [Google Scholar]
  27. Ren, C.; Wang, A.; Yang, C.; Wu, J.; Wang, M. Frequency domain-based cross-layer feature aggregation network for camouflaged object detection. IEEE Signal Process. Lett. 2025, 32, 2005–2009. [Google Scholar] [CrossRef]
  28. Borenstein, J.; Koren, Y. Obstacle avoidance with ultrasonic sensors. IEEE J. Robot. Autom. 1988, 4, 213–218. [Google Scholar] [CrossRef]
  29. Moravec, H.P.; Elfes, A. High resolution maps from wide angle sonar. In Proceedings of the IEEE International Conference on Robotics and Automation, St. Louis, MO, USA, 25–28 March 1985; IEEE: Piscataway, NJ, USA, 1985; pp. 116–121. [Google Scholar]
  30. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2005. [Google Scholar]
  31. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  32. Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; Springer: Cham, Switzerland, 2006; pp. 404–417. [Google Scholar]
  33. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Kauai, HI, USA, 8–14 December 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 511–518. [Google Scholar]
  34. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 886–893. [Google Scholar]
  35. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [Google Scholar] [CrossRef]
  36. Dalal, N.; Triggs, B.; Schmid, C. Human detection using oriented histograms of flow and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; Springer: Cham, Switzerland, 2006; pp. 428–441. [Google Scholar]
  37. Johnson, A.E.; Hebert, M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 433–449. [Google Scholar] [CrossRef]
  38. Rusu, R.B.; Blodow, N.; Beetz, M. Fast point feature histograms (FPFH) for 3D registration. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, 12–17 May 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 3212–3217. [Google Scholar]
  39. Hinterstoisser, S.; Holzer, S.; Cagniart, C.; Ilic, S.; Konolige, K.; Navab, N.; Lepetit, V. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 858–865. [Google Scholar]
  40. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 346–361. [Google Scholar]
  42. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1440–1448. [Google Scholar]
  43. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS); MIT Press: Montreal, QC, Canada, 2015; pp. 91–99. [Google Scholar]
  44. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2961–2969. [Google Scholar]
  45. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  46. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 7263–7271. [Google Scholar]
  47. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  48. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  49. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10781–10790. [Google Scholar]
  50. Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3588–3597. [Google Scholar]
  51. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. arXiv 2018, arXiv:1807.08407. [Google Scholar]
  52. Wang, A.; Sun, Y.; Kortylewski, A.; Yuilie, A. Robust object detection under occlusion with context-aware CompositionalNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 12645–12654. [Google Scholar]
  53. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5562–5570. [Google Scholar]
  54. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
  55. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7036–7045. [Google Scholar]
  56. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware Trident Networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6054–6063. [Google Scholar]
  57. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2965–2974. [Google Scholar]
  58. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3007–3016. [Google Scholar]
  59. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 783–792. [Google Scholar]
  60. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough voting for 3D object detection in point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9276–9285. [Google Scholar]
  61. Qi, C.R.; Chen, X.; Litany, O.; Guibas, L.J. ImVoteNet: Boosting 3D object detection in point clouds with image votes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4404–4413. [Google Scholar]
  62. Zhang, Z.; Sun, B.; Xu, H. H3DNet: 3D object detection using hybrid geometric primitives. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 311–329. [Google Scholar]
  63. Misra, I.; Girdhar, R.; Joulin, A. 3DETR: An end-to-end transformer model for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2926–2937. [Google Scholar]
  64. Song, S.; Xiao, J. Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 808–816. [Google Scholar]
  65. Zhou, M.; Li, B.; Wang, J. Optimization of hyperparameters in object detection models based on fractal loss function. Fractal Fract. 2022, 6, 706. [Google Scholar] [CrossRef]
  66. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  67. Nagy, M. Furniture Detection Dataset[DB/OL]. Roboflow Universe. Roboflow. 2023. Available online: https://universe.roboflow.com/mokhamed-nagy-u69zl/furniture-detection-qiufc (accessed on 20 May 2025).
Figure 1. Schematic of the MDF-YOLO network architecture. The overall network consists of three components: Backbone, Neck, and Head. The Backbone stage is embedded with the CrossGrid Memory Block to perform long-range dependency modeling in orthogonal directions and extract local convolutional features. The Neck stage introduces the multifractal prior-based context aggregation module, which achieves global–local feature fusion through macro- and micro-pathways guided by regularity. The up-sampling paths incorporate the Frequency-Guided Residual Block, which utilizes a frequency compensation mechanism to recover high-frequency details and fuses them with the main path in a residual form. Finally, multi-scale detection heads perform classification and bounding box prediction. This design achieves complementary feature synergy across the spatial, state-space, and frequency domains, effectively enhancing detection performance in indoor scenes.
Figure 2. Operational schematic of the CrossGrid Memory Block. The CrossGrid Memory Block contains two parallel feature processing branches. The left branch is for Orthogonal Grid Memory Modeling, which captures pixel-level long-range dependencies through modeling in orthogonal directions. The right branch is a local feature extraction path, designed to supplement local texture details. The outputs from the two branches are fused and then combined with a residual connection and a convolutional module to produce the final output feature. CGMB achieves complementary feature extraction by combining “long-range dependency modeling along orthogonal directions” with “local convolutional detail supplementation,” and their integration under a residual mechanism ensures effective fusion. This design not only preserves strong spatial awareness and structural consistency but also ensures stable gradient propagation during end-to-end training.
Figure 3. The overall architecture of the HG-HCA Block. The macro pathway employs orthogonal reordering and convolution to model long-range dependencies, while the micro pathway integrates direction-sensitive convolutions and attention mechanisms to enhance details. Both pathways are dynamically regulated by the Hölder-based regularity prior and are fused within the C2f module, producing feature representations that simultaneously capture both global and local information.
Figure 4. Schematic of the Frequency-Guided Residual Block. The module comprises two branches: a spatial up-sampling path and a frequency-domain compensation path. The former preserves structural information through interpolation and convolution, while the latter recovers high-frequency details through frequency weight enhancement and transposed convolution. The outputs of the two are fused in the residual domain and then refined by a 1 × 1 convolution, generating a high-resolution feature map that incorporates both global structure and detailed textures.
Figure 5. Sample images from the dataset.
Figure 6. Visual comparison of detection results from different models in typical indoor scenes.
Figure 7. Comparison of attention heatmaps before and after the introduction of the CGM Block.
Figure 8. Comparison of attention heatmaps before and after the introduction of the HCA Block.
Figure 9. Comparison of attention heatmaps before and after the introduction of the FGR Block.
Figure 10. Comparison of attention heatmaps between the baseline model and the synergistic effect of all modules.
Table 1. Class distribution of the dataset.

| Class ID | Class Name | Instances | Proportion (%) | Image Count |
|---|---|---|---|---|
| 0 | Bed | 1107 | 13.51 | 1088 |
| 1 | Cabinet | 916 | 11.18 | 826 |
| 2 | Closet | 440 | 5.37 | 413 |
| 3 | Chair | 618 | 7.54 | 554 |
| 4 | Lamp | 267 | 3.26 | 216 |
| 5 | Nightstand | 237 | 2.89 | 225 |
| 6 | Shelf | 247 | 3.01 | 244 |
| 7 | Sofa | 2210 | 26.97 | 2163 |
| 8 | Table | 1134 | 13.84 | 1110 |
| 9 | Wall Panel | 391 | 4.77 | 364 |
| 10 | Window | 626 | 7.64 | 409 |
Table 2. Performance comparison of different detection models on the dataset.

| Model | mAP@50 (Val) | mAP@50 (Test) | mAP@50-95 (Val) | mAP@50-95 (Test) | mAP@75 (Val) | mAP@75 (Test) | FPS | Latency (ms) |
|---|---|---|---|---|---|---|---|---|
| YOLOv8 | 0.6904 | 0.6674 | 0.5641 | 0.5551 | 0.6145 | 0.6072 | 378.1 | 2.64 |
| YOLOv9 | 0.6962 | 0.6658 | 0.5684 | 0.5562 | 0.6101 | 0.6114 | 429.25 | 2.33 |
| YOLOv10 | 0.6762 | 0.6655 | 0.5624 | 0.5614 | 0.6065 | 0.6122 | 462.82 | 2.16 |
| YOLOv11 | 0.6958 | 0.6718 | 0.5758 | 0.5631 | 0.6176 | 0.6025 | 484.06 | 2.07 |
| YOLOv12 | 0.6715 | 0.6627 | 0.5535 | 0.5569 | 0.6016 | 0.5952 | 357.14 | 2.80 |
| RT-DETR | 0.6588 | 0.6241 | 0.5264 | 0.5006 | 0.5683 | 0.5369 | 390.17 | 2.56 |
| MDF-YOLO | 0.7158 | 0.6803 | 0.5814 | 0.5615 | 0.6117 | 0.6266 | 354.6 | 2.82 |
Table 3. Impact of different module combinations on model detection performance. A check mark (✓) indicates that the corresponding module is enabled; values in parentheses are differences from the baseline.

| CGM Block | HG-HCA Block | FGR Block | mAP@50 (Val) | mAP@50 (Test) | FPS | Latency (ms) |
|---|---|---|---|---|---|---|
|  |  |  | 0.6904 | 0.6674 | 378.1 | 2.64 |
| ✓ |  |  | 0.7021 (+0.0117) | 0.6728 (+0.0054) | 369.0 (−9.1) | 2.71 (+0.07) |
|  | ✓ |  | 0.6996 (+0.0092) | 0.6715 (+0.0041) | 367.6 (−10.5) | 2.72 (+0.08) |
|  |  | ✓ | 0.6957 (+0.0053) | 0.6698 (+0.0024) | 374.5 (−3.6) | 2.67 (+0.03) |
| ✓ | ✓ |  | 0.7134 (+0.0230) | 0.6784 (+0.0110) | 358.4 (−19.7) | 2.79 (+0.15) |
| ✓ |  | ✓ | 0.7074 (+0.0170) | 0.6755 (+0.0081) | 365.0 (−13.1) | 2.74 (+0.10) |
|  | ✓ | ✓ | 0.7064 (+0.0160) | 0.6745 (+0.0071) | 363.6 (−14.5) | 2.75 (+0.11) |
| ✓ | ✓ | ✓ | 0.7158 (+0.0254) | 0.6803 (+0.0129) | 354.6 (−23.5) | 2.82 (+0.18) |
