Article

Context-Aware Dynamic Integration for Scene Recognition

School of Electronic and Electrical Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3102; https://doi.org/10.3390/math13193102
Submission received: 26 July 2025 / Revised: 24 September 2025 / Accepted: 25 September 2025 / Published: 27 September 2025

Abstract

The identification of scenes poses a notable challenge within the realm of image processing. Unlike object recognition, which typically involves relatively consistent forms, scene images exhibit a broader spectrum of variability. This research introduces an approach that combines image and text data to improve scene recognition performance. A model for tagging images is employed to extract textual descriptions of objects within scene images, providing insights into the components present. Subsequently, a pre-trained encoder converts the text into a feature set that complements the visual information derived from the scene images. These features offer a comprehensive understanding of the scene’s content, and a dynamic integration network is designed to manage and prioritize information from both text and image data. The proposed framework can effectively identify crucial elements by adjusting its focus on either text or image features depending on the scene’s context. Consequently, the framework enhances scene recognition accuracy and provides a more holistic understanding of scene composition. By leveraging image tagging, this study improves the image model’s ability to analyze and interpret intricate scene elements. Furthermore, incorporating dynamic integration increases the accuracy and functionality of the scene recognition system.

1. Introduction

Deep learning-based image analysis models have undergone considerable research and development, yet scene image recognition remains challenging [1,2,3,4,5]. This challenge stems from the intrinsic complexity and diversity of scene images, which significantly differ from object images [6]. Subjects such as dogs typically exhibit a relatively uniform appearance in object recognition, allowing for straightforward feature identification and categorization. However, scene images such as living rooms represent a more abstract concept encompassing a wide range of styles, layouts, or components. This diversity within even a single category of scene images poses a unique challenge for image analysis models because they must discern and capture a broader array of features and patterns to identify and classify these scenes accurately. The complexity of scene image recognition is further compounded by the fact that these images often contain multiple objects and elements, contributing to the overall scene context. Unlike object images, which typically focus on a single subject, scene images require understanding the relationships and interactions between image elements. This requirement demands more advanced feature extraction and analysis techniques, extending beyond the conventional methods employed in object recognition. Moreover, the variability of the lighting, perspective, and scale in scene images contributes additional complexity. The same scene can appear markedly different under varying conditions, making consistent and reliable recognition more challenging.
Scene recognition fundamentally differs from object recognition in several key aspects. Object recognition typically focuses on identifying and classifying individual entities with relatively consistent visual characteristics—for instance, recognizing a dog as a specific breed based on distinct morphological features. In contrast, scene recognition aims to understand and classify entire environments or contexts, such as identifying a “living room,” “kitchen,” or “beach.” This task requires comprehending not just individual objects within the scene, but their spatial relationships, contextual arrangements, and the overall semantic meaning they collectively convey. Unlike object recognition, which can often rely on localized discriminative features, scene recognition demands a holistic understanding of the image. A living room, for example, is not defined by the presence of specific objects alone, but by the particular combination, arrangement, and contextual relationships of furniture, decorations, and spatial layout that collectively create the semantic concept of “living room.”
This work focuses exclusively on scene recognition from still images, without incorporating temporal components or video sequences. While video-based scene recognition leverages temporal dynamics and motion patterns to enhance understanding, image-based scene recognition must rely solely on spatial visual information captured in a single frame. This constraint makes the task more challenging, as temporal cues that might disambiguate similar scenes are unavailable, requiring the model to extract comprehensive semantic understanding from static visual content alone.
Figure 1 presents images from the same category, the living room, in the MIT Indoor 67 dataset [7], serving as an example of this diversity. Although they share the same label (living room), the images in Figure 1 display a wide range of styles, layouts, and decor, highlighting the complexity and variability within a single category. In comparison, while varied, images of Shetland sheepdogs in Figure 2 generally present less complexity regarding backgrounds and contexts. The living room images depict higher diversity and intricacy, with each room featuring unique arrangements, furniture, and design elements. This contrast exemplifies the broader range of challenges in categorizing and understanding images with diverse indoor settings compared to more uniform subjects, such as specific dog breeds.
The complexity is further compounded by several factors. (1) Multi-object complexity: Scene images typically contain multiple objects and elements contributing to overall context, requiring understanding of relationships and interactions between components rather than focusing on single subjects. (2) Environmental variability: Changes in lighting conditions, viewing perspectives, and scale can make identical scenes appear markedly different. (3) Semantic abstraction: Scene categories represent high-level semantic concepts that cannot be defined by simple visual templates or object inventories.
Current scene recognition methods face several significant limitations. Traditional approaches that combine object features with scene-level features [9,10,11,12,13] often suffer from restricted object vocabularies. For example, semantic segmentation-based methods like SASceneNet [10] limit recognition to predefined categories (e.g., 152 object types), creating bottlenecks in environments rich with diverse objects, particularly indoor settings where object variety frequently exceeds predetermined taxonomies. Furthermore, existing methods often struggle with the following. (1) Limited semantic understanding: Many approaches focus on low-level visual features without capturing high-level semantic relationships between scene elements. (2) Inadequate feature integration: Conventional fusion strategies fail to effectively combine multi-modal information (visual features, object information, spatial relationships) in semantically meaningful ways. (3) Scale and perspective sensitivity: Most methods lack robustness to variations in viewpoint, scale, and environmental conditions. (4) Context modeling deficiency: Existing approaches inadequately model the complex contextual relationships that define scene semantics.
To address these limitations, this paper proposes a novel approach that leverages the Recognize-Anything Model (RAM) to extract comprehensive object-related textual descriptors, combined with the CLIP framework to capture semantic relationships between textual and visual information. We incorporate a dynamic switch network to efficiently integrate extracted image and text features, thereby enhancing scene recognition performance while overcoming the vocabulary limitations of traditional semantic segmentation approaches. We summarize the highlights of our study as follows.

Highlights of the Proposed Study

  • Dynamic Multimodal Integration: Introduces a dynamic switch network that adaptively fuses image features with extensive text-based object descriptors from the Recognize-Anything Model (RAM), enabling context-aware weighting for improved scene recognition accuracy.
  • Enhanced Semantic Understanding: Leverages RAM’s rich and diverse object tagging (over 4500 categories) combined with the CLIP text encoder to capture comprehensive semantic relationships, surpassing traditional approaches limited by smaller category sets.
  • Versatility and Robust Performance: Demonstrates consistent and superior accuracy across multiple challenging benchmarks (MIT Indoor 67, Places365, SUN397) with both CNN-based and Vision Transformer-based backbones, validating the method’s adaptability and effectiveness.

2. Related Works

2.1. Semantic-Aware Scene Recognition (SASceneNet)

SASceneNet [10] is a model designed for scene recognition, featuring a unique architecture that integrates multiple components to capture comprehensive scene information. The model includes an RGB branch, which extracts features directly from the image, and a semantic branch, which derives features from the output of a semantic segmentation network. These branches enable the model to understand the visual appearance and contextual elements in a scene. Furthermore, SASceneNet includes an attention module, which is crucial for determining the significance of the features extracted by the RGB and semantic branches. This module learns to focus on the most relevant information in the input data, enhancing the ability of the model to recognize and interpret complex scenes accurately. By leveraging the attention mechanism, SASceneNet can dynamically prioritize the features most pertinent to the scene context, improving recognition performance. Building on semantic-aware approaches, Wang et al. [14] proposed GroupContrast, which introduces semantic-aware self-supervised representation learning that addresses semantic conflicts in 3D scene understanding, demonstrating the broader applicability of semantic-aware approaches beyond 2D scene recognition. Recent work has also explored multimodal scene recognition using semantic segmentation and deep learning integration [15], showing how semantic awareness can be extended to multi-sensor scenarios. Additionally, researchers have developed multiscale feature fusion models for indoor scene recognition [16], incorporating semantic-aware strategies to improve performance in constrained environments.

2.2. Fusion of Object and Scene Network (FOSNet)

FOSNet [11] is a scene recognition framework designed to enhance performance by exploiting characteristics unique to scene images. It introduces a scene coherence loss, based on the observation that the class of a scene image is consistent throughout the entire image, which guides the learning process to maintain class consistency across the image. It also incorporates correlative context gating, which efficiently combines object and scene features by selecting important features and fusing them effectively, thereby reducing feature redundancy and improving recognition performance over previous methods that simply concatenated features. The fusion-based approach pioneered by FOSNet has inspired subsequent developments in multi-modal object–scene integration. Deevi et al. [17] presented RGB-X object detection via scene-specific fusion modules, which leverages scene-specific attention mechanisms similar to FOSNet’s correlative context gating but extends to multi-modal sensor fusion. Furthermore, Alazeb et al. [18] developed a comprehensive multi-object detection and scene recognition system that incorporates object–scene relationships, demonstrating the continued relevance of fusion-based approaches in modern applications. Recent advances have also shown the effectiveness of CNN-based multi-object segmentation and feature fusion strategies [19] in maintaining the core principles established by FOSNet while improving computational efficiency.

2.3. Multiple Representation Network (MRNet)

MRNet [4] utilizes two pretrained CNNs to extract original feature maps, each contributing distinct visual information, thereby generating more diverse scene representations. The enhanced global scene representation is obtained using a pretrained CNN, capturing the overall layout and structure of the scene, which provides a broad context essential for distinguishing between different scene categories. Local salient scene representation is derived using Class Activation Mapping (CAM) applied to the feature maps generated by another pretrained CNN. CAM helps identify and focus on the most important regions within the scene, allowing the network to extract detailed local features that are crucial for recognizing specific scene elements. Local contextual object representation is achieved through a bidirectional Long Short-Term Memory (LSTM) module. This component encodes the contextual information of objects present in the scene, capturing the relationships and interactions between different objects. This contextual encoding enhances the network’s ability to understand complex scenes where the arrangement and interaction of objects play a significant role in scene recognition.
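The local salient branch of MRNet relies on Class Activation Mapping. The sketch below illustrates the generic CAM computation (weighting the final convolutional feature maps by the classifier weights of a target class) in PyTorch; it is an illustrative example built on an off-the-shelf ResNet-50, not MRNet’s actual implementation, and the backbone, layer names, and normalization are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Generic Class Activation Mapping (CAM) sketch -- illustrative only.
# Assumes a ResNet-style backbone whose final conv features feed a
# global-average-pooled linear classifier.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.eval()

features = {}
def hook(_, __, output):
    features["maps"] = output          # (B, C, H, W) final conv feature maps
resnet.layer4.register_forward_hook(hook)

def class_activation_map(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """Return an (H, W) activation map highlighting regions for class_idx."""
    with torch.no_grad():
        resnet(image)                                  # forward pass fills the hook
    fmap = features["maps"][0]                         # (C, H, W)
    weights = resnet.fc.weight[class_idx]              # (C,) classifier weights
    cam = torch.einsum("c,chw->hw", weights, fmap)     # weighted sum over channels
    cam = F.relu(cam)
    return cam / (cam.max() + 1e-8)                    # normalize to [0, 1]

# usage: cam = class_activation_map(torch.randn(1, 3, 224, 224), class_idx=0)
```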
The multiple representation paradigm has been further developed in various domains. Du et al. [20] proposed Translate-to-Recognize Networks (TRecgNet) for RGB-D scene recognition, which utilizes multiple representation learning through translation between modalities, showing how multi-view approaches can be extended to depth information. Additionally, recent medical imaging applications have demonstrated the versatility of MRNet architectures, with multifaceted resilient networks being adapted for medical image analysis [21], showcasing the broader applicability of multiple representation learning beyond traditional scene recognition tasks.

2.4. Recognize-Anything Model (RAM)

RAM [22] is an image-tagging model that handles various objects within images. This model can extract text related to over 6400 objects and effectively generate more than 4500 unique semantic tags when synonyms are not counted separately. Integrating and processing data from various academic datasets for classification, detection, and segmentation using commercial tagging tools leads to a rich and diverse set of object tags. The architecture of RAM consists of three main components. The first is an image encoder for extracting features from the image. This module is crucial for identifying distinct visual elements that can be tagged. The second component is an image-tag recognition decoder, focusing on the tagging process using features extracted by the encoder to identify and label image objects. The third component is a text generation encoder–decoder for creating captions. This module can contextualize the identified objects and their interactions in the image, providing a descriptive text output. One of the notable attributes of RAM is its zero-shot recognition capability, indicating that the model can accurately recognize and tag categories of objects on which it has not explicitly been trained. This feature is beneficial in dealing with various objects and scenarios, making RAM versatile and practical across image contexts. The evolution of RAM has continued with extended versions [22] demonstrating improved open-set recognition capabilities and integration into various downstream applications. Recent implementations have shown how RAM can be combined with text generation models like Tag2Text for enhanced scene understanding and captioning tasks [23], establishing RAM as a foundational model for large-scale image understanding.
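To make the tagging step concrete, the sketch below shows how a RAM-style output (a list of tag strings) could be converted into a single textual description for a downstream text encoder. The `run_ram_tagger` function is a hypothetical placeholder for a RAM inference wrapper, and the prompt template is an assumption; only the string handling is illustrated.

```python
from typing import List

def run_ram_tagger(image_path: str) -> List[str]:
    """Hypothetical placeholder for RAM inference; a real implementation would
    load the Recognize-Anything model and return its predicted tag strings."""
    raise NotImplementedError("plug in the RAM inference call here")

def tags_to_prompt(tags: List[str], max_tags: int = 30) -> str:
    """Join de-duplicated tags into one comma-separated description string."""
    seen, kept = set(), []
    for tag in tags:
        t = tag.strip().lower()
        if t and t not in seen:
            seen.add(t)
            kept.append(t)
        if len(kept) >= max_tags:
            break
    return "a photo containing " + ", ".join(kept)

# Example with a hypothetical tag list a RAM model might produce:
# tags_to_prompt(["sofa", "lamp", "coffee table", "rug"])
#   -> "a photo containing sofa, lamp, coffee table, rug"
```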

2.5. Contrastive Language-Image Pretraining (CLIP)

CLIP [24] is a multimodal model that learns the associations between images and text. The primary concept of CLIP is to align the two modalities using the text descriptions (captions) associated with the images. To achieve this, CLIP employs a contrastive loss function, maximizing the similarity between matched image–text pairs while minimizing it for mismatched (negative) pairs. Pretrained on approximately 400 million image–text pairs, CLIP demonstrates impressive transfer learning capabilities across downstream tasks, exhibiting exceptional zero- and few-shot performance. This method has demonstrated remarkable results in numerous vision benchmarks, including image classification, object detection, and visual question answering. A transformer-based framework comprising vision and text encoders is central to the CLIP architecture. The vision transformer (ViT) encodes images into visual features, whereas the text transformer encodes the text into linguistic features. The learning process progresses by measuring the similarity of these two feature embeddings. CLIP’s versatility has been demonstrated across various specialized applications in scene recognition. Bose et al. [25] developed MovieCLIP for visual scene recognition in movies, creating a movie-centric taxonomy and demonstrating CLIP’s effectiveness in domain-specific scene recognition tasks. In autonomous driving applications, Elhenawy et al. [26] showed CLIP’s potential in dynamic scene understanding, achieving a 91.1% F1 score in scene classification and demonstrating real-time deployment capabilities. Furthermore, Zhao et al. [27] adapted CLIP for scene text recognition through CLIP4STR, showing how vision-language models can be effectively transferred to text-centric scene understanding tasks. Recent work has also explored CLIP’s application in fine-grained domain identification [28], further expanding its utility in specialized recognition scenarios.
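As a concrete illustration, the snippet below encodes a tag-derived description with the publicly available CLIP text encoder (the openai `clip` package, with ViT-B/32 chosen arbitrarily as the variant); it is a minimal usage sketch rather than part of any specific pipeline.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Encode a RAM-style tag description with the CLIP text encoder.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

description = "a photo containing sofa, lamp, coffee table, rug"
tokens = clip.tokenize([description]).to(device)        # (1, 77) token ids

with torch.no_grad():
    text_embedding = model.encode_text(tokens)           # (1, 512) feature vector
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

print(text_embedding.shape)  # torch.Size([1, 512])
```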

2.6. Dynamic Switch Networks

The dynamic switch network [29] was inspired by the gated recurrent units in RNNs. It was designed to dynamically control the amount of information flowing between a large pre-trained language model, such as BERT, and a neural machine translation (NMT) model. The network assesses the input signals emitted from both models and outputs a number between 0 and 1 for each element. A value of 1 signifies the complete transmission of that element, whereas 0 indicates its total exclusion:
g = \sigma(W h_{lm} + U h_{nmt} + b),
where $\sigma(\cdot)$ represents the logistic sigmoid function, $h_{lm}$ denotes the hidden state of the pretrained language model, and $h_{nmt}$ denotes the hidden state of the original NMT model. The gating function $g$ serves as a learnable attention mechanism that determines the relative importance of information from both models at each time step. The linear transformation $W h_{lm} + U h_{nmt} + b$ combines the hidden representations from both models through learned weight matrices $W \in \mathbb{R}^{d \times d_{lm}}$ and $U \in \mathbb{R}^{d \times d_{nmt}}$, where $d$ is the output dimension, $d_{lm}$ is the dimension of the language model hidden state, and $d_{nmt}$ is the dimension of the NMT hidden state. The bias term $b \in \mathbb{R}^{d}$ provides additional flexibility in the gating decision. The sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$ ensures that each element of $g$ lies within the range $(0, 1)$, enabling it to function as a continuous gate that can smoothly interpolate between the two input sources.
The hidden states of the NMT model and pretrained language model are integrated as follows:
h = g \odot h_{lm} + (1 - g) \odot h_{nmt},
where $\odot$ represents elementwise multiplication. With $g$ set to 0, the network outputs only from $h_{nmt}$, and when set to 1, it outputs only from $h_{lm}$. This equation implements a weighted combination of the two hidden representations using the computed gate values. The operation $g \odot h_{lm}$ performs elementwise multiplication between the gate vector and the language model hidden state, effectively scaling each dimension of $h_{lm}$ by its corresponding gate value. Similarly, $(1 - g) \odot h_{nmt}$ scales the NMT hidden state by the complement of the gate values, ensuring that the total contribution from both sources sums to the original magnitude. The final hidden state $h$ represents a dynamic fusion where each dimension can independently choose how much information to retain from each source model. When $g_i = 1$ for dimension $i$, the output $h_i$ equals $h_i^{lm}$, completely relying on the language model. Conversely, when $g_i = 0$, the output $h_i$ equals $h_i^{nmt}$, fully utilizing the NMT model. For intermediate values $0 < g_i < 1$, the output represents a weighted interpolation, allowing the network to leverage complementary strengths from both models in a fine-grained, dimension-specific manner.
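The two equations above translate directly into a few lines of PyTorch; the following sketch assumes both hidden states share the same dimension $d$ so that the element-wise mix is well defined.

```python
import torch
import torch.nn as nn

class DynamicSwitch(nn.Module):
    """g = sigmoid(W h_lm + U h_nmt + b);  h = g * h_lm + (1 - g) * h_nmt.
    Assumes both inputs share the dimension d so the gated mix is well defined."""
    def __init__(self, d: int):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)
        self.U = nn.Linear(d, d, bias=True)   # the bias of U plays the role of b

    def forward(self, h_lm: torch.Tensor, h_nmt: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.W(h_lm) + self.U(h_nmt))   # each entry lies in (0, 1)
        return g * h_lm + (1 - g) * h_nmt                 # element-wise interpolation

# usage: fused = DynamicSwitch(d=512)(torch.randn(4, 512), torch.randn(4, 512))
```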
The dynamic switching paradigm has been extended to various domains beyond natural language processing. Eser et al. [30] explored dynamic link switching in network synchronization, providing theoretical foundations for understanding when and how dynamic switching improves system performance. In distributed systems, Zafar et al. [31] proposed DSMLB (Dynamic Switch Migration-based Load Balancing), demonstrating practical applications of dynamic switching mechanisms in IoT and multi-controller software-defined networks. Recent neuroscience research has also shown that dynamic switching between brain networks predicts cognitive performance [32], suggesting that the dynamic switch concept extends to biological information processing systems.

2.7. Recent Surveys and Comprehensive Approaches

Recent comprehensive surveys [33,34] have identified scene recognition as a critical component of modern intelligent systems. These works emphasize the importance of multi-scale feature fusion, discriminative region detection, and object correlation analysis in contemporary scene recognition frameworks. The surveys highlight six main categories of approaches: spatial layout pattern learning, discriminative region detection, object correlation analysis, hybrid deep models, attention mechanisms, and multi-modal fusion strategies.
Contemporary research has focused on optimizing multimodal scene recognition through relevant feature fusion and transfer learning [35], demonstrating how traditional approaches can be enhanced through modern deep learning techniques. Additionally, advances in scene text detection and recognition [34] have shown the interconnected nature of scene understanding tasks, where text-based semantic information complements visual scene analysis. These comprehensive approaches highlight the evolution from single-modality methods to integrated, multi-faceted systems that leverage various types of visual and semantic information for robust scene recognition.

3. Methods

3.1. Overall Architecture

In this paper, the Image Model is used to extract fundamental features from a scene image. Rather than being fixed to a single architecture, the Image Model can incorporate both CNN-based image models and Vision Transformer models. In our experiments, we employed ResNet-50 and Swin-Tiny to validate and compare performance. RAM is designed to extract information about various objects from a given scene image in textual form. By identifying multiple objects within the image, RAM provides text-based descriptors that can be further processed by subsequent modules. The CLIP Text Encoder receives the textual descriptions generated by RAM and converts them into embedding vectors. This encoding process captures semantic relationships between words, enabling more effective integration with image-based features in later stages. As illustrated in Figure 3, the Dynamic Switch Network takes both the scene image embeddings (from the Image Model) and the textual embeddings (from the CLIP Text Encoder) as inputs. It is trained to focus on the most relevant information from each source, dynamically adjusting its attention and weighting to enhance the overall performance for downstream tasks.
The proposed approach enhances scene recognition by fusing image and text features. Initially, an image model extracts features from scene images, capturing essential visual details. The RAM [22], an image-tagging model, identifies and extracts text descriptions for scene objects, covering a range of 4585 object categories. Next, the extracted text descriptions are input into the CLIP [24] text encoder, which is designed to comprehend the relationship between text and images and transforms these descriptions into one-dimensional text feature tensors suitable for integration with the image features. The central aspect of the proposed method involves combining these two types of features: the image features from the scene and text features related to objects within it. After that, the dynamic switch performs the integration, merging image and text features into a single embedding, which encompasses visual and textual elements of the scene:
g = \sigma(W i + U t + b), \qquad h = g \odot i + (1 - g) \odot t,
where $i$ represents the image embedding output by the image model, whereas $t$ denotes the text embedding output by the text model. As the dynamic switch network is trained, it learns to dynamically adjust the amount of information flowing from the two models, depending on the received embeddings. Finally, this combined embedding is processed in the classification layer, categorizing the scene based on the fused features. This dual-feature approach provides a more detailed understanding of the scene, contributing to more accurate and efficient scene recognition. The proposed method effectively employs the strengths of visual and textual data, offering a refined technique for scene recognition. The two-stage training approach initially trains the image model. Then, the image model, RAM, and the CLIP text encoder are fixed, allowing only the dynamic switch network and classification layers to be trained. The two-stage learning procedure ensures that the image model is trained well through various augmentation techniques and prevents the dynamic switch network from being dominated by one branch, enabling balanced learning.
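A minimal PyTorch sketch of the second training stage is shown below: the image backbone is frozen, the CLIP text embedding of the RAM tags is supplied as an input, and only the projection, switch, and classification layers carry gradients. The layer names, projection layers, and dimensions are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class SceneRecognizer(nn.Module):
    """Image/text fusion with a dynamic switch; the image backbone is frozen."""
    def __init__(self, image_backbone: nn.Module, d_img: int, d_txt: int,
                 d_fuse: int, num_classes: int):
        super().__init__()
        self.backbone = image_backbone
        # project both modalities to a shared dimension so the gate applies element-wise
        self.img_proj = nn.Linear(d_img, d_fuse)
        self.txt_proj = nn.Linear(d_txt, d_fuse)
        self.W = nn.Linear(d_fuse, d_fuse, bias=False)
        self.U = nn.Linear(d_fuse, d_fuse, bias=True)
        self.classifier = nn.Linear(d_fuse, num_classes)

        # Stage 2: freeze the pretrained image backbone (RAM and the CLIP text
        # encoder are likewise kept fixed, so only the fusion layers are trained).
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, image: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
        i = self.img_proj(self.backbone(image))      # scene image embedding
        t = self.txt_proj(text_embedding)            # CLIP embedding of RAM tags
        g = torch.sigmoid(self.W(i) + self.U(t))     # dynamic switch gate
        h = g * i + (1 - g) * t                      # fused representation
        return self.classifier(h)
```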

3.2. Dataset

We used the MIT Indoor 67 [7], Places365 [36], and SUN397 [37] datasets for a comprehensive evaluation. These datasets are diverse and challenging, providing a platform for testing the effectiveness of the approach.

3.2.1. MIT Indoor 67

The MIT Indoor 67 dataset [7] is a widely used benchmark in scene recognition research, comprising 15,620 images representing 67 distinct indoor categories, from commonplace environments (e.g., living rooms and kitchens) to more specialized settings (e.g., libraries and stores). Each category in the dataset is represented by approximately 100 images, providing a diverse set of visual data encapsulating the variety and complexity typical of indoor scenes. The significance of the MIT Indoor 67 dataset lies in its challenge to computer vision models because indoor scenes often contain a high degree of variability of objects, layout, and lighting conditions. This variability makes accurate scene classification challenging and computationally demanding, serving as a testbed for evaluating the robustness and accuracy of various machine learning models.

3.2.2. Places365

The Places365 dataset [36] is an extensive collection of images designed to facilitate developing and evaluating machine learning models in scene recognition. The dataset is composed of over 1.8 million images spanning 365 scene categories. These categories encompass diverse environments, from natural landscapes (e.g., beaches and mountains) to human-made settings (e.g., restaurants and subway stations). Each category in the Places365 dataset contains approximately 5000 images, curated to provide a balanced representation of scene types. This extensive collection is derived from a larger corpus, the Places dataset, which originally included more than 10 million images. The dataset is split into training, validation, and testing sets, with 1.6 million images designated for training and the remainder divided equally between validation and testing sets.

3.2.3. SUN397

The SUN397 dataset [37] is an expansive and widely recognized benchmark in scene recognition and is part of the SUN database, designed to facilitate comprehensive studies in scene understanding and classification. The SUN397 dataset specifically includes 108,754 images that span 397 unique scene categories, providing a broad spectrum of environments that range from natural landscapes, such as forests and oceans, to human-made structures, such as churches and bedrooms. Each category in the dataset includes at least 100 images, ensuring a balanced representation that aids in mitigating bias that may arise from uneven data distribution. This feature is critical for training robust machine learning models that must perform well across diverse visual contexts.

4. Results

4.1. Performance Evaluation

To evaluate the effectiveness of our proposed multimodal scene recognition approach, we conducted comprehensive experiments across multiple benchmark datasets using different backbone architectures. Our method integrates visual features from CNN/ViT models with textual descriptions from RAM through a dynamic switch network, enabling more comprehensive scene understanding. During training, we fixed the parameters of the RGB backbone, RAM, and CLIP text encoder to maintain their pretrained states, allowing focused optimization of the dynamic switch network and classification layer. This approach leverages robust pre-learned feature representations while training only the components specific to our multimodal fusion strategy.
We employed a series of data augmentation techniques to enhance the robustness of the model by introducing variability into the training dataset. Initially, Gaussian blur was applied randomly to each image to simulate slight focus variations often encountered in real-world scenarios. Subsequently, contrast normalization was employed to adjust the contrast levels, ensuring model sensitivity across various lighting conditions. Further, additive Gaussian noise was added to the images to mimic the sensor noise commonly present in digital photography, followed by the multiply method to randomly alter the brightness and expose the model to diverse exposure levels. After these augmentations, images were randomly cropped to 224 × 224 to focus on scene sections, training the model to recognize scenes from partial views [38]. Additionally, we applied a random horizontal flip to each image with a probability of 50% to account for the orientation variability in natural scenes. Finally, a normalization transformation was performed to standardize the pixel values across the dataset, facilitating more stable and faster convergence during neural network training. These comprehensive data augmentation strategies are instrumental in enhancing the generalizability of the model to classify diverse and challenging scene conditions accurately in real-world applications.
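The augmentation pipeline described above can be sketched as follows; the specific probability and magnitude ranges are illustrative assumptions rather than the exact values used in training, and the input images are assumed to be larger than 224 × 224 before cropping.

```python
import numpy as np
import imgaug.augmenters as iaa
from torchvision import transforms
from PIL import Image

# Illustrative augmentation pipeline mirroring the described steps; the
# probability and magnitude ranges below are assumptions, not the paper's values.
pixel_aug = iaa.Sequential([
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0.0, 1.0))),   # random slight defocus
    iaa.LinearContrast((0.75, 1.5)),                          # contrast normalization
    iaa.AdditiveGaussianNoise(scale=(0.0, 0.05 * 255)),       # sensor-like noise
    iaa.Multiply((0.8, 1.2)),                                 # random brightness change
])

tensor_aug = transforms.Compose([
    transforms.RandomCrop(224),                               # partial-view crops
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],          # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def augment(image: Image.Image):
    arr = pixel_aug(image=np.array(image))       # imgaug operates on numpy arrays
    return tensor_aug(Image.fromarray(arr))
```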
We used a learning rate of 0.00005 with the AdamW optimizer and cross-entropy loss for training. This choice is standard for classification tasks because it measures the discrepancy between the predicted class probabilities and the actual distribution, providing a robust objective for optimizing the model. By combining these advanced optimization techniques and a suitable loss function, this setup refines the learning process, improving the ability of the model to accurately classify complex scenes under diverse conditions. During evaluation, we employed a ten-crop evaluation method to further enhance the model’s robustness and performance. This technique involves using ten different crops from each image, ensuring a more comprehensive assessment and better generalization of the model’s capabilities.
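A condensed sketch of this optimization and ten-crop evaluation setup is given below; `model`, along with the data objects, is assumed to be defined elsewhere (e.g., the fusion model sketched in Section 3.1), and the resize step before cropping is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Optimization setup as described: AdamW with lr 5e-5 and cross-entropy loss.
# Only parameters with requires_grad=True (switch and classifier) are updated.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5)
criterion = nn.CrossEntropyLoss()

def train_step(images, text_embeddings, labels):
    optimizer.zero_grad()
    loss = criterion(model(images, text_embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Ten-crop evaluation: average predictions over ten crops of each image.
ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),                              # 4 corners + center, plus flips
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),      # (10, 3, 224, 224)
])

@torch.no_grad()
def predict_ten_crop(pil_image, text_embedding):
    crops = ten_crop(pil_image)                           # ten views of one image
    logits = model(crops, text_embedding.expand(10, -1))  # run all crops as one batch
    return logits.mean(dim=0).argmax().item()             # average, then pick the class
```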
Our proposed method leverages RAM’s capability to recognize up to 4585 object categories, providing detailed textual descriptions that complement visual features. This is particularly beneficial for indoor scene recognition, where complex object arrangements and categorical similarities require comprehensive understanding beyond traditional visual features alone.

4.2. Comparison with State-of-the-Art Methods

We systematically compared our approach against existing state-of-the-art scene recognition methods across multiple benchmark datasets. Table 1 presents comprehensive comparisons on MIT Indoor 67 and SUN397 datasets using ResNet-50 as the backbone architecture. Our method achieved superior performance with 89.28% accuracy on MIT Indoor 67 and 75.18% on SUN397, outperforming previous state-of-the-art approaches. Notably, we surpassed MRNet (88.08% and 73.98%), SASceneNet (87.10% and 74.04%), and MFAFSNet (88.06% and 73.35%) on both datasets. The improvement is particularly significant on SUN397, where our method achieved a 1.14% improvement over the previous best result, demonstrating the effectiveness of incorporating detailed textual object information for scene understanding.
To verify the generalizability of our approach across different architectural paradigms, we evaluated the method using Swin Transformer tiny as the backbone. Table 2 demonstrates consistent improvements across three benchmark datasets: MIT Indoor 67 (80.66% → 85.98%), Places365 (54.81% → 57.39%), and SUN397 (63.14% → 66.15%). These results confirm that our multimodal fusion strategy is architecture-agnostic and consistently enhances performance regardless of the underlying image model type.
The consistent improvements across different backbone architectures and datasets validate that our approach successfully combines complementary information from visual and textual modalities. The method demonstrates particular strength in complex indoor environments where detailed object recognition significantly contributes to accurate scene classification.

4.3. Ablation Study

We conducted systematic ablation studies to validate the contribution of each component in our proposed architecture. Table 3 presents results comparing individual modules against the complete integrated system on MIT Indoor 67 and SUN397 datasets.
The results clearly demonstrate the complementary nature of visual and textual modalities. On MIT Indoor 67, the Image Model alone achieved 84.40% accuracy while the Text Model reached 79.54%. However, our integrated approach significantly outperformed both individual components, achieving 89.28% accuracy. Similar trends were observed on SUN397, where the Image Model and Text Model achieved 70.87% and 61.26%, respectively, whereas the integrated approach reached 75.18%.
These findings confirm that visual and textual information provides complementary scene understanding capabilities. Visual features capture spatial relationships and visual patterns, while textual descriptions from RAM provide detailed object-level semantic information. The dynamic switch network effectively fuses these modalities, resulting in superior scene recognition performance that exceeds the capabilities of either modality alone.

5. Conclusions

This paper proposes a novel multimodal scene recognition method that significantly advances the state of the art by integrating visual features with detailed textual object descriptions through a dynamic switch network. Our key findings demonstrate that this approach achieves substantial performance improvements: 89.28% accuracy on MIT Indoor 67 (surpassing previous best by 1.20%) and 75.18% on SUN397 (exceeding previous best by 1.14%). These improvements represent meaningful advances in a mature field where incremental gains are increasingly difficult to achieve.
The core finding of our research is that dynamically balancing visual and textual modalities through learned attention mechanisms enables more comprehensive scene understanding than either modality alone. Our ablation studies reveal that the integrated approach outperforms individual visual (84.40%) and textual (79.54%) components by significant margins, confirming the complementary nature of these information sources. This finding challenges the traditional reliance on purely visual approaches and establishes multimodal fusion as a promising direction for scene recognition.
Importantly, our findings demonstrate architecture-agnostic effectiveness, with consistent improvements observed across both CNN-based (ResNet-50) and Transformer-based (Swin-Transformer) backbones. The method shows particular strength in complex indoor environments, where detailed object-level understanding proves crucial for accurate scene classification. These findings suggest that incorporating rich semantic object information addresses fundamental limitations in current scene recognition approaches, opening new avenues for leveraging multimodal pretraining in computer vision tasks.
The robustness of our findings across multiple benchmark datasets (MIT Indoor 67, SUN397, Places365) and evaluation protocols validates the practical applicability of this approach, positioning it as a reliable enhancement for diverse scene recognition applications.

Author Contributions

Conceptualization, C.H.B. and S.A.; methodology, C.H.B. and S.A.; validation, C.H.B. and S.A.; formal analysis, C.H.B.; investigation, C.H.B. and S.A.; writing—original draft preparation, C.H.B.; writing—review and editing, S.A.; visualization, C.H.B.; supervision, S.A.; project administration, S.A.; funding acquisition, S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2025-02214941) and by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. RS-2025-02218631).

Data Availability Statement

The datasets used for this study are available in the public domain (MIT Indoor 67: https://web.mit.edu/torralba/www/indoor.html, accessed on 1 March 2025) and were downloaded from the TensorFlow Datasets repository (Places365 and SUN397).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Thapa, A.; Horanont, T.; Neupane, B.; Aryal, J. Deep learning for remote sensing image scene classification: A review and meta-analysis. Remote Sens. 2023, 15, 4804. [Google Scholar] [CrossRef]
  2. Zeng, D.; Liao, M.; Tavakolian, M.; Guo, Y.; Zhou, B.; Hu, D.; Pietikäinen, M.; Liu, L. Deep learning for scene classification: A survey. arXiv 2021, arXiv:2101.10531. [Google Scholar] [CrossRef]
  3. Matei, A.; Glavan, A.; Talavera, E. Deep learning for scene recognition from visual data: A survey. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems, Gijón, Spain, 11–13 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 763–773. [Google Scholar]
  4. Lin, C.; Lee, F.; Xie, L.; Cai, J.; Chen, H.; Liu, L.; Chen, Q. Scene recognition using multiple representation network. Appl. Soft Comput. 2022, 118, 108530. [Google Scholar] [CrossRef]
  5. Kwon, W.; Baek, S.; Baek, J.; Shin, W.; Gwak, M.; Park, P.; Lee, S. Reinforced Intelligence Through Active Interaction in Real World: A Survey on Embodied AI. Int. J. Control. Autom. Syst. 2025, 23, 1597–1612. [Google Scholar] [CrossRef]
  6. Peng, J.; Mei, X.; Li, W.; Hong, L.; Sun, B.; Li, H. Scene complexity: A new perspective on understanding the scene semantics of remote sensing and designing image-adaptive convolutional neural networks. Remote Sens. 2021, 13, 742. [Google Scholar] [CrossRef]
  7. Quattoni, A.; Torralba, A. Recognizing indoor scenes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 413–420. [Google Scholar]
  8. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  9. Bau, D.; Zhou, B.; Khosla, A.; Oliva, A.; Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6541–6549. [Google Scholar]
  10. López-Cifuentes, A.; Escudero-Viñolo, M.; Bescós, J.; García-Martín, A. Semantic-aware scene recognition. Pattern Recognit. 2020, 102, 107256. [Google Scholar] [CrossRef]
  11. Seong, H.; Hyun, J.; Kim, E. FOSNet: An end-to-end trainable deep neural network for scene recognition. IEEE Access 2020, 8, 82066–82077. [Google Scholar] [CrossRef]
  12. Herranz, L.; Jiang, S.; Li, X. Scene recognition with CNNs: Objects, scales and dataset bias. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 571–579. [Google Scholar]
  13. Cheng, X.; Lu, J.; Feng, J.; Yuan, B.; Zhou, J. Scene recognition with objectness. Pattern Recognit. 2018, 74, 474–487. [Google Scholar] [CrossRef]
  14. Wang, C.; Jiang, L.; Wu, X.; Tian, Z.; Peng, B.; Zhao, H.; Jia, J. GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  15. Naseer, A.; Alnusayri, M.; Alhasson, H.F.; Alatiyyah, M.; AlHammadi, D.A.; Jalal, A.; Park, J. Multimodal scene recognition using semantic segmentation and deep learning integration. PeerJ Comput. Sci. 2025, 11, e2858. [Google Scholar] [CrossRef]
  16. Li, J.W.; Yan, G.W.; Jiang, J.W.; Cao, Z.; Zhang, X.; Song, B. Construction of a multiscale feature fusion model for indoor scene recognition. Sci. Rep. 2025, 15, 14701. [Google Scholar] [CrossRef]
  17. Deevi, P.K.; Lee, C.; Gan, L.; Nagesh, S.; Pandey, G.; Chung, S.J. RGB-X Object Detection via Scene-Specific Fusion Modules. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024. [Google Scholar]
  18. Alazeb, A.; Chughtai, B.R.; Al Mudawi, N.; AlQahtani, Y.; Alonazi, M.; Aljuaid, H.; Jalal, A.; Liu, H. Remote intelligent perception system for multi-object detection and scene recognition. Sci. Rep. 2024, 14, 1398703. [Google Scholar]
  19. Rafique, A.A.; Ghadi, Y.Y.; Alsuhibany, S.A.; Chelloug, S.A.; Jalal, A.; Park, J. CNN Based Multi-Object Segmentation and Feature Fusion for Scene Classification. J. Med. Imaging Health Inform. 2022, 73, 4657–4675. [Google Scholar]
  20. Du, D.; Wang, L.; Wang, H.; Zhao, K.; Wu, G. Translate-to-Recognize Networks for RGB-D Scene Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  21. Lee, H.; Jo, Y.; Hong, I.; Park, S. MRNet: Multifaceted Resilient Networks for Medical Image-Text Retrieval. arXiv 2024, arXiv:2412.03039. [Google Scholar]
  22. Zhang, Y.; Huang, X.; Ma, J.; Li, Z.; Luo, Z.; Xie, Y.; Qin, Y.; Luo, T.; Li, Y.; Liu, S.; et al. Recognize Anything: A Strong Image Tagging Model. In Proceedings of the CVPR 2024 Workshops, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  23. Zhang, Y. Recognize-Anything: Open-Source and Strong Foundation Model for Image Tagging. 2023. Available online: https://github.com/xinyu1205/recognize-anything (accessed on 1 March 2025).
  24. Pan, X.; Ye, T.; Han, D.; Song, S.; Huang, G. Contrastive Language-Image Pre-Training with Knowledge Graphs. arXiv 2022, arXiv:2210.08901. [Google Scholar]
  25. Bose, S.; Hebbar, R.; Somandepalli, K.; Zhang, H.; Cui, Y.; Cole-McLaughlin, K.; Wang, H.; Narayanan, S. MovieCLIP: Visual Scene Recognition in Movies. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023. [Google Scholar]
  26. Elhenawy, M.; Ashqar, H.I.; Rakotonirainy, A.; Alhadidi, T.I.; Jaber, A.; Tami, M.A. Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding. Electronics 2025, 14, 1282. [Google Scholar] [CrossRef]
  27. Zhao, S.; Wang, X.; Zhu, L.; Yang, Y. CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model. arXiv 2023, arXiv:2305.14014. [Google Scholar] [CrossRef] [PubMed]
  28. Zhao, H.; Qi, L.; Geng, X. CILP-FGDI: Exploiting Vision-Language Model for Fine-Grained Domain Identification. arXiv 2025, arXiv:2501.16065. [Google Scholar]
  29. Yang, J.; Wang, M.; Zhou, H.; Zhao, C.; Zhang, W.; Yu, Y.; Li, L. Towards making the most of BERT in neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9378–9385. [Google Scholar]
  30. Cenk Eser, M.; Medeiros, E.S.; Riza, M.; Engel, M. Dynamic link switching induces stable synchronized states in sparse networks. arXiv 2025, arXiv:2507.08007. [Google Scholar]
  31. Zafar, S.; Lv, Z.; Zaydi, N.H.; Ibrar, M.; Hu, X. DSMLB: Dynamic switch-migration based load balancing for multicontroller SDN in IoT. Comput. Netw. 2022, 214, 109145. [Google Scholar] [CrossRef]
  32. Chen, Q.; Kenett, Y.N.; Cui, Z.; Takeuchi, H.; Fink, A.; Benedek, M.; Zeitlen, D.C.; Zhuang, K.; Lloyd-Cox, J.; Kawashima, R.; et al. Dynamic switching between brain networks predicts cognitive performance. Commun. Biol. 2025, 8, 54. [Google Scholar]
  33. Xie, L.; Lee, F.; Liu, L.; Kotani, K.; Chen, Q. Scene recognition: A comprehensive survey. Pattern Recognit. 2020, 102, 107205. [Google Scholar] [CrossRef]
  34. Liu, X.; Meng, G.; Pan, C. Scene text detection and recognition with advances in deep learning: A survey. Int. J. Doc. Anal. Recognit. 2019, 22, 143–162. [Google Scholar] [CrossRef]
  35. Sumathi, K.; Kumar, P.; Mahadevaswamy, H.R.; Ujwala, B.S. Optimizing multimodal scene recognition through relevant feature fusion and transfer learning. MethodsX 2025, 14, 103226. [Google Scholar]
  36. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed]
  37. Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492. [Google Scholar]
  38. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  39. Yang, S.; Ramanan, D. Multi-scale recognition with DAG-CNNs. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1215–1223. [Google Scholar]
  40. Jiang, S.; Chen, G.; Song, X.; Liu, L. Deep patch representations with shared codebook for scene classification. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–17. [Google Scholar] [CrossRef]
  41. Xie, G.S.; Zhang, X.Y.; Yan, S.; Liu, C.L. Hybrid CNN and dictionary-based models for scene recognition and domain adaptation. IEEE Trans. Circuits Syst. Video Technol. 2015, 27, 1263–1274. [Google Scholar] [CrossRef]
  42. Guo, S.; Huang, W.; Wang, L.; Qiao, Y. Locally supervised deep hybrid model for scene recognition. IEEE Trans. Image Process. 2016, 26, 808–820. [Google Scholar] [CrossRef]
  43. Liu, Y.; Chen, Q.; Chen, W.; Wassell, I. Dictionary learning inspired deep network for scene recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  44. Wang, Z.; Wang, L.; Wang, Y.; Zhang, B.; Qiao, Y. Weakly supervised patchnets: Describing and aggregating local patches for scene recognition. IEEE Trans. Image Process. 2017, 26, 2028–2041. [Google Scholar] [CrossRef]
  45. Bai, S.; Tang, H.; An, S. Coordinate CNNs and LSTMs to categorize scene images with multi-views and multi-levels of abstraction. Expert Syst. Appl. 2019, 120, 298–309. [Google Scholar] [CrossRef]
  46. Zhao, Z.; Larson, M. From volcano to toyshop: Adaptive discriminative region discovery for scene recognition. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1760–1768. [Google Scholar]
  47. Pan, Y.; Xia, Y.; Shen, D. Foreground fisher vector: Encoding class-relevant foreground to improve image classification. IEEE Trans. Image Process. 2019, 28, 4716–4729. [Google Scholar] [CrossRef]
  48. Dixit, M.; Li, Y.; Vasconcelos, N. Semantic fisher scores for task transfer: Using objects to classify scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 3102–3118. [Google Scholar] [CrossRef]
Figure 1. Living rooms in MIT Indoor 67 dataset [7].
Figure 2. Shetland sheepdogs in ImageNet-1k dataset [8].
Figure 3. Proposed architecture for scene recognition. This design comprises two main branches; one branch employs the Image Model to extract overall scene features, while the other leverages the Recognize-Anything (RAM) model and the CLIP Text Encoder to focus on object-level information within the scene. Dynamic Switch Network: The fully connected layer ensures that the gate vector g has the same dimensionality as both the text and image embeddings, thus enabling the element-wise (⊙) operations between g and each embedding.
Table 1. Classification accuracy with existing models on MIT Indoor 67 and SUN397 datasets. Note that the methods without reference have been extracted from previous work (Lin et al.) [4].

Method | Backbone | MIT Indoor 67 Top-1 Accuracy (%) | SUN397 Top-1 Accuracy (%)
DAG-CNN [39] | VGG | 77.50 | 56.20
Mix-CNN [40] | VGG | 79.63 | 57.47
Hybrid CNNs [41] | VGG | 82.24 | 64.53
LS-DHM [42] | VGG | 83.75 | 67.56
Multiscale CNNs [12] | VGG | 86.04 | 70.17
Dual CNN-DL [43] | VGG | 86.43 | 70.13
VSAD [44] | VGG | 86.20 | 73.00
SDO [13] | VGG | 86.76 | 73.41
MVML-LSTM [45] | VGG | 80.52 | 63.02
Adi-Red [46] | ResNet 50 | - | 73.59
fgFV [47] | ResNet 50 | 85.35 | -
MFAFSNet [48] | ResNet 50 | 88.06 | 73.35
SASceneNet [10] | ResNet 50 | 87.10 | 74.04
MRNet [4] | ResNet 50 | 88.08 | 73.98
Ours | ResNet 50 | 89.28 | 75.18
Table 2. Classification accuracy of the Swin-Tiny model on MIT Indoor 67, Places365, and SUN397 datasets.

Dataset | Swin-Tiny Accuracy (%) | Swin-Tiny + Ours Accuracy (%)
MIT Indoor 67 | 80.66 | 85.98
Places365 | 54.81 | 57.39
SUN397 | 63.14 | 66.15
Table 3. Performance comparison between the proposed method (Image + Text) and the individual modules.

Module | MIT Indoor 67 Top-1 Accuracy (%) | SUN397 Top-1 Accuracy (%)
Image Model | 84.40 | 70.87
Text Model | 79.54 | 61.26
Proposed Method (Image + Text) | 89.28 | 75.18