1. Introduction
With the rapid development of machine vision and the expansion of information technology, image and video data are increasingly being utilized across various domains. Modern life, work, and learning have become highly dependent on visual information. Compared to textual data, images offer greater intuitiveness and stronger expressive power [
1]. A typical image often contains a scene composed of multiple objects. Understanding an image involves identifying these objects to perceive the overall environment. Therefore, scene recognition and object detection are closely related and represent key research directions in the field of machine vision.
Scene recognition refers to the use of computational methods to analyze and interpret visual content, extracting object categories and their spatial positions to understand the scene context. Scenes can be broadly categorized into natural and artificial environments. Natural scenes include both indoor and outdoor settings formed through natural processes, while artificial scenes involve man-made elements such as people, vehicles, buildings, and infrastructure [
2]. In natural scenes, object detection and scene recognition technologies help extract semantic information more effectively, supporting intelligent image classification and retrieval. In artificial scenes, these technologies enable fast identification of objects and environmental contexts, with wide applications in intelligent surveillance systems and autonomous driving.
Indoor scene understanding plays a crucial role in scene classification and is widely applied in smart homes, intelligent image retrieval, indoor monitoring, and service robotics. In smart home systems, these technologies monitor device status for automated control, enhancing user experience—for instance, by adjusting lighting or air conditioning based on room conditions [
3]. In image classification tasks, object detection and scene recognition techniques allow for the automatic extraction of object categories and locations, replacing manual classification, improving efficiency, and reducing labor costs [
4]. In video surveillance, they enable real-time detection of specific objects and events, allowing for immediate alerts and responses, thereby increasing monitoring effectiveness while reducing human intervention. In intelligent service robots, accurate indoor object detection and scene recognition enable better environmental awareness, helping robots perform complex tasks like object grasping and obstacle avoidance, thus improving task completion efficiency. Therefore, developing robust indoor object detection and scene recognition algorithms holds significant research value [
5].
In recent years, researchers around the world have proposed many indoor object detection and scene recognition algorithms, achieving promising results on benchmark datasets such as Places, ADE20K, SUN397, and MIT67. However, in practical scenarios, there often exists a gap between training data and real-world data. Moreover, the complexity of indoor environments—such as mutual occlusion among objects, overlapping backgrounds, blurred images, low-light conditions, and varying camera angles—can significantly affect recognition performance. Therefore, further research is needed to develop algorithms that not only detect objects accurately and recognize scenes efficiently but also maintain strong classification stability and adaptability under real-world conditions.
2. Related Work
Object detection is a key research area in computer vision, consisting of feature extraction and object recognition. Feature extraction involves capturing significant image features from images or videos, while object recognition involves detecting and classifying these features to perceive their semantic and location information. In 2017, an article on Deconvolutional Single Shot Detector (DSSD) [
1] was published at CVPR, which introduced a deconvolution module to the traditional SSD object detection algorithm to enhance the expressive power of shallow feature maps. Feature Fusion Single Shot Multi-box Detector (FSSD) [
2] introduced a lightweight feature fusion module to the traditional SSD algorithm, effectively improving its ability to detect objects of different scales. However, SSD-based algorithms still suffer from large parameter sizes and poor detection performance for small objects. To address these issues, Kang et al. [
3] proposed an alignment matching strategy that combines aspect ratio and center point distance to enhance the SSD algorithm’s ability to detect small objects. Deng et al. [
4] introduced the FPN structure into the SSD algorithm and incorporated an attention mechanism to enrich the semantic information in feature maps. Huo et al. [
5] proposed a small object detection algorithm based on self-attention and feature fusion, combining the SSD algorithm with the EfficientNetV2 network and introducing a CSP-PAN topology to ensure that feature maps contain both low-level object details and high-level semantic features, improving detection accuracy. Qian et al. [
6] proposed an end-to-end feature fusion and feature enhancement object detection algorithm, adding five convolutional layers to generate feature maps of different sizes and using max-pooling and upsampling feature fusion modules to create a new feature pyramid. They also introduced a feature enhancement module to expand the receptive field of output feature maps, incorporating more contextual information to further enhance the model’s feature representation capabilities.
To reduce the parameter size and computational cost of traditional SSD algorithms, many studies have replaced the backbone network of SSD with the more lightweight MobileNet series. Wang et al. [
7] proposed a lightweight object detection algorithm based on depthwise separable convolution (DSC), using MobileNetV2 as the backbone feature extraction network for SSD. They introduced an upsampling feature fusion module and a local–global feature extraction module, added five additional feature layers, and used an improved weighted bidirectional feature pyramid network (BiFPN) for feature fusion, effectively reducing model complexity and improving small object detection speed. Liu et al. [
8] removed the lung regions from the class activation maps of chest X-rays using an improved U-Net model to provide visualization; by focusing on the model’s weights within the regions of concern, the interpretability and credibility of AI-assisted diagnosis were improved. Pichai et al. [9] improved the U-Net architecture, taking the Dice coefficient, binary cross-entropy, and accuracy as loss functions; for the classification of barren and dense forest images, the proposed model achieved an accuracy of 82.51%.
Since the first workshop on scene understanding, the concepts of scene description and scene understanding have been clearly defined, highlighting the forward-looking and innovative nature of scene recognition research. Consequently, scholars and experts have conducted extensive research on scene recognition algorithms, integrating them with algorithms from other fields, leading to the emergence of numerous scene recognition algorithms. Traditional algorithms and deep learning-based algorithms are the two main components of current scene recognition research.
- (1)
Traditional Scene Recognition Algorithms
Scene recognition involves assigning scene category labels to images based on object features to perceive the scene. Traditional scene recognition algorithms use manually designed feature operators to recognize and classify scenes based on low-level features such as shape, texture, and color, including Scale Invariant Feature Transform (SIFT) [
10], GIST [
11], Histogram of Oriented Gradient (HOG) [
12], and CENTRIST [
13]. SIFT maintains feature invariance under image rotation, scaling, brightness changes, viewpoint changes, and affine transformations. Its advantage lies in producing stable local features and generating a large number of features even from a small number of samples, which facilitates fusion with other feature vectors. GIST, also known as the spatial envelope feature, requires no image preprocessing or local feature extraction; it perceives scenes from global features, enabling rapid scene recognition and classification. HOG describes features in densely overlapping local regions of an image using local gradient histograms, which makes it robust to lighting changes and small image shifts and provides a good description of object edge information. These feature operators are simple to implement but have low detection accuracy. Bag of Visual Words (BOVW) [
14], also known as the bag-of-words model for scene recognition, encodes images by transforming prominent features into text information and integrating them into text combinations, classifying images based on the frequency of text occurrences. Spatial Pyramid Matching (SPM) [
15] is a spatial pyramid-based image recognition and classification algorithm that perceives local image information by statistically analyzing feature point distributions at different resolutions. Compared to the BOVW model, the Fisher Vector (FV) [
16] model is more sensitive to deep semantic information, generating a probability dictionary and calculating mean, variance, and weights to provide a richer description of image features. These methods improve recognition accuracy but are slow, failing to meet practical detection requirements.
Scene images contain complex and diverse objects, making it difficult for single features to effectively describe the entire image. Therefore, many experts and scholars have proposed feature fusion methods to combine single features for better scene perception. Bai [
17] fused SIFT features with Local Binary Pattern (LBP) features, creating encoded text based on SIFT and LBP features in images and organizing the features into a two-dimensional table to recognize and classify scenes based on each feature’s encoded value, significantly improving scene recognition performance. Ding et al. [
18] proposed a scene classification algorithm based on multi-feature fusion and deep belief networks, first extracting color, texture, and shape features from images and fusing these basic features, then using the fused information as input data for a deep belief network model to train samples and achieve scene classification. Shrinivasa et al. [
19] proposed a global image representation method for scene recognition, representing scene images with local transform histograms, an extended version of basic transform histograms. Local transforms include local difference sign and magnitude information, and due to strong constraints between adjacent transform values, histograms and spatial pyramids can be fused to capture global structural information, thereby perceiving the scene in images.
- (2)
Deep Learning-Based Scene Recognition Algorithms
With the continuous development of neural network technology, various convolutional neural network models have emerged. Meanwhile, deep learning-based scene recognition methods have been increasing. Research shows that compared to traditional scene recognition algorithms, deep learning-based scene recognition algorithms have significant advantages in both accuracy and speed. Early deep learning-based scene recognition algorithms combined deep learning theory with visual bag-of-words, using deep learning models to extract features and then creating text encodings to represent images. Jiang et al. [
20] proposed the concept of shared encoding, organically integrating deep learning, conceptual features, and local feature encoding techniques to fuse original features with shared encodings, generating more comprehensive image classification features that can adaptively extract features for different scene classification tasks. Yee et al. [
15] used mid-level convolutional layer image features for scene recognition. When scene layouts varied significantly, structured convolutional activations were transformed into another high-resolution feature space, where the information not only included scene dataset categories but also encoded general scene category features present in indoor scenes, thereby improving algorithm performance. Wang et al. [
21] proposed a scene recognition method that combines traditional local feature encoding with convolutional neural networks, leveraging the detection capabilities of convolutional neural networks and the expressive power of local features. They used supervised learning for feature extraction and mapped local features to global features using PatchNet’s semantic probabilities to achieve scene recognition.
When the human brain observes an image, it typically prioritizes information of interest. Some experts and scholars have leveraged this characteristic to propose a series of scene recognition algorithms based on salient objects, achieving good accuracy. Khan et al. [
22] proposed an algorithm based on mid-level image saliency features, using convolutional neural networks to extract salient features of objects and employing Lasso [
23] regularization to distinguish scene categories corresponding to different features. López et al. [
24] proposed a hybrid representation method using convolutional neural networks and descriptive encoding, with image patches and scene labels as input and output, achieving good results on MIT67 and SUN397. K.P. Ajitha Gladis et al. [
25] proposed a video-based object detection model to assist visually impaired people; the method uses a new squeeze-and-excitation attention YOLO network built on an adaptive spatial pyramid to detect targets. Wang et al. [
26] proposed an improved YOLOv8 algorithm (YOLOv8-CBW3). First, the GAM global attention mechanism is embedded into the YOLOv8 backbone network to enhance its sensitivity to important feature information and reduce its focus on irrelevant features; after comparing and analyzing the GAM, CBAM, SE, and ECA attention mechanisms, GAM was selected for embedding into the backbone. Second, at the neck layer, the original PAN-FPN structure is replaced with a BiFPN structure to achieve bidirectional cross-scale connections and weighted feature fusion.
In summary, traditional scene recognition algorithms are simple in principle but generally suffer from low detection accuracy and slow recognition speed. With the emergence of deep learning, scene recognition has made significant progress. However, as the number of images has grown dramatically, scene categories have multiplied and scene images have become more complex, with densely distributed objects. Extensive experiments have shown that existing scene recognition algorithms are structurally complex and lack stability, with deficiencies in accuracy and speed in practical applications. Therefore, object detection and scene recognition algorithms that balance detection accuracy and speed while maintaining classification stability remain a research challenge. Indoor scenes are complex, with dim lighting, a wide variety of objects, and frequent mutual occlusion. Research on indoor object detection and scene recognition algorithms must therefore consider not only how to quickly and accurately extract objects of interest and precisely recognize scenes, but also the algorithm’s stability and adaptability to practical detection requirements. Our main contributions are as follows:
- (1)
To address the issues of complex indoor environments and mutual occlusion of objects that are difficult to detect in indoor object detection, we designed a Mobile-EFSSD-based indoor object detection algorithm. Specific improvements include using an improved MobileNetV3 as the backbone network for the SSD algorithm, replacing the SE attention mechanism with the ECA attention mechanism in the bneck structure, and extracting feature maps of different scales and fusing shallow and deep feature maps using the FPN structure.
- (2)
To address the issues of complex structures and poor classification stability in existing indoor scene recognition algorithms, we designed a naive Bayes-based indoor scene recognition algorithm. Specifically, the indoor object detection results are transformed into text features, and the Apriori algorithm is used to mine association rules between objects; the prior probabilities of objects and strongly associated object sets in scenes are calculated; based on the prior probabilities, the naive Bayes algorithm is used to recognize scenes.
3. Methodology
3.1. Overall Framework
To address the difficulty of detecting mutually occluded objects with the SSD algorithm, this paper proposes a Mobile-EFSSD-based indoor object detection algorithm. Building on the SSD object detection algorithm, an improved MobileNetV3 network is used as the backbone network for feature extraction. The FPN feature fusion module is introduced to fuse seven feature maps of different scales, enhancing the network’s feature representation capabilities. Additionally, the loss function is improved to enhance the algorithm’s stability. The improved network structure is shown in
Figure 1.
The Mobile-EFSSD algorithm uses a modified MobileNetV3 network as the front-end framework. It retains the convolutional layers before bneck15 and removes the last three convolutional and pooling layers. Additionally, four new convolutional layers—extra1, extra2, extra3, and extra4—are added. Feature maps are extracted from bneck6, bneck11, bneck15, extra1, extra2, extra3, and extra4, and fused using the Feature Pyramid Network (FPN) feature fusion module. On the fused feature maps, six bounding boxes of different scales are constructed for each point. Non-Maximum Suppression (NMS) is applied to detect and filter these bounding boxes, eliminating overlapping or incorrect ones. Finally, the Softmax classification and regression algorithms are used to obtain the category and location information of the objects.
3.2. Improvements to the Feature Extraction Module
The improved SSD object detection algorithm uses a modified MobileNetV3 as the feature extraction network. The last three convolutional and pooling layers of MobileNetV3 are removed, and four additional convolutional layers are added. Furthermore, the Squeeze-and-Excitation (SE) attention mechanism in the linear bottleneck inverted residual structure (bneck) is replaced with the Efficient Channel Attention (ECA) attention mechanism.
The essence of the linear bottleneck inverted residual structure is the use of a linear activation function in the inverted residual structure. The bneck structure is the core component of the MobileNetV3 network. Unlike the bneck structure in MobileNetV2, MobileNetV3 introduces a lightweight attention mechanism, SE, into the bneck.
In the bneck structure, the feature map first undergoes dimensionality reduction through a 1 × 1 standard convolution, followed by feature extraction using a 3 × 3 depthwise convolution. It is then connected to the SE attention mechanism to obtain a weighted feature matrix. Finally, the weight matrix is multiplied channel-wise with the original feature map, and a 1 × 1 convolution is used to restore the original feature map size, resulting in a weighted feature map. The structure of the bneck is illustrated in
Figure 2.
The Squeeze-and-Excitation (SE) attention mechanism first performs a pooling operation on the image features and then processes them through two fully connected layers. Finally, the weights are computed by a Sigmoid function and multiplied channel-wise with the original feature map to obtain the output. The SE attention mechanism controls the number of parameters through a dimensionality-reduction operation; however, this weakens direct communication between channels and thereby affects the accuracy of the channel weights. Furthermore, the two consecutive fully connected layers in the SE attention mechanism also affect detection speed to a certain extent.
The Efficient Channel Attention (ECA) mechanism first processes the image features through global average pooling. It then applies a lightweight one-dimensional convolution with an adaptively sized kernel, which acts like a fully connected layer restricted to local cross-channel interaction. Finally, the feature weight matrix is obtained using the Sigmoid activation function and multiplied channel-wise with the original feature map to produce the output. Because ECA replaces the fully connected layers with this local cross-channel interaction, it avoids the negative effects of dimensionality reduction, reducing parameters while increasing speed. Therefore, in this paper, the SE attention mechanism in the bneck is replaced with the ECA attention mechanism. The improved structure of the bneck is illustrated in
Figure 3.
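To make the channel-attention replacement concrete, the following is a minimal PyTorch-style sketch of an ECA block as described above: global average pooling, a one-dimensional convolution over the channel descriptor, and a Sigmoid gate. The class name and the fixed kernel size of 3 are illustrative assumptions, not the exact configuration used in Mobile-EFSSD (the original ECA design selects the kernel size adaptively from the channel count).

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel re-weighting without dimensionality reduction.

    A minimal sketch; kernel_size=3 is an assumed setting rather than the
    value necessarily used in Mobile-EFSSD.
    """
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # A 1D convolution over the pooled channel descriptor replaces the two
        # fully connected layers used by SE, avoiding dimensionality reduction.
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))              # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))       # local cross-channel interaction -> (B, 1, C)
        w = self.sigmoid(y).view(b, c, 1, 1)
        return x * w                        # channel-wise re-weighting
```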
3.3. Introduction of the Feature Fusion Module
In convolutional neural networks, high-level layers with low resolution contain more semantic features, while low-level layers with high resolution contain more geometric features. The traditional SSD (Single Shot MultiBox Detector) object detection algorithm uses a pyramidal feature hierarchy, extracting feature maps from different convolutional layers for detection and classification, enabling the recognition of objects at various scales. However, the traditional SSD algorithm extracts few shallow feature maps, so the features extracted from shallow layers are not rich enough, which is detrimental to detecting mutually occluded objects. In indoor object detection, because of variations in image size and shooting angle, mutual occlusion among objects in images is common. Therefore, this paper introduces the FPN (Feature Pyramid Network) feature fusion module after the feature extraction module of the improved SSD algorithm to fuse shallow and deep feature maps, thereby improving the accuracy of detecting mutually occluded objects. The principle of the FPN feature fusion module is illustrated in
Figure 4.
As shown in
Figure 4, the detection steps of the FPN feature fusion module are as follows: First, a 2× upsampling operation is applied to Layer 3. Second, a 1 × 1 2D convolution is used to reduce the dimensionality of Layer 2. Finally, the two are added channel-wise to obtain Layer 5, achieving the feature fusion operation. Similarly, Layer 6 is obtained by fusing Layer 5 and Layer 1.
We fuse the output feature layers of bneck6, bneck11, bneck15, extra1, extra2, extra3, and extra4 to obtain six feature maps with multi-scale feature weights. This enhances the detection accuracy for mutually occluded objects.
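As an illustration of the fusion step described above (2× upsampling of the deeper map, a 1 × 1 convolution on the shallower map, and element-wise addition), the sketch below uses hypothetical layer names and channel counts; it is not the exact Mobile-EFSSD configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Fuse a deep, low-resolution map with a shallow, high-resolution map (FPN style)."""
    def __init__(self, shallow_channels: int, out_channels: int):
        super().__init__()
        # The 1x1 convolution aligns the shallow map's channel count with the deep map.
        self.lateral = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(deep, scale_factor=2, mode="nearest")  # 2x upsampling
        return up + self.lateral(shallow)                         # element-wise addition

# Example with assumed shapes for "Layer 3" (deep) and "Layer 2" (shallow) from Figure 4.
layer3 = torch.randn(1, 256, 10, 10)
layer2 = torch.randn(1, 128, 20, 20)
layer5 = FPNFuse(shallow_channels=128, out_channels=256)(layer3, layer2)  # -> (1, 256, 20, 20)
```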
3.4. Improvement of the Loss Function
The loss function is typically used to quantify the error between the ground truth and the predicted values, reducing the gap between the true results and the predicted results, and enhancing the accuracy of the model’s predictions. The loss function of the traditional SSD object detection algorithm consists of two parts: localization loss and classification loss, as shown in Equation (1). The localization loss uses the smooth L1 norm loss, which represents the average error magnitude using the absolute difference between the ground truth and the predicted values, as shown in Equations (2) and (3). The classification loss uses the cross-entropy loss function, where the magnitude of the cross-entropy value reflects the quality of the model, as shown in Equations (4) and (5).
where Np denotes the number of prior boxes matched with ground-truth boxes; e takes the value 0 or 1, indicating whether an anchor box is matched with a ground-truth box; c denotes the predicted category confidence; l denotes the predicted position of the bounding box corresponding to the prior box; g denotes the positional parameters of the ground-truth box; p ∈ [0, 10] denotes the object category index; i = 1, 2, …, n, where n is the number of prior boxes; and j = 1, 2, …, m, where m is the number of ground-truth boxes.
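Since Equations (1)–(5) are not reproduced here, the following LaTeX restates the loss in the standard SSD formulation, which the definitions above appear to follow; the notation (x for the matching indicator, Np for the number of matched prior boxes) comes from the original SSD paper and may differ slightly from this paper’s symbols.

```latex
% Standard SSD objective (original SSD formulation; notation may differ from Equations (1)-(5)).
L(x, c, l, g) = \frac{1}{N_p}\Bigl( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \Bigr)

L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\bigl(l_i^{m} - \hat{g}_j^{m}\bigr),
\qquad
\mathrm{smooth}_{L1}(z) =
\begin{cases}
0.5\, z^{2}, & |z| < 1 \\
|z| - 0.5, & \text{otherwise}
\end{cases}

L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}
```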
The L1 norm loss exhibits instability because its derivative is discontinuous at the origin. The cross-entropy loss function, which depends on the stability of the model, is prone to misjudgment in detection tasks. To address these issues, this paper improves the loss function by incorporating a hyperparameter into the smooth L1 norm loss and adopting the focal loss function for the category loss. The refined loss function is presented in Equations (6) and (7).
where the matching indicator is determined by Equation (5); i denotes the index of the anchor box, j the index of the ground-truth box, and p the object category, taking values in the interval [0, 10]. The hyperparameters are set as τ = 0.2 and γ = 0.5.
The enhanced localization loss function, augmented with hyperparameters, bolsters the stability of the algorithm. The confidence loss, employing the focal loss function, mitigates the incidence of misdetection, thereby substantially fortifying the algorithm’s robustness.
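For reference, the sketch below shows one way to implement the two components just described: a smooth L1 localization loss scaled by a hyperparameter and a focal confidence loss. The exact form of Equations (6) and (7), and how τ = 0.2 and γ = 0.5 enter them, is not reproduced above, so the roles assigned to tau and gamma here (a scale factor and the focusing exponent) are assumptions, and alpha is an additional balancing factor not mentioned in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=0.5, alpha=0.25):
    """Focal loss for classification: down-weights easy examples.
    gamma=0.5 follows the value reported above; alpha is an assumed balancing factor."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                       # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

def weighted_smooth_l1(pred_boxes, gt_boxes, tau=0.2):
    """Smooth L1 localization loss scaled by a hyperparameter tau (assumed usage of tau=0.2)."""
    return tau * F.smooth_l1_loss(pred_boxes, gt_boxes, reduction="mean")

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes):
    """Combined objective over matched prior boxes (illustrative only)."""
    return focal_loss(cls_logits, cls_targets) + weighted_smooth_l1(pred_boxes, gt_boxes)
```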
3.5. Establishment of Indoor Scene Association Rules
Indoor scene images contain a wide variety of objects. If a single object is used as the image feature and input into the naive Bayes classification algorithm for scene recognition, the recognition accuracy will be low, making it unsuitable for practical detection requirements. The Apriori algorithm first traverses the dataset to gradually generate candidate item sets. It then filters the candidate item sets using a predefined support threshold to generate frequent item sets. Finally, it calculates the confidence of each frequent item set and compares it with a predefined threshold to derive association rules between objects. The Apriori algorithm has a simple structure and can handle large-scale datasets. Therefore, this paper uses the Apriori algorithm to mine association rules between objects in indoor scene images, which are then used as features of the scene images and input into the naive Bayes classification algorithm to determine the scene category.
Indoor scenes are divided into six categories: bedroom, bathroom, dining room, kitchen, living room, and office. Indoor objects are divided into ten categories: storage cabinet, chair, table, sofa, vase, toilet, door, refrigerator, washing machine, and bed. To mine association rules between objects with the Apriori algorithm, minimum thresholds for support and confidence must be set before running the algorithm. Given the large volume of experimental data in this study, multiple experiments were conducted; the results were optimal with the support threshold set to 40% and the confidence threshold set to 60%. The specific steps are as follows:
(1) The dataset is read, encompassing the categories of scenes and objects, and the scene transaction set and the object feature item set are defined. The scene set and the object feature set are denoted by Y = {y1, y2, …, yj, …, y6} and X = {a1, a2, …, ai, …, a10}, respectively.
(2) The dataset is traversed to establish the candidate item set Cp.
(3) The support degree for each candidate item set is calculated, with the computational formula presented as Equation (8).
where am and an denote the object features in the candidate item set. From this, the formula for calculating the support of am with respect to an can be derived, as shown in Equation (9), where NC denotes the total number of candidate item sets. The specific formula is shown in Equation (10).
(4) Filter the set of candidate items according to the set support threshold to form the frequent item set Lq.
(5) Calculate the confidence level of each frequent item set, and the calculation formula is shown in Equation (11).
where am and an denote the object features in the candidate item set, and q is the number of frequent item sets. From this, the formula for calculating the confidence of am with respect to an can be derived, as shown in Equation (12), where NC denotes the total number of candidate item sets. The specific formula is shown in Equation (13), where am and an again denote the object features in the candidate item set.
(6) Filter the frequent item set according to the set confidence threshold and generate association rules.
Some of the association rules established using the Apriori algorithm are shown in
Table 1.
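To make steps (1)–(6) above concrete, the following sketch mines pairwise association rules from per-image object sets using the 40% support and 60% confidence thresholds; restricting candidates to item pairs and the data layout (one set of detected object labels per image) are simplifying assumptions, not the paper’s implementation.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Mine pairwise association rules (a_m -> a_n) from per-image object sets."""
    n = len(transactions)

    def support(itemset):
        # Fraction of images whose detected objects contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / n

    items = {obj for t in transactions for obj in t}
    rules = []
    for a, b in combinations(sorted(items), 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue                                   # not a frequent item set
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Example with hypothetical detections from three images.
images = [{"table", "chair", "sofa"}, {"table", "chair"}, {"bed", "storage cabinet"}]
print(mine_rules(images))
```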
3.6. Naive Bayes Indoor Scene Recognition Algorithm
The naive Bayes indoor scene recognition algorithm uses the objects in an image, together with the mined association rules, as scene features: the indoor object detection algorithm provides the object categories, the Apriori algorithm mines the association rules among them, and the naive Bayes classification algorithm recognizes the scene. It can perceive the scene while detecting object categories and locations; the specific detection process is shown in Figure 5.
The detection procedure of the scene recognition algorithm based on naive Bayes is as follows:
- (1)
Generation of Textual Features
The improved SSD algorithm is employed to extract category information of objects within the images, which is subsequently transformed into textual features.
- (2)
Definition of Scene Category Set and Object Feature Set
Consistent with the definition method described in the aforementioned Apriori algorithm, the scene category set is defined as Y = {y1, y2, …, yj, …, y6}, and the object feature set as X = {a1, a2, …, ai, …, a10}.
- (3)
Mining of Association Rules
Utilizing the Apriori algorithm, association rules among objects are mined based on the predefined support and confidence thresholds (support at 40% and confidence at 60%).
- (4)
Calculation of Prior Probabilities
Objects, along with sets of objects that exhibit strong association rules, are treated as features of the scene. The prior probability P(yj) and the conditional probability P(fm|yj) are then calculated, with the computational formulas given in Equations (14) and (15).
where fm denotes a feature of the scene, encompassing both individual objects and sets of objects with strong association rules; the count terms in Equations (14) and (15) are, respectively, the number of samples of scene category yj, the total number of samples NA, the count of feature fm within scene category yj, and the total number of samples containing feature fm; m is the number of features; j = 1, 2, …, 6 indexes the scene categories; and k = 6 is the total number of scene categories.
In practical classification scenarios, the absence of a particular feature in the dataset can result in a classification probability of zero, thereby compromising the detection outcome. To circumvent this issue, we incorporate Laplace smoothing into the naive Bayes classification algorithm. Laplace smoothing operates on the principle of adding one to the count of each feature to estimate the probability of a feature not present in the sample, as delineated by Equations (16) and (17). Given the substantial size of the sample set, the increment of each feature’s count by one exerts a negligible influence on the classification results. Consequently, Laplace smoothing effectively mitigates the zero probability conundrum.
where, as above, the count terms in Equations (16) and (17) are the number of samples of scene category yj, the total number of samples NA, the count of feature fm within scene category yj, and the total number of samples containing feature fm; fm denotes a feature of the scene, including individual objects and sets of objects with strong association rules; m is the number of features; yj is the scene category; j = 1, 2, …, 6 indexes the scene categories; and k = 6 is the total number of scene categories.
- (5)
Derivation of Scene Recognition Results
Based on the prior probabilities, the naive Bayes classification algorithm is utilized to select the maximum value as the scene recognition result, with the computational formula presented as Equation (18).
where fm denotes a feature of the scene, encompassing both individual objects and sets of objects with strong association rules; m represents the number of features; yj signifies the scene category; and j = 1, 2, …, 6 indexes the scene categories.
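Because Equations (16)–(18) are not reproduced here, the following shows a common Laplace-smoothed form of the estimates together with the naive Bayes decision rule, consistent with the description above; the count symbols N_{y_j} (samples of scene y_j) and N_{f_m, y_j} (occurrences of feature f_m within scene y_j) are introduced only for illustration and may not match the paper’s notation.

```latex
% A common Laplace-smoothed form and decision rule (illustrative notation; see lead-in).
P(y_j) = \frac{N_{y_j} + 1}{N_A + k},
\qquad
P(f_m \mid y_j) = \frac{N_{f_m, y_j} + 1}{N_{y_j} + m}

y^{*} = \arg\max_{y_j}\; P(y_j) \prod_{m} P(f_m \mid y_j)
```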
The naive Bayes indoor scene recognition algorithm employs the aforementioned object detection algorithm to acquire object features, utilizes the Apriori algorithm to mine association rules among objects, and identifies scenes through the naive Bayes algorithm. The algorithm’s strengths lie in its capability to handle multi-classification tasks, recognizing scenes while detecting object categories and locations, and its simple structure coupled with high classification stability.
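As a compact end-to-end illustration of steps (1)–(5), the sketch below classifies a scene from detected-object features using Laplace-smoothed counts and an argmax decision. The class name, data layout, and use of log-probabilities are illustrative assumptions rather than the paper’s implementation; in the full pipeline the feature sets would also include the strongly associated object sets mined by the Apriori algorithm.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSceneClassifier:
    """Laplace-smoothed naive Bayes over object/association-rule features (illustrative)."""

    def fit(self, samples):
        # samples: list of (feature_set, scene_label) pairs.
        self.scene_counts = Counter(label for _, label in samples)
        self.feature_counts = defaultdict(Counter)   # scene -> feature -> count
        self.features = set()
        for feats, label in samples:
            self.feature_counts[label].update(feats)
            self.features |= set(feats)
        self.total = len(samples)
        return self

    def predict(self, feats):
        k, m = len(self.scene_counts), len(self.features)
        best_scene, best_score = None, float("-inf")
        for scene, n_scene in self.scene_counts.items():
            # log P(y_j) with Laplace smoothing over k scene categories.
            score = math.log((n_scene + 1) / (self.total + k))
            for f in feats:
                # log P(f_m | y_j) with Laplace smoothing over m features.
                score += math.log((self.feature_counts[scene][f] + 1) / (n_scene + m))
            if score > best_score:
                best_scene, best_score = scene, score
        return best_scene

# Hypothetical usage: features come from detected objects and mined association rules.
train = [({"bed", "storage cabinet"}, "bedroom"), ({"table", "chair", "sofa"}, "living room")]
print(NaiveBayesSceneClassifier().fit(train).predict({"table", "chair"}))
```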
5. Conclusions
With the rapid development of deep learning and image processing technologies, scene recognition algorithms have emerged in large numbers. Indoor scenes are an important part of scene classification, and indoor scene recognition technology is widely used in fields such as smart homes, video surveillance, intelligent service robots, and image classification and retrieval. However, scene recognition techniques alone are difficult to apply effectively in practical scenarios. For example, intelligent service robots use scene recognition technology to perceive the current environment and object detection technology to identify objects in images, enabling them to execute corresponding commands. Therefore, simultaneously detecting the categories and locations of objects in images while accurately perceiving the scene is a current research hotspot. The work of this paper on indoor object detection and scene recognition is summarized as follows:
(1) A Mobile-EFSSD-based indoor object detection algorithm is designed. The specific improvements are as follows: First, an improved MobileNetV3 is used as the backbone network for the SSD algorithm. Second, the FPN feature fusion module is introduced to fuse high-level and low-level feature maps. Finally, a hyperparameter is added to the localization loss function, and the confidence loss function is replaced with the focal loss function.
(2) A naive Bayes-based indoor scene recognition algorithm is designed. The detection steps of this algorithm are as follows: First, the indoor object detection results described in Section 3.1 are converted into text features. Second, the Apriori algorithm is used to mine association rules between objects in the scene, and the prior probabilities of objects and strongly associated object sets appearing in the scene are calculated. Finally, based on the prior probabilities, the naive Bayes classification algorithm is used to determine the scene category.
(3) Although our primary focus was on improving detection performance in indoor scenes, we carefully monitored the training process to assess potential overfitting. Throughout training, we observed consistent trends between training and validation performance in terms of mAP and loss values, with no significant performance drops from training to validation sets. This suggests that the model generalizes well to unseen data.
(4) In future work, we will actively collect more image samples from real life to further enhance the generalization ability and practicality of the model. We also plan to open-source these expanded datasets to provide more valuable references for the community.
Additionally, the lightweight design of Mobile-EFSSD inherently helps reduce overfitting risk by maintaining a balanced model complexity relative to the dataset size. We also applied common data augmentation techniques during training (e.g., random cropping, color jitter, horizontal flipping), which further improve generalization. While we did not explicitly implement advanced regularization methods such as dropout or weight decay, the overall architecture and training strategy demonstrate stable convergence behavior without signs of severe overfitting.