1. Introduction
With the rapid development of machine vision and the expansion of information technology, image and video data are increasingly being utilized across various domains. Modern life, work, and learning have become highly dependent on visual information. Compared to textual data, images offer greater intuitiveness and stronger expressive power [
1]. A typical image often contains a scene composed of multiple objects. Understanding an image involves identifying these objects to perceive the overall environment. Therefore, scene recognition and object detection are closely related and represent key research directions in the field of machine vision.
Scene recognition refers to the use of computational methods to analyze and interpret visual content, extracting object categories and their spatial positions to understand the scene context. Scenes can be broadly categorized into natural and artificial environments. Natural scenes include both indoor and outdoor settings formed through natural processes, while artificial scenes involve man-made elements such as people, vehicles, buildings, and infrastructure [
2]. In natural scenes, object detection and scene recognition technologies help extract semantic information more effectively, supporting intelligent image classification and retrieval. In artificial scenes, these technologies enable fast identification of objects and environmental contexts, with wide applications in intelligent surveillance systems and autonomous driving.
Indoor scene understanding plays a crucial role in scene classification and is widely applied in smart homes, intelligent image retrieval, indoor monitoring, and service robotics. In smart home systems, these technologies monitor device status for automated control, enhancing user experience—for instance, by adjusting lighting or air conditioning based on room conditions [
3]. In image classification tasks, object detection and scene recognition techniques allow for the automatic extraction of object categories and locations, replacing manual classification, improving efficiency, and reducing labor costs [
4]. In video surveillance, they enable real-time detection of specific objects and events, allowing for immediate alerts and responses, thereby increasing monitoring effectiveness while reducing human intervention. In intelligent service robots, accurate indoor object detection and scene recognition enable better environmental awareness, helping robots perform complex tasks like object grasping and obstacle avoidance, thus improving task completion efficiency. Therefore, developing robust indoor object detection and scene recognition algorithms holds significant research value [
5].
In recent years, researchers around the world have proposed many indoor object detection and scene recognition algorithms, achieving promising results on benchmark datasets such as Places, ADE20K, SUN397, and MIT67. However, in practical scenarios, there often exists a gap between training data and real-world data. Moreover, the complexity of indoor environments—such as mutual occlusion among objects, overlapping backgrounds, blurred images, low-light conditions, and varying camera angles—can significantly affect recognition performance. Therefore, further research is needed to develop algorithms that not only detect objects accurately and recognize scenes efficiently but also maintain strong classification stability and adaptability under real-world conditions.
2. Related Work
Object detection is a key research area in computer vision, consisting of feature extraction and object recognition. Feature extraction involves capturing significant image features from images or videos, while object recognition involves detecting and classifying these features to perceive their semantic and location information. In 2017, an article on Deconvolutional Single Shot Detector (DSSD) [
1] was published at CVPR, which introduced a deconvolution module to the traditional SSD object detection algorithm to enhance the expressive power of shallow feature maps. Feature Fusion Single Shot Multi-box Detector (FSSD) [
2] introduced a lightweight feature fusion module to the traditional SSD algorithm, effectively improving its ability to detect objects of different scales. However, SSD-based algorithms still suffer from large parameter sizes and poor detection performance for small objects. To address these issues, Kang et al. [
3] proposed an alignment matching strategy that combines aspect ratio and center point distance to enhance the SSD algorithm’s ability to detect small objects. Deng et al. [
4] introduced the FPN structure into the SSD algorithm and incorporated an attention mechanism to enrich the semantic information in feature maps. Huo et al. [
5] proposed a small object detection algorithm based on self-attention and feature fusion, combining the SSD algorithm with the EfficientNetV2 network and introducing a CSP-PAN topology to ensure that feature maps contain both low-level object details and high-level semantic features, improving detection accuracy. Qian et al. [
6] proposed an end-to-end feature fusion and feature enhancement object detection algorithm, adding five convolutional layers to generate feature maps of different sizes and using max-pooling and upsampling feature fusion modules to create a new feature pyramid. They also introduced a feature enhancement module to expand the receptive field of output feature maps, incorporating more contextual information to further enhance the model’s feature representation capabilities.
To reduce the parameter size and computational cost of traditional SSD algorithms, many studies have replaced the backbone network of SSD with the more lightweight MobileNet series. Wang et al. [
7] proposed a lightweight object detection algorithm based on depthwise separable convolution (DSC), using MobileNetV2 as the backbone feature extraction network for SSD. They introduced an upsampling feature fusion module and a local–global feature extraction module, added five additional feature layers, and used an improved weighted bidirectional feature pyramid network (BiFPN) for feature fusion, effectively reducing model complexity and improving small object detection speed. Liu et al. [
8] removed the lung regions from the class activation maps of chest X-rays using an improved U-Net model to provide visualization; by focusing on the model’s weights within the regions of concern, the interpretability and credibility of AI-assisted diagnosis were improved. Pichai et al. [9] improved the U-Net architecture, taking the Dice coefficient, binary cross-entropy, and accuracy as loss functions; for the classification of barren and dense forest images, the proposed model achieved an accuracy of 82.51%.
Since the first workshop on scene understanding, the concepts of scene description and scene understanding have been clearly defined, highlighting the forward-looking and innovative nature of scene recognition research. Consequently, scholars and experts have conducted extensive research on scene recognition algorithms, integrating them with algorithms from other fields, leading to the emergence of numerous scene recognition algorithms. Traditional algorithms and deep learning-based algorithms are the two main components of current scene recognition research.
- (1)
Traditional Scene Recognition Algorithms
Scene recognition involves assigning scene category labels to images based on object features to perceive the scene. Traditional scene recognition algorithms use manually designed feature operators to recognize and classify scenes based on low-level features such as shape, texture, and color, including Scale Invariant Feature Transform (SIFT) [
10], GIST [
11], Histogram of Oriented Gradient (HOG) [
12], and CENTRIST [
13]. SIFT maintains feature invariance under image rotation, scaling, brightness changes, viewpoint changes, and affine transformations. Its advantage lies in producing stable local features and generating a large number of features even from a small number of samples, which facilitates fusion with other feature vectors. GIST, also known as the spatial envelope feature, requires no image preprocessing or local feature extraction; it perceives scenes from global features, enabling rapid scene recognition and classification. HOG describes features in densely overlapping local regions of an image using local gradient histograms, which makes it robust to lighting changes and small image shifts and provides a good description of object edge information. These feature operators are simple to implement but have low detection accuracy. Bag of Visual Words (BOVW) [
14], also known as the bag-of-words model for scene recognition, encodes images by transforming prominent features into text information and integrating them into text combinations, classifying images based on the frequency of text occurrences. Spatial Pyramid Matching (SPM) [
15] is a spatial pyramid-based image recognition and classification algorithm that perceives local image information by statistically analyzing feature point distributions at different resolutions. Compared to the BOVW model, the Fisher Vector (FV) [
16] model is more sensitive to deep semantic information, generating a probability dictionary and calculating mean, variance, and weights to provide a richer description of image features. These methods improve recognition accuracy but are slow, failing to meet practical detection requirements.
Scene images contain complex and diverse objects, making it difficult for single features to effectively describe the entire image. Therefore, many experts and scholars have proposed feature fusion methods to combine single features for better scene perception. Bai [
17] fused SIFT features with Local Binary Pattern (LBP) features, creating encoded text based on SIFT and LBP features in images and organizing the features into a two-dimensional table to recognize and classify scenes based on each feature’s encoded value, significantly improving scene recognition performance. Ding et al. [
18] proposed a scene classification algorithm based on multi-feature fusion and deep belief networks, first extracting color, texture, and shape features from images and fusing these basic features, then using the fused information as input data for a deep belief network model to train samples and achieve scene classification. Shrinivasa et al. [
19] proposed a global image representation method for scene recognition, representing scene images with local transform histograms, an extended version of basic transform histograms. Local transforms include local difference sign and magnitude information, and due to strong constraints between adjacent transform values, histograms and spatial pyramids can be fused to capture global structural information, thereby perceiving the scene in images.
- (2)
Deep Learning-Based Scene Recognition Algorithms
With the continuous development of neural network technology, various convolutional neural network models have emerged. Meanwhile, deep learning-based scene recognition methods have been increasing. Research shows that compared to traditional scene recognition algorithms, deep learning-based scene recognition algorithms have significant advantages in both accuracy and speed. Early deep learning-based scene recognition algorithms combined deep learning theory with visual bag-of-words, using deep learning models to extract features and then creating text encodings to represent images. Jiang et al. [
20] proposed the concept of shared encoding, organically integrating deep learning, conceptual features, and local feature encoding techniques to fuse original features with shared encodings, generating more comprehensive image classification features that can adaptively extract features for different scene classification tasks. Yee et al. [
15] used mid-level convolutional layer image features for scene recognition. When scene layouts varied significantly, structured convolutional activations were transformed into another high-resolution feature space, where the information not only included scene dataset categories but also encoded general scene category features present in indoor scenes, thereby improving algorithm performance. Wang et al. [
21] proposed a scene recognition method that combines traditional local feature encoding with convolutional neural networks, leveraging the detection capabilities of convolutional neural networks and the expressive power of local features. They used supervised learning for feature extraction and mapped local features to global features using PatchNet’s semantic probabilities to achieve scene recognition.
When the human brain observes an image, it typically prioritizes information of interest. Some experts and scholars have leveraged this characteristic to propose a series of scene recognition algorithms based on salient objects, achieving good accuracy. Khan et al. [
22] proposed an algorithm based on mid-level image saliency features, using convolutional neural networks to extract salient features of objects and employing Lasso [
23] regularization to distinguish scene categories corresponding to different features. López et al. [
24] proposed a hybrid representation method using convolutional neural networks and descriptive encoding, with image patches and scene labels as input and output, achieving good results on MIT67 and SUN397. K.P. Ajitha Gladis et al. [
25] proposed a video-based object detection model to assist visually impaired people; the method uses a new squeeze-and-excitation attention YOLO network built on an adaptive spatial pyramid to detect targets. Wang et al. [
26] proposed an improved YOLOv8 algorithm (YOLOv8-CBW3). First, the GAM global attention mechanism is embedded into the YOLOv8 backbone network to enhance its sensitivity to important feature information and reduce its focus on irrelevant features; after comparing and analyzing the GAM, CBAM, SE, and ECA attention mechanisms, GAM was selected for embedding into the backbone. Second, at the neck layer, the original PAN-FPN structure is replaced with a BiFPN structure to achieve bidirectional cross-scale connections and weighted feature fusion.
In summary, traditional scene recognition algorithms are simple in principle but generally suffer from low detection accuracy and slow recognition speed. With the emergence of deep learning, scene recognition has made significant progress. However, as the number of images has grown dramatically, scene categories have multiplied and scene images have become more complex, with densely distributed objects. Extensive experiments have shown that existing scene recognition algorithms are structurally complex and lack stability, with deficiencies in accuracy and speed in practical applications. Therefore, object detection and scene recognition algorithms that balance detection accuracy and speed while maintaining classification stability remain a research challenge. Indoor scenes are complex, with dim lighting, a wide variety of objects, and frequent mutual occlusion. Research on indoor object detection and scene recognition algorithms must therefore consider not only how to quickly and accurately extract objects of interest and precisely recognize scenes, but also the algorithm’s stability and adaptability to practical detection requirements. Our main contributions are as follows:
- (1)
To address the issues of complex indoor environments and mutual occlusion of objects that are difficult to detect in indoor object detection, we designed a Mobile-EFSSD-based indoor object detection algorithm. Specific improvements include using an improved MobileNetV3 as the backbone network for the SSD algorithm, replacing the SE attention mechanism with the ECA attention mechanism in the bneck structure, and extracting feature maps of different scales and fusing shallow and deep feature maps using the FPN structure.
- (2)
To address the issues of complex structures and poor classification stability in existing indoor scene recognition algorithms, we designed a naive Bayes-based indoor scene recognition algorithm. Specifically, the indoor object detection results are transformed into text features, and the Apriori algorithm is used to mine association rules between objects; the prior probabilities of objects and strongly associated object sets in scenes are calculated; based on the prior probabilities, the naive Bayes algorithm is used to recognize scenes.
3. Methodology
3.1. Overall Framework
To address the difficulty of detecting mutually occluded objects with the SSD algorithm, this paper proposes a Mobile-EFSSD-based indoor object detection algorithm. Building on the SSD object detection algorithm, an improved MobileNetV3 network is used as the backbone network for feature extraction. The FPN feature fusion module is introduced to fuse seven feature maps of different scales, enhancing the network’s feature representation capabilities. Additionally, the loss function is improved to enhance the algorithm’s stability. The improved network structure is shown in
Figure 1.
The Mobile-EFSSD algorithm uses a modified MobileNetV3 network as the front-end framework. It retains the convolutional layers before bneck15 and removes the last three convolutional and pooling layers. Additionally, four new convolutional layers—extra1, extra2, extra3, and extra4—are added. Feature maps are extracted from bneck6, bneck11, bneck15, extra1, extra2, extra3, and extra4, and fused using the Feature Pyramid Network (FPN) feature fusion module. On the fused feature maps, six bounding boxes of different scales are constructed for each point. Non-Maximum Suppression (NMS) is applied to detect and filter these bounding boxes, eliminating overlapping or incorrect ones. Finally, the Softmax classification and regression algorithms are used to obtain the category and location information of the objects.
3.2. Improvements to the Feature Extraction Module
The improved SSD object detection algorithm uses a modified MobileNetV3 as the feature extraction network. The last three convolutional and pooling layers of MobileNetV3 are removed, and four additional convolutional layers are added. Furthermore, the Squeeze-and-Excitation (SE) attention mechanism in the linear bottleneck inverted residual structure (bneck) is replaced with the Efficient Channel Attention (ECA) attention mechanism.
The essence of the linear bottleneck inverted residual structure is the use of a linear activation function in the inverted residual structure. The bneck structure is the core component of the MobileNetV3 network. Unlike the bneck structure in MobileNetV2, MobileNetV3 introduces a lightweight attention mechanism, SE, into the bneck.
In the bneck structure, the feature map first undergoes dimensionality reduction through a 1 × 1 standard convolution, followed by feature extraction using a 3 × 3 depthwise convolution. It is then connected to the SE attention mechanism to obtain a weighted feature matrix. Finally, the weight matrix is multiplied channel-wise with the original feature map, and a 1 × 1 convolution is used to restore the original feature map size, resulting in a weighted feature map. The structure of the bneck is illustrated in
Figure 2.
The Squeeze-and-Excitation (SE) attention mechanism first performs a pooling operation on the image features and then processes them through two fully connected layers. Finally, the weights are computed by a Sigmoid function and multiplied channel-wise with the original feature map to obtain the output. The SE attention mechanism controls the number of parameters through a dimensionality-reduction operation; however, this weakens direct communication between channels and thereby affects the accuracy of the channel weights. Furthermore, the two consecutive fully connected layers in the SE attention mechanism also affect detection speed to a certain extent.
The Efficient Channel Attention (ECA) mechanism first processes the image features through global average pooling. It then applies a lightweight one-dimensional convolution with an adaptively sized kernel, which acts like a fully connected layer restricted to local cross-channel interaction. Finally, the feature weight matrix is obtained using the Sigmoid activation function and multiplied channel-wise with the original feature map to produce the output. Because ECA replaces the fully connected layers with this local cross-channel interaction, it avoids the negative effects of dimensionality reduction, reducing parameters while increasing speed. Therefore, in this paper, the SE attention mechanism in the bneck is replaced with the ECA attention mechanism. The improved structure of the bneck is illustrated in
Figure 3.
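To make the channel-attention replacement concrete, the following is a minimal PyTorch-style sketch of an ECA block as described above: global average pooling, a one-dimensional convolution over the channel descriptor, and a Sigmoid gate. The class name and the fixed kernel size of 3 are illustrative assumptions, not the exact configuration used in Mobile-EFSSD (the original ECA design selects the kernel size adaptively from the channel count).

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel re-weighting without dimensionality reduction.

    A minimal sketch; kernel_size=3 is an assumed setting rather than the
    value necessarily used in Mobile-EFSSD.
    """
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # A 1D convolution over the pooled channel descriptor replaces the two
        # fully connected layers used by SE, avoiding dimensionality reduction.
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))              # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))       # local cross-channel interaction -> (B, 1, C)
        w = self.sigmoid(y).view(b, c, 1, 1)
        return x * w                        # channel-wise re-weighting
```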
3.3. Introduction of the Feature Fusion Module
In convolutional neural networks, high-level layers with low resolution contain more semantic features, while low-level layers with high resolution contain more geometric features. The traditional SSD (Single Shot MultiBox Detector) object detection algorithm uses a pyramidal feature hierarchy, extracting feature maps from different convolutional layers for detection and classification, enabling the recognition of objects at various scales. However, the traditional SSD algorithm extracts few shallow feature maps, so the features extracted from shallow layers are not rich enough, which is detrimental to detecting mutually occluded objects. In indoor object detection, because of variations in image size and shooting angle, mutual occlusion among objects in images is common. Therefore, this paper introduces the FPN (Feature Pyramid Network) feature fusion module after the feature extraction module of the improved SSD algorithm to fuse shallow and deep feature maps, thereby improving the accuracy of detecting mutually occluded objects. The principle of the FPN feature fusion module is illustrated in
Figure 4.
As shown in
Figure 4, the detection steps of the FPN feature fusion module are as follows: First, a 2× upsampling operation is applied to Layer 3. Second, a 1 × 1 2D convolution is used to reduce the dimensionality of Layer 2. Finally, the two are added channel-wise to obtain Layer 5, achieving the feature fusion operation. Similarly, Layer 6 is obtained by fusing Layer 5 and Layer 1.
We fuse the output feature layers of bneck6, bneck11, bneck15, extra1, extra2, extra3, and extra4 to obtain six feature maps with multi-scale feature weights. This enhances the detection accuracy for mutually occluded objects.
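As an illustration of the fusion step described above (2× upsampling of the deeper map, a 1 × 1 convolution on the shallower map, and element-wise addition), the sketch below uses hypothetical layer names and channel counts; it is not the exact Mobile-EFSSD configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """Fuse a deep, low-resolution map with a shallow, high-resolution map (FPN style)."""
    def __init__(self, shallow_channels: int, out_channels: int):
        super().__init__()
        # The 1x1 convolution aligns the shallow map's channel count with the deep map.
        self.lateral = nn.Conv2d(shallow_channels, out_channels, kernel_size=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(deep, scale_factor=2, mode="nearest")  # 2x upsampling
        return up + self.lateral(shallow)                         # element-wise addition

# Example with assumed shapes for "Layer 3" (deep) and "Layer 2" (shallow) from Figure 4.
layer3 = torch.randn(1, 256, 10, 10)
layer2 = torch.randn(1, 128, 20, 20)
layer5 = FPNFuse(shallow_channels=128, out_channels=256)(layer3, layer2)  # -> (1, 256, 20, 20)
```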
3.4. Improvement of the Loss Function
The loss function is typically used to quantify the error between the ground truth and the predicted values, reducing the gap between the true results and the predicted results, and enhancing the accuracy of the model’s predictions. The loss function of the traditional SSD object detection algorithm consists of two parts: localization loss and classification loss, as shown in Equation (1). The localization loss uses the smooth L1 norm loss, which represents the average error magnitude using the absolute difference between the ground truth and the predicted values, as shown in Equations (2) and (3). The classification loss uses the cross-entropy loss function, where the magnitude of the cross-entropy value reflects the quality of the model, as shown in Equations (4) and (5).
where Np denotes the number of prior boxes matched with ground-truth boxes; e takes the value 0 or 1, indicating whether an anchor box is matched with a ground-truth box; c denotes the predicted category confidence; l denotes the predicted position of the bounding box corresponding to the prior box; g denotes the positional parameters of the ground-truth box; p ∈ [0, 10] denotes the object category index; i = 1, 2, …, n, where n is the number of prior boxes; and j = 1, 2, …, m, where m is the number of ground-truth boxes.
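Since Equations (1)–(5) are not reproduced here, the following LaTeX restates the loss in the standard SSD formulation, which the definitions above appear to follow; the notation (x for the matching indicator, Np for the number of matched prior boxes) comes from the original SSD paper and may differ slightly from this paper’s symbols.

```latex
% Standard SSD objective (original SSD formulation; notation may differ from Equations (1)-(5)).
L(x, c, l, g) = \frac{1}{N_p}\Bigl( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \Bigr)

L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\bigl(l_i^{m} - \hat{g}_j^{m}\bigr),
\qquad
\mathrm{smooth}_{L1}(z) =
\begin{cases}
0.5\, z^{2}, & |z| < 1 \\
|z| - 0.5, & \text{otherwise}
\end{cases}

L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}
```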
The L1 norm loss exhibits instability because its derivative is discontinuous at the origin. The cross-entropy loss function, which depends on the stability of the model, is prone to misjudgment in detection tasks. To address these issues, this paper improves the loss function by incorporating a hyperparameter into the smooth L1 norm loss and adopting the focal loss function for the category loss. The refined loss function is presented in Equations (6) and (7).
where the matching indicator is determined by Equation (5); i denotes the index of the anchor box, j the index of the ground-truth box, and p the object category, taking values in the interval [0, 10]. The hyperparameters are set as τ = 0.2 and γ = 0.5.
The enhanced localization loss function, augmented with hyperparameters, bolsters the stability of the algorithm. The confidence loss, employing the focal loss function, mitigates the incidence of misdetection, thereby substantially fortifying the algorithm’s robustness.
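For reference, the sketch below shows one way to implement the two components just described: a smooth L1 localization loss scaled by a hyperparameter and a focal confidence loss. The exact form of Equations (6) and (7), and how τ = 0.2 and γ = 0.5 enter them, is not reproduced above, so the roles assigned to tau and gamma here (a scale factor and the focusing exponent) are assumptions, and alpha is an additional balancing factor not mentioned in the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=0.5, alpha=0.25):
    """Focal loss for classification: down-weights easy examples.
    gamma=0.5 follows the value reported above; alpha is an assumed balancing factor."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                       # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

def weighted_smooth_l1(pred_boxes, gt_boxes, tau=0.2):
    """Smooth L1 localization loss scaled by a hyperparameter tau (assumed usage of tau=0.2)."""
    return tau * F.smooth_l1_loss(pred_boxes, gt_boxes, reduction="mean")

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes):
    """Combined objective over matched prior boxes (illustrative only)."""
    return focal_loss(cls_logits, cls_targets) + weighted_smooth_l1(pred_boxes, gt_boxes)
```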
3.5. Establishment of Indoor Scene Association Rules
Indoor scene images contain a wide variety of objects. If a single object is used as the image feature and input into the naive Bayes classification algorithm for scene recognition, the recognition accuracy will be low, making it unsuitable for practical detection requirements. The Apriori algorithm first traverses the dataset to gradually generate candidate item sets. It then filters the candidate item sets using a predefined support threshold to generate frequent item sets. Finally, it calculates the confidence of each frequent item set and compares it with a predefined threshold to derive association rules between objects. The Apriori algorithm has a simple structure and can handle large-scale datasets. Therefore, this paper uses the Apriori algorithm to mine association rules between objects in indoor scene images, which are then used as features of the scene images and input into the naive Bayes classification algorithm to determine the scene category.
Indoor scenes are divided into six categories: bedroom, bathroom, dining room, kitchen, living room, and office. Indoor objects are divided into ten categories: storage cabinet, chair, table, sofa, vase, toilet, door, refrigerator, washing machine, and bed. To mine association rules between objects with the Apriori algorithm, minimum thresholds for support and confidence must be set before running the algorithm. Given the large volume of experimental data in this study, multiple experiments were conducted; the results were optimal with the support threshold set to 40% and the confidence threshold set to 60%. The specific steps are as follows:
(1) The dataset is read, encompassing the categories of scenes and objects, and the scene transaction set and the object feature item set are defined. The scene set and the object feature set are denoted by Y = {y1, y2, …, yj, …, y6} and X = {a1, a2, …, ai, …, a10}, respectively.
(2) The dataset is traversed to establish the candidate item set Cp.
(3) The support degree for each candidate item set is calculated, with the computational formula presented as Equation (8).
where am and an denote the object features in the candidate item set. From this, the formula for calculating the support of am with respect to an can be derived, as shown in Equation (9), where NC denotes the total number of candidate item sets. The specific formula is shown in Equation (10).
(4) Filter the set of candidate items according to the set support threshold to form the frequent item set Lq.
(5) Calculate the confidence level of each frequent item set, and the calculation formula is shown in Equation (11).
where am and an denote the object features in the candidate item set, and q is the number of frequent item sets. From this, the formula for calculating the confidence of am with respect to an can be derived, as shown in Equation (12), where NC denotes the total number of candidate item sets. The specific formula is shown in Equation (13), where am and an again denote the object features in the candidate item set.
(6) Filter the frequent item set according to the set confidence threshold and generate association rules.
Some of the association rules established using the Apriori algorithm are shown in
Table 1.
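To make steps (1)–(6) above concrete, the following sketch mines pairwise association rules from per-image object sets using the 40% support and 60% confidence thresholds; restricting candidates to item pairs and the data layout (one set of detected object labels per image) are simplifying assumptions, not the paper’s implementation.

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Mine pairwise association rules (a_m -> a_n) from per-image object sets."""
    n = len(transactions)

    def support(itemset):
        # Fraction of images whose detected objects contain every item in `itemset`.
        return sum(itemset <= t for t in transactions) / n

    items = {obj for t in transactions for obj in t}
    rules = []
    for a, b in combinations(sorted(items), 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue                                   # not a frequent item set
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Example with hypothetical detections from three images.
images = [{"table", "chair", "sofa"}, {"table", "chair"}, {"bed", "storage cabinet"}]
print(mine_rules(images))
```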
3.6. Naive Bayes Indoor Scene Recognition Algorithm
The naive Bayes indoor scene recognition algorithm uses the objects in an image, together with the mined association rules, as scene features: the indoor object detection algorithm provides the object categories, the Apriori algorithm mines the association rules among them, and the naive Bayes classification algorithm recognizes the scene. It can perceive the scene while detecting object categories and locations; the specific detection process is shown in Figure 5.
The detection procedure of the scene recognition algorithm based on naive Bayes is as follows:
- (1)
Generation of Textual Features
The improved SSD algorithm is employed to extract category information of objects within the images, which is subsequently transformed into textual features.
- (2)
Definition of Scene Category Set and Object Feature Set
Consistent with the definition method described in the aforementioned Apriori algorithm, the scene category set is defined as Y = {y1, y2, …, yj, …, y6}, and the object feature set as X = {a1, a2, …, ai, …, a10}.
- (3)
Mining of Association Rules
Utilizing the Apriori algorithm, association rules among objects are mined based on the predefined support and confidence thresholds (support at 40% and confidence at 60%).
- (4)
Calculation of Prior Probabilities
Objects, along with sets of objects that exhibit strong association rules, are treated as features of the scene. The prior probability P(yj) and the conditional probability P(fm|yj) are then calculated, with the computational formulas given in Equations (14) and (15).
where fm denotes a feature of the scene, encompassing both individual objects and sets of objects with strong association rules; the count terms in Equations (14) and (15) are, respectively, the number of samples of scene category yj, the total number of samples NA, the count of feature fm within scene category yj, and the total number of samples containing feature fm; m is the number of features; j = 1, 2, …, 6 indexes the scene categories; and k = 6 is the total number of scene categories.
In practical classification scenarios, the absence of a particular feature in the dataset can result in a classification probability of zero, thereby compromising the detection outcome. To circumvent this issue, we incorporate Laplace smoothing into the naive Bayes classification algorithm. Laplace smoothing operates on the principle of adding one to the count of each feature to estimate the probability of a feature not present in the sample, as delineated by Equations (16) and (17). Given the substantial size of the sample set, the increment of each feature’s count by one exerts a negligible influence on the classification results. Consequently, Laplace smoothing effectively mitigates the zero probability conundrum.
where, as above, the count terms in Equations (16) and (17) are the number of samples of scene category yj, the total number of samples NA, the count of feature fm within scene category yj, and the total number of samples containing feature fm; fm denotes a feature of the scene, including individual objects and sets of objects with strong association rules; m is the number of features; yj is the scene category; j = 1, 2, …, 6 indexes the scene categories; and k = 6 is the total number of scene categories.
- (5)
Derivation of Scene Recognition Results
Based on the prior probabilities, the naive Bayes classification algorithm is utilized to select the maximum value as the scene recognition result, with the computational formula presented as Equation (18).
where fm denotes a feature of the scene, encompassing both individual objects and sets of objects with strong association rules; m represents the number of features; yj signifies the scene category; and j = 1, 2, …, 6 indexes the scene categories.
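Because Equations (16)–(18) are not reproduced here, the following shows a common Laplace-smoothed form of the estimates together with the naive Bayes decision rule, consistent with the description above; the count symbols N_{y_j} (samples of scene y_j) and N_{f_m, y_j} (occurrences of feature f_m within scene y_j) are introduced only for illustration and may not match the paper’s notation.

```latex
% A common Laplace-smoothed form and decision rule (illustrative notation; see lead-in).
P(y_j) = \frac{N_{y_j} + 1}{N_A + k},
\qquad
P(f_m \mid y_j) = \frac{N_{f_m, y_j} + 1}{N_{y_j} + m}

y^{*} = \arg\max_{y_j}\; P(y_j) \prod_{m} P(f_m \mid y_j)
```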
The naive Bayes indoor scene recognition algorithm employs the aforementioned object detection algorithm to acquire object features, utilizes the Apriori algorithm to mine association rules among objects, and identifies scenes through the naive Bayes algorithm. The algorithm’s strengths lie in its capability to handle multi-classification tasks, recognizing scenes while detecting object categories and locations, and its simple structure coupled with high classification stability.
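As a compact end-to-end illustration of steps (1)–(5), the sketch below classifies a scene from detected-object features using Laplace-smoothed counts and an argmax decision. The class name, data layout, and use of log-probabilities are illustrative assumptions rather than the paper’s implementation; in the full pipeline the feature sets would also include the strongly associated object sets mined by the Apriori algorithm.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSceneClassifier:
    """Laplace-smoothed naive Bayes over object/association-rule features (illustrative)."""

    def fit(self, samples):
        # samples: list of (feature_set, scene_label) pairs.
        self.scene_counts = Counter(label for _, label in samples)
        self.feature_counts = defaultdict(Counter)   # scene -> feature -> count
        self.features = set()
        for feats, label in samples:
            self.feature_counts[label].update(feats)
            self.features |= set(feats)
        self.total = len(samples)
        return self

    def predict(self, feats):
        k, m = len(self.scene_counts), len(self.features)
        best_scene, best_score = None, float("-inf")
        for scene, n_scene in self.scene_counts.items():
            # log P(y_j) with Laplace smoothing over k scene categories.
            score = math.log((n_scene + 1) / (self.total + k))
            for f in feats:
                # log P(f_m | y_j) with Laplace smoothing over m features.
                score += math.log((self.feature_counts[scene][f] + 1) / (n_scene + m))
            if score > best_score:
                best_scene, best_score = scene, score
        return best_scene

# Hypothetical usage: features come from detected objects and mined association rules.
train = [({"bed", "storage cabinet"}, "bedroom"), ({"table", "chair", "sofa"}, "living room")]
print(NaiveBayesSceneClassifier().fit(train).predict({"table", "chair"}))
```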
5. Conclusions
With the rapid development of deep learning and image processing technologies, scene recognition algorithms have emerged in large numbers. Indoor scenes are an important part of scene classification, and indoor scene recognition technology is widely used in fields such as smart homes, video surveillance, intelligent service robots, and image classification and retrieval. However, scene recognition techniques alone are difficult to apply effectively in practical scenarios. For example, intelligent service robots use scene recognition technology to perceive the current environment and object detection technology to identify objects in images, enabling them to execute corresponding commands. Therefore, simultaneously detecting the categories and locations of objects in images while accurately perceiving the scene is a current research hotspot. The work of this paper on indoor object detection and scene recognition is summarized as follows:
(1) A Mobile-EFSSD-based indoor object detection algorithm is designed. The specific improvements are as follows: First, an improved MobileNetV3 is used as the backbone network for the SSD algorithm. Second, the FPN feature fusion module is introduced to fuse high-level and low-level feature maps. Finally, a hyperparameter is added to the localization loss function, and the confidence loss function is replaced with the focal loss function.
(2) A naive Bayes-based indoor scene recognition algorithm is designed. The detection steps of this algorithm are as follows: First, the indoor object detection results described in Section 3.1 are converted into text features. Second, the Apriori algorithm is used to mine association rules between objects in the scene, and the prior probabilities of objects and strongly associated object sets appearing in the scene are calculated. Finally, based on the prior probabilities, the naive Bayes classification algorithm is used to determine the scene category.
(3) Although our primary focus was on improving detection performance in indoor scenes, we carefully monitored the training process to assess potential overfitting. Throughout training, we observed consistent trends between training and validation performance in terms of mAP and loss values, with no significant performance drops from training to validation sets. This suggests that the model generalizes well to unseen data.
(4) In future work, we will actively collect more image samples from real life to further enhance the generalization ability and practicality of the model. We also plan to open-source these expanded datasets to provide more valuable references for the community.
Additionally, the lightweight design of Mobile-EFSSD inherently helps reduce overfitting risk by maintaining a balanced model complexity relative to the dataset size. We also applied common data augmentation techniques during training (e.g., random cropping, color jitter, horizontal flipping), which further improve generalization. While we did not explicitly implement advanced regularization methods such as dropout or weight decay, the overall architecture and training strategy demonstrate stable convergence behavior without signs of severe overfitting.