
A Novel Statistical Method for Scene Classification Based on Multi-Object Categorization and Logistic Regression

1
Department of Computer Science and Engineering, Air University, E-9, Islamabad 44000, Pakistan
2
Department of Human-Computer Interaction, Hanyang University, Ansan 15588, Korea
*
Author to whom correspondence should be addressed.
Sensors 2020, 20(14), 3871; https://doi.org/10.3390/s20143871
Received: 27 May 2020 / Revised: 24 June 2020 / Accepted: 8 July 2020 / Published: 10 July 2020
(This article belongs to the Section Intelligent Sensors)

Abstract

In recent years, interest in classifying indoor and outdoor scene images has increased due to major developments in visual sensor techniques. Scene classification has been demonstrated to be an efficient method for environmental observation, but it is a challenging task given the complexity of multiple objects in scenery images. These images combine different properties and objects (i.e., color, text, and regions) and are classified on the basis of optimal features. In this paper, an efficient multi-class object categorization method is proposed for the indoor-outdoor scene classification of scenery images using benchmark datasets. We illustrate two improved methods, fuzzy c-means and mean shift algorithms, which infer multiple object segmentation in complex images. Multiple object categorization is achieved through multiple kernel learning (MKL), which considers local descriptors and signatures of regions. The relations between multiple objects are then examined by an intersection over union algorithm. Finally, scene classification is achieved by using Multi-class Logistic Regression (McLR). Experimental evaluation demonstrated that our scene classification method outperforms other conventional methods, especially when dealing with complex images. Our system should be applicable in various domains such as drone targeting, autonomous driving, global positioning systems, robotics and tourist guide applications.
Keywords: adaptive weighted median filter; fuzzy c-mean segmentation; logistic regression; multiple objects categorization; multiple kernel learning; scene classification; visual sensors

1. Introduction

Scene classification uses visual sensor technologies to explore the semantically significant information contained inside an image. Scene classification is the process of assigning categorizing labels to whole scenes based on the visual sensory data of the scene and the structure and relationships between multiple objects presented in the images. Sensors identify two broad categories (i.e., indoor and outdoor) to generally classify different scenes and these are further divided into different sub-categories based on the categories and labels pertaining to the specific multiple objects presented in the images. Visual sensors use the different properties of objects such as their local and global features to classify the whole scene. Scenery images comprise a wide variety of knowledge about the behavior of various objects which have visible features such as borders, corners, and point clouds and these enable us to learn, modify, consider alternative solutions and create new techniques to examine complex scenes. Scene interpretation [1,2] should be capable of accommodating changes in the environment being observed, identifying the vital characteristics of various objects and defining relationships among various objects in order to represent the actual scene behaviors [3,4].
Such scene information requires consistent and accurate object classification, which distinguishes images by evaluating semantic object properties. Object classification has been widely adopted in applications such as smart monitoring and image retrieval. It also offers supplemental knowledge for activity recognition. Unlike object classification, which concentrates only on limited parts of an image, scene classification is the next step, leading to scene recognition and labelling based on such limited object information [5,6]. Many scenes comprise complicated object relationships and, because variations among scenes can be quite subtle, accurate scene classification is a challenging task in the area of pattern matching and recognition.
The main function of scene classification is to recognize all the objects presented in the scene and to describe semantics for the accurate labeling of the whole scene. Much work has been done on multiple object categorization [7] for scene classification, but several challenges still affect the accuracy of object categorization and recognition, such as changes in illumination, the size of objects, view orientation, and occlusion between objects in complex images. Several articles [8] used a place category strategy that presents a more detailed list of the objects, a summary of their spatial correlations and other static features to discriminate scenes, which affects recognition accuracy. Therefore, we propose a novel methodology that combines similar region clustering, textures of objects, local/global descriptors and class distribution probability estimation. Our methodology yields significant performance gains compared to existing methods.
To overcome the challenges encountered in scene classification, we propose a multiple object categorization-based method to perform scene classification of scenery images from benchmark datasets. As the first step, the proposed system preprocesses the images and segments them using two segmentation algorithms: (i) Modified Fast Super-Pixel Based Fuzzy C-Mean Segmentation (MFCS) and (ii) Mean Shift Segmentation (MSS). In the second step, the results of the two algorithms are compared and analyzed. In the third step, we achieve multiple object categorization by evaluating the multiple regions detector and matching the signatures and local descriptors of the image regions. A kernel function is used to obtain an object similarity score. Finally, the Expected Intersection over Union (EIOU) and Multi-class Logistic Regression (McLR) are used for scene classification over challenging datasets. The main contributions of our work are as follows.
  • To the best of our knowledge, this is the first time that signatures of objects, local descriptors and multiple kernel learning for objects categorization and multi-class logistic regression for scene classification have been introduced.
  • Fusing of Geometric and SIFT feature descriptors for objects and scene classification.
  • Accurate multiple region extraction and label indexing of complex scene datasets.
  • Significant improvement in the accuracy of object and scene classification with less computational time compared to other state-of-the-art methods.
Related work is discussed in Section 2. Section 3 illustrates and details the methodology of our proposed scene classification system. Section 4 presents an analysis of our experimental results and a detailed description of the datasets. Section 5 concludes this paper.

2. Related Work

Exploring multiple object locations, their scale, view orientation and the impact of scenery images are challenging tasks in the visual sensors [9,10] field. We have studied the literature in several domains, such as multi-object categorization, object segmentation, labeling and scene classification, in order to establish proper parameters and metrics for our proposed method.

2.1. Object Segmentation

Image segmentation consists of transforming an image into a set of pixel regions represented by a mask or labels. This transformation of an image into sets of pixels (segments) allows only the important segments to be processed. There are numerous techniques for the segmentation of objects. Sezgin et al. [11] categorized thresholding techniques into the following groups: (a) histogram shape-based techniques, (b) clustering-based techniques, (c) entropy-based techniques, (d) object attribute-based techniques, (e) spatial methods, and (f) local methods. Sujji et al. [12] discussed threshold techniques for segmenting an image to detect the contours of tumors in the brain. Bi et al. [13] proposed a segmentation method based on the fusion of motion, color and stereo cues of objects. Yan et al. [14] proposed k-means clustering based on color image enhancement for the segmentation of cells. They computed the gray value components of the R, G, and B distributions to find the mean value of these distributions. Additionally, they used the YCbCr color space to represent the three clusters, obtained by dividing the improved color images. Kamdi et al. [15] explained region growing algorithms for segmentation by comparing their advantages and disadvantages. Moreover, they divided the image into regions of similar pixels by mean and by min-max techniques. In k-means clustering, the number of segments k is fixed in advance and the image is partitioned into k groups, which are formed based on the similarity of color intensity or on the minimum variance from the centroid to the target pixel.

2.2. Single/Multiple Object Categorization

The object categorization field poses many challenges for researchers, such as finding the location of each object, identifying and describing the interactions among objects, identifying occluding objects, and delineating groups for meaningful outcomes. Wong et al. [16] proposed an algorithm for online object detection and classification of the various objects in an image. They suggested fast tracking of all the objects in the scene via kernel learning instead of depending on prior knowledge of a specific object. Their implementation, evaluated on the Neovision2 Tower benchmark dataset, was biologically inspired and determined the shape and movement of an object. Sumbul et al. [17] devised a multisource region attention network that calculates per-source feature representations and assigns attention scores to candidate regions sampled around the expected object positions by utilizing their representations. They used multispectral techniques that achieved accuracies up to 64.2%. Martin et al. [18] designed a Bayesian inference model that examines prior knowledge of each object for multiple object tracking. The model then updates the probability mass function for closer object discrimination and applies a rate of convergence for correct classification. Lecumberry et al. [19] computed a shape similarity measure and used the steepest descent minimization method to iteratively model each object's shape. They used energy optimization for the automatic classification of multiple objects.

2.3. Scene Classification

Similarly, scene classification is a domain that raises new challenges such as complicated scene contents/labels due to major ambiguities [20], similar object properties among different scenes, and multi-instance learning in confusing scenes. Shi et al. [21] proposed a context-based saliency detection algorithm that marks salient regions in images. They used a CNN model to construct feature points and tested it on five datasets, i.e., LabelMe, UIUC-Sports, Scene-15, MIT67, and SUN, producing effective results only with indoor scenes. Zhang et al. [22] proposed the MVFL-VC method along with labeled object categorization algorithms. A mapping function was used to find the correlation with their labels in images. Zhou et al. [23] proposed a simple method for indoor-outdoor scene classification, which included a bag-of-features model to construct multiple resolution images and highlighted it with dense regions. Then, partition modalities were used to produce better results for scene classification.
Hayat et al. [24] introduced an indoor scene categorization method based on large-scale spatial layout, scale variations and rich feature descriptors for multiple distinct objects. In addition, tailored feature representations were learned by a Convolutional Neural Network to support large-scale classification. Zou et al. [25] proposed an effective scene classification approach in which a fusion of local/global spatial features was adopted as a collaborative representation. These features were processed by multiscale completed local binary patterns, Gabor features and SIFT patterns. Finally, they implemented kernel collaborative classification for scene discrimination. Ismail et al. [26] proposed a two-step method for indoor scene classification. Initially, spatial layout estimation was performed to estimate three orthogonal vanishing points, and then the relationships between scene elements were represented by a layout estimation method to retrieve a high scene classification score.

3. Overview of Solution Framework

In this section, we propose a novel scene classification approach along with object categorization that accurately recognizes and labels all target objects presented in the scene. The proposed scene classification system starts with preprocessing and clearing unwanted information such as noise contents and with the normalization of object sizes for all images in the datasets. Then, the extracted data are applied to accurate object segmentation based on two distinct segmentation algorithms: modified fast super-pixel based fuzzy c-means clustering and mean shift segmentation algorithms. Multiple objects categorization is performed by considering multiple kernel learning. Finally, the proposed system achieves scene classification by using the EIOU score and McLR. Figure 1 presents an overview of the proposed scene classification system.

3.1. Preprocessing and Normalization

Images are captured under different conditions, such as varying lighting and environments, which introduce noise and high intensity values (see Figure 2a). To address these issues, an Adaptive Weighted Median Filter (AWMF) [27] is applied during preprocessing. The filter uses an M × N sliding window that moves over each image and weights the pixels using local image statistics. The relative weight $W_{i,j}$ of the pixel (i, j) is calculated as:
$$W_{i,j} = W_0 - \frac{a\,D\,V_{x,y}^{2}}{U_{x,y}}$$
where $W_0$ indicates the weight of the central pixel of the filter frame (i.e., 3 × 3 or 5 × 5), a is the scaling factor used for the scale of the filter frame (i.e., 3 or 5) and D is the Euclidean distance between pixels. $U_{x,y}$ and $V_{x,y}$ are the mean and variance of the M × N sliding window, respectively, and are obtained as follows:
$$U_{x,y} = \frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1} x_{i,j}$$
$$V_{x,y} = \frac{1}{MN-1}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}\left(x_{i,j} - U_{x,y}\right)^{2}$$
Figure 2 demonstrates the preprocessing steps which include both noisy images and filtered images.
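As an illustration of the filtering step, the following is a minimal Python sketch of an adaptive weighted median filter, not the authors' MATLAB implementation. It takes the weight as $W_0 - a\,D\,V^2/U$, measures D from the window center, clamps negative weights to zero and rounds weights to integer repetition counts; the distance convention, the clamping, the rounding, the edge padding and the default parameter values are all assumptions made for this example.

```python
import numpy as np

def adaptive_weighted_median(img, size=3, w0=10.0, a=0.5):
    """Sketch of an adaptive weighted median filter on a 2-D grayscale image.

    size, w0 and a are illustrative defaults, not values taken from the paper.
    """
    img = np.asarray(img, dtype=float)
    r = size // 2
    padded = np.pad(img, r, mode="edge")            # replicate borders (assumption)
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    dist = np.sqrt(yy ** 2 + xx ** 2).ravel()       # D: distance from the window center
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            win = padded[y:y + size, x:x + size].ravel()
            mean = win.mean()                       # U: window mean
            var = win.var(ddof=1)                   # V: sample variance, 1/(MN-1)
            ratio = var ** 2 / mean if mean > 0 else 0.0
            # Weight per pixel; the center (D = 0) always keeps weight w0.
            weights = np.rint(np.clip(w0 - a * dist * ratio, 0, None)).astype(int)
            # Weighted median: repeat each value by its integer weight.
            out[y, x] = np.median(np.repeat(win, weights))
    return out
```

In flat regions the variance is near zero, so all weights stay close to `w0` and the filter behaves like a plain median; in high-variance regions the outer weights shrink and the central pixel dominates, which preserves edges.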

3.2. Single/Multiple Object Segmentation

This section provides a detailed description of single/multiple object segmentation. Object segmentation is a process in which an image is split into multiple regions according to similarities in pixels or colors in a scene. As different scenes contain multiple regions, the delineation of these regions through segmentation is a significant but challenging process in scene classification, and segmentation accuracy greatly influences the accuracy and consistency of scene classification. Images are segmented into multiple regions which are labeled with different colors. Two robust segmentation methods are considered: (i) Modified fast super-pixel based fuzzy c-means clustering image segmentation (MFCS) and (ii) mean shift segmentation (MSS).

3.2.1. Modified Fast Super-pixel Based Fuzzy C-Mean Segmentation (MFCS)

Using the MFCS clustering algorithm, we achieved improved color image segmentation results compared to conventional FCM [28] methods. At the start of the process, overlapping elements are identified and pixels are taken as data points, as in other clustering approaches. Then, following fuzzy logic, each pixel is considered to belong to more than one cluster rather than to just one defined cluster. MFCS segments the image by iteratively minimizing an objective function. In addition, these elements restrict the optimal clusters of images by minimizing the weights within the clusters through a squared error objective function $J_M(U,V)$, which is formulated as:
$$J_M(U,V) = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{r}\,\lVert x_j - v_i \rVert^{2}$$
where c represents the number of clusters, n is the number of data points, r is any real number greater than 1 that controls the fuzziness of the resulting clusters, $u_{ij}$ represents the membership of pixel $x_j$ in the $i$-th cluster and $v_i$ is the cluster center:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c}\left(\dfrac{\lVert x_j - v_i \rVert^{2}}{\lVert x_j - v_k \rVert^{2}}\right)^{\frac{1}{r-1}}}$$
$$u_{ij} \in [0,1], \quad \text{for } i = [1, \dots, c]$$
$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^{r}\, x_j}{\sum_{j=1}^{n} u_{ij}^{r}}$$
$J_M(U,V)$ measures the distance between each pixel and the cluster centers. A pixel is assigned a high membership value when its distance to the cluster center is small. The conventional FCM algorithm works on the local spatial information of pixels, so the neighborhood of every pixel must be analyzed at each iteration, which leads to high computational complexity. Therefore, the proposed algorithm uses super pixel-based pre-segmentation [29] and density-based spatial clustering of applications with noise (DBSCAN) to decrease the computational complexity of conventional FCM. Figure 3 presents the results of super pixel-based pre-segmentation. The proposed method achieved the segmentation of the color image in a few seconds on the MATLAB platform running on a 2.5 GHz Intel Core i5 CPU with 8 GB of RAM (Intel, Santa Clara, CA, USA).
The data points are denoted $x_1, \dots, x_n$, the cluster centers are denoted $v_1, \dots, v_c$, and r (any real number) controls the fuzziness of the resulting clusters. The proposed MFCS, Algorithm 1, is carried out in steps, and the pseudo code of the MFCS algorithm is given as follows:
Algorithm 1. Pseudo code of the MFCS Algorithm
Initialize the c clusters randomly
Calculate the centers v_i of the c clusters
while the objective function J_M(U, V) has not reached its minimum do
  for each data point in the image do
    Step 1. Compute the membership u_ij of the data point in each of the c clusters
    Step 2. Update the cluster centers v_i
  end for
end while
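The loop in Algorithm 1 can be sketched in Python as plain fuzzy c-means on a set of feature vectors. This is only an illustration under stated assumptions: it omits the super-pixel pre-segmentation and DBSCAN stages that distinguish MFCS, it updates all memberships in one vectorized step rather than point by point, and the stopping tolerance `tol`, the iteration cap `max_iter` and the random seed are hypothetical parameters not given in the text.

```python
import numpy as np

def fuzzy_c_means(x, c, r=2.0, tol=1e-5, max_iter=100, seed=0):
    """Plain fuzzy c-means; x has shape (n, d). Returns memberships u (c, n) and centers v (c, d)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    u = rng.random((c, n))
    u /= u.sum(axis=0, keepdims=True)               # memberships of each point sum to 1
    prev = np.inf
    for _ in range(max_iter):
        um = u ** r
        v = (um @ x) / um.sum(axis=1, keepdims=True)            # center update
        d2 = ((x[None, :, :] - v[:, None, :]) ** 2).sum(axis=2)  # squared distances (c, n)
        d2 = np.maximum(d2, 1e-12)                  # avoid division by zero
        inv = d2 ** (-1.0 / (r - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)    # membership update
        obj = float(((u ** r) * d2).sum())          # objective J_M(U, V)
        if abs(prev - obj) < tol:                   # stop once the objective stalls
            break
        prev = obj
    return u, v
```

For an RGB image, `x` could be the pixel colors reshaped to `(height * width, 3)`, and a hard segmentation is obtained with `u.argmax(axis=0)`.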
Figure 4 presents the results of the proposed MFCS algorithm over MSRC dataset.

3.2.2. Mean Shift-Based Segmentation (MSS)

The proposed system also segments an image into multiple regions using the Mean Shift Segmentation [30] algorithm. The MSS algorithm searches for the highest concentration of similar pixels in the sample image by estimating the local density of pixels. MSS performs density estimation iteratively and finds the local maxima (modes) of the density [31], so that each pixel is shifted toward its nearest mode and pixels converging to the same mode form a cluster of similar attributes (see Figure 5). This is a non-parametric clustering technique which does not depend on any prior knowledge of the objects or picture elements. Therefore, it can find cluster centers quickly and perform efficient object segmentation. The proposed system uses kernel density estimation to locate these modes. The kernel density $k_E(x)$ of the window function, estimated in the D-dimensional space $S^D$ for n pixels $x_j$, $j = 1, 2, 3, \dots, n$, at a location x is determined as:
$$k_E(x) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{h_n^{D}}\, k\!\left(\frac{x - x_j}{h_{x_j}}\right)$$
where $h_{x_j}$ is the width of the kernel density (window function), which is determined as:
$$h(x_j) = h \times \left(\frac{1}{d(x_j)}\right)$$
where $d(x_j)$ is the probability density function of the given pixel space and h is a constant. The kernel density (window function) $k(x)$ satisfies the following conditions:
$$\int_{S^{D}} k(x)\,dx = 1$$
$$\int_{S^{D}} x\,k(x)\,dx = 0$$
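The mode-seeking iteration can be sketched in Python as follows. This is a simplified illustration rather than the segmentation pipeline used in the paper: it assumes a Gaussian kernel and a single fixed bandwidth instead of the per-pixel width $h(x_j)$, and the function name, `bandwidth` and `n_iter` are hypothetical.

```python
import numpy as np

def mean_shift_modes(points, bandwidth, n_iter=50):
    """Shift every point toward a mode of a Gaussian kernel density estimate.

    points has shape (n, d). Returns the shifted positions, one per input point.
    """
    pts = np.asarray(points, dtype=float)
    modes = pts.copy()
    for _ in range(n_iter):
        # Squared distances between current positions and all data points.
        d2 = ((modes[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
        w = np.exp(-0.5 * d2 / bandwidth ** 2)       # Gaussian kernel weights
        # Mean shift step: move to the kernel-weighted mean of the data.
        modes = (w @ pts) / w.sum(axis=1, keepdims=True)
    return modes
```

Points whose shifted positions coincide (up to a small tolerance) belong to the same segment. For images, each point would typically combine pixel coordinates and color, which is again an assumption of this sketch.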
The proposed system then compares the results of the MFCS and MSS algorithms in terms of segmentation accuracy against the ground truth and computation time. MFCS takes less computation time and produces clearer results than MSS; therefore, we used the MFCS results for further experiments. Figure 6 compares MFCS and MSS. Segmentation accuracy is evaluated by comparing the pixels of the segmented objects with the given ground truth for all classes of the dataset. Table 1 reports the accuracy of the segmented objects after comparison with the ground truth labels.
Table 2 and Table 3 report the total computational time of the MFCS and MSS algorithms over the MSRC and Corel-10k datasets, respectively.

3.3. Object Categorization

In this section, the proposed system uses the Multiple Kernel Learning (MKL) method [32] to achieve multiple object categorization based on multiple regions and the signatures of those regions in complex scenes. In object categorization, an image j (containing clusters c of multiple objects represented by different colors obtained by the segmentation process) is first described by local descriptors $D_j$ (i.e., SIFT, HOG) computed over a region R of the image. The signature $x_j$ is computed from the local descriptors using a function $f_R : D_j \to x_j$, which is derived as follows:
$$Cen_c = \frac{1}{|c|}\sum_{j}\sum_{i} D_{icj}$$
$$\mu_c = \frac{1}{|c|}\sum_{j}\sum_{i}\left(D_{icj} - Cen_c\right)\left(D_{icj} - Cen_c\right)^{\top}$$
$$\mu_{j,c} = \sum_{i}\left(D_{icj} - Cen_c\right)\left(D_{icj} - Cen_c\right)^{\top} - \mu_c$$
where $Cen_c$ is the center of cluster c, $|c|$ is the total number of descriptors in cluster c over all the images of a class, $D_{icj}$ denotes the descriptors of image j that belong to cluster c, and $\mu_c$ represents the mean over the centered descriptors that belong to cluster c. $\mu_{j,c}$ is the signature of image j for cluster c. Then $\mu_{j,c}$ is converted into a vector $vec_{j,c}$. The signature vector $x_j$ of image j is computed by concatenating $vec_{j,c}$ over all C clusters:
$$vec_j = \left(vec_{j,1}, \dots, vec_{j,C}\right)$$
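A minimal Python sketch of this signature computation is given below. It assumes that descriptors are row vectors, that the products of centered descriptors are outer products (so each per-cluster signature is a d × d matrix before flattening), and that every cluster contains at least one descriptor in the class-level statistics. The function names `cluster_stats` and `image_signature` are hypothetical.

```python
import numpy as np

def cluster_stats(descs, labels, n_clusters):
    """Class-level statistics: cluster centers Cen_c and mean outer products mu_c.

    descs: (N, d) descriptors pooled over all images of a class.
    labels: (N,) cluster index of each descriptor.
    """
    centers, mean_outer = [], []
    for c in range(n_clusters):
        dc = descs[labels == c]
        cen = dc.mean(axis=0)
        diff = dc - cen
        centers.append(cen)
        mean_outer.append(diff.T @ diff / len(dc))   # mean of centered outer products
    return np.array(centers), np.array(mean_outer)

def image_signature(img_descs, img_labels, centers, mean_outer):
    """Signature of one image: per-cluster matrices mu_{j,c}, flattened and concatenated."""
    parts = []
    for c in range(len(centers)):
        diff = img_descs[img_labels == c] - centers[c]
        mu_jc = diff.T @ diff - mean_outer[c]        # sum of outer products minus mu_c
        parts.append(mu_jc.ravel())                  # vec_{j,c}
    return np.concatenate(parts)                     # vec_j
```

With C clusters and d-dimensional descriptors, the resulting signature has C · d² entries, which is why such signatures are usually paired with linear kernels.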
Figure 7 indicates the results of HOG and SIFT descriptors. These descriptors of defined region R are operated using a deformable parts model [33]. It produces multiple regions by drawing rectangular bounding boxes [34] over the images. The proposed system only uses bounding box regions with maximum scores given by the detector. These rectangular bounding boxes are used to indicate the regions of different foreground objects.
After defining accurate regions R of objects within the image, the similarity between images i and j based on the signatures (extracted vectors) of region R is measured using a kernel function $k_R$ as:
$$k_R(i,j) = \left\langle f_R\!\left(D_R^{i}\right), f_R\!\left(D_R^{j}\right) \right\rangle$$
However, an image holds multiple regions, and similarity must be measured over the entire image. Therefore, the proposed system computes the overall similarity as:
$$k(i,j) = \sum_{R} \omega_R\, k_R(i,j)$$
where $\omega_R$ is the weight associated with region R. Figure 8 illustrates the objects categorization method using multiple kernel learning.
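The weighted combination of region kernels can be sketched in Python as below. The sketch assumes that each region kernel is a plain inner product of two signature vectors and that the region weights are already known; in multiple kernel learning the weights are learned together with the classifier, which is not shown here. The function names are hypothetical.

```python
import numpy as np

def region_kernel(sig_i, sig_j):
    """Linear kernel between the signatures of the same region in two images."""
    return float(np.dot(sig_i, sig_j))

def combined_kernel(sigs_i, sigs_j, weights):
    """Weighted sum of per-region kernels between images i and j.

    sigs_i, sigs_j: one signature vector per region, in matching order.
    weights: one non-negative weight per region (assumed given).
    """
    return sum(w * region_kernel(a, b) for w, a, b in zip(weights, sigs_i, sigs_j))

# Toy usage with two regions and two-dimensional signatures.
similarity = combined_kernel([[1, 2], [1, 1]], [[3, 4], [2, 3]], [0.5, 2.0])
```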

3.4. Scene Classification

After multiple object categorization, the labeled information is further used for scene classification. This includes two significant approaches, (1) Expected Intersection over Union (EIOU) [35] score and (2) Multi-class Logistic Regression (McLR) [36]. EIOU is measured for the foreground objects and McLR is used to solve the multi-class classification problem which recognizes scenes in the images.

3.4.1. Expected Intersection over Union score (EIOU)

The EIOU score indicates how accurately the objects and their regions have been predicted. The proposed system assigns an EIOU score to all foreground objects in the images of all scenes, and the scene is classified based on the EIOU of the foreground objects. To evaluate the EIOU function, we use the multiple objects $y_j$, their locations and the predicted objects $\bar{y}_j$. The Expected Intersection over Union $U_{EIOU}$ is obtained as follows:
$$U_{EIOU} = \frac{1}{C}\sum_{c=1}^{C} U_{iou}^{(c)}(\bar{y}, y)$$
where C is the number of classes and $U_{iou}^{(c)}$ is defined as:
$$U_{iou}^{(c)}(\bar{y}, y) = \frac{\sum_{j \in V} \mathbf{1}\{\bar{y}_j = c \wedge y_j = c\}}{\sum_{j \in V} \mathbf{1}\{\bar{y}_j = c \vee y_j = c\}}$$
where $y_j \in \{1, \dots, C\}\ \forall j \in V$ and V is the set of all pixels in all images. $\mathbf{1}\{\cdot\}$ is the indicator function, which gives 1 if its argument is true and 0 otherwise. The ratio of the two pixel sums gives $U_{iou}^{(c)}$, the EIOU score of the objects of class c. The computed EIOU score is shown over the objects as in Figure 9.
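For illustration, the class-averaged intersection over union of two label maps can be computed with the short Python sketch below. It works on hard labels only, so it shows the per-class ratio of pixel counts rather than an expectation under a probabilistic model, and it skips classes that appear in neither map (a choice made here to avoid division by zero, not something stated in the text). The function name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(pred, truth, n_classes):
    """Average per-class IoU between predicted and ground-truth label maps.

    Labels are integers in 1..n_classes; arrays may have any (matching) shape.
    """
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    scores = []
    for c in range(1, n_classes + 1):
        inter = np.sum((pred == c) & (truth == c))   # pixels labeled c in both maps
        union = np.sum((pred == c) | (truth == c))   # pixels labeled c in either map
        if union > 0:                                # skip classes absent from both
            scores.append(inter / union)
    return float(np.mean(scores)) if scores else 0.0
```

For example, with predictions `[1, 1, 2, 2]` and ground truth `[1, 2, 2, 2]`, class 1 scores 1/2 and class 2 scores 2/3.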

3.4.2. Multi-Class Logistic Regression (McLR)

McLR is used for the classification of a whole scene based on multiple objects and their features. When there are multiple classes, McLR predicts the probability that a given feature vector x belongs to each class of the dataset. A classifier is designed to distinguish K classes, $c = 1, 2, \dots, K$, from L labeled training images using their feature vectors as input. The labeled training set is $T_L = \{(x_1, z_1), \dots, (x_L, z_L)\}$ and the posterior class distribution (PCD) is obtained by estimating the logistic regressor $\hat{\omega}$. Figure 10 shows the systematic flow of multi-class logistic regression.
The McLR is achieved as follows:
$$P(z_j = c \mid x_j, w) = \frac{\exp\left(w^{(c)} x_j\right)}{\sum_{c=1}^{K} \exp\left(w^{(c)} x_j\right)}$$
where $w^{(c)}$ is the logistic regressor for class c, the feature vectors are $x = (x_1, \dots, x_j)$ and the set of logistic regressors is $w^{(c)} = [w_1^{(c)}, \dots, w_K^{(c)}]^{T}$ for class c. The posterior of the regressor w is obtained as follows:
$$P(w \mid z_L, x_L) \propto p(z_L \mid x_L, w)\, p(w \mid x_L)$$
During testing, the posterior class probabilities of all feature vectors are determined by entering the estimated regressor into the McLR model. The class label of a test vector is the index of its maximum posterior probability. The results of scene classification using McLR are shown in Figure 11 and Figure 12.
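A compact Python sketch of multi-class logistic regression follows. It is illustrative only: the regressors are fitted by batch gradient descent with an L2 penalty, which corresponds to assuming a Gaussian prior on w, and the learning rate, penalty strength and iteration count are hypothetical values. The paper does not state which estimator or prior was used.

```python
import numpy as np

def softmax_posterior(x, w):
    """Posterior class probabilities; x is (n, d), w is (K, d), result is (n, K)."""
    scores = x @ w.T
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def fit_mclr(x, z, n_classes, lr=0.5, l2=1e-3, n_iter=500):
    """Fit the regressors by gradient descent on the penalized negative log-likelihood."""
    w = np.zeros((n_classes, x.shape[1]))
    onehot = np.eye(n_classes)[z]                  # integer labels z in 0..K-1
    for _ in range(n_iter):
        p = softmax_posterior(x, w)
        grad = (p - onehot).T @ x / len(x) + l2 * w
        w -= lr * grad
    return w

def predict(x, w):
    """Class label = index of the maximum posterior probability."""
    return softmax_posterior(x, w).argmax(axis=1)
```

A constant feature of 1 can be appended to each feature vector to act as a bias term, since the model above has no separate intercept.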

4. Experimental Setup and Evaluation

In this section, we present details of the experimental setup and evaluation. Object segmentation accuracy and computation time are used to evaluate the performance of the proposed system on challenging indoor and outdoor datasets. We used MATLAB to carry out the experiments on a system with a 2.5 GHz Intel Core i3 CPU and 8 GB of RAM. To evaluate the performance of the proposed scene classification system, we used three different datasets: MSRC [37], Corel-10k [38] and CVPR 67 [39]. For training and testing, we used leave-one-out cross-validation: each of the n observations is held out in turn for testing while the remaining n − 1 observations are used for training. Prediction weights are then observed for each held-out observation. The details of each dataset, the experimental results and comparisons of the proposed scene classification method with other state-of-the-art scene classification methods are given below.
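The leave-one-out protocol can be sketched in Python as a generic loop. The `fit` and `predict` callables are placeholders for any classifier, and the one-nearest-neighbor helpers in the usage part are toy stand-ins, not the classifier used in the paper.

```python
def leave_one_out_accuracy(x, z, fit, predict):
    """Hold out each observation once, train on the other n - 1, and score the held-out one."""
    hits = 0
    for i in range(len(x)):
        train = [k for k in range(len(x)) if k != i]
        model = fit([x[k] for k in train], [z[k] for k in train])
        hits += int(predict(model, x[i]) == z[i])
    return hits / len(x)

# Toy usage: a one-nearest-neighbor classifier on scalar features.
def fit_1nn(xs, zs):
    return list(zip(xs, zs))

def predict_1nn(model, query):
    return min(model, key=lambda pair: abs(pair[0] - query))[1]

accuracy = leave_one_out_accuracy(
    [0.0, 0.1, 0.2, 5.0, 5.1, 5.2], [0, 0, 0, 1, 1, 1], fit_1nn, predict_1nn
)
```

Because the model is refitted n times, the cost grows linearly with the number of observations times the training cost, which matters for the larger datasets.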

4.1. Dataset Descriptions

4.1.1. MSRC Dataset

The MSRC dataset contains 591 scene images. We used twenty classes for the experimental evaluation: flower, boat, sheep, dog, car, chair, cow, bird, road, body, grass, building, sky, tree, sign, cat, water, bicycle, book and duck. Figure 13 shows example images from the MSRC dataset. The dataset comprises various complicated scene images with a resolution of 213 × 320, each containing various objects.

4.1.2. Corel-10k Dataset

The Corel-10k dataset contains 10,000 scene images, which include multiple classes and have challenging images of different sizes and backgrounds. We performed experimental evaluations over twenty classes which included rhino, deer, car, water, building, elephant, plane, tree, tiger, bike, wolf, dog, boat, flower, bear, sky, land, cat, bird and fish. Figure 14 presents example images of the Corel-10k dataset.

4.1.3. CVPR 67 indoor Scene Dataset

The CVPR 67 dataset contains 67 indoor scene classes and 15,620 images in total, with at least 100 images per class. We performed the experimental evaluation on ten indoor scene classes (i.e., kitchen, bedroom, bathroom, corridor, elevator, locker-room, waiting-room, dining-room, game-room and garage). Figure 15 presents some example images of the CVPR 67 indoor scene dataset.

4.2. Experimental Results

In the experiments, mean classification accuracy and comparisons with existing methods were investigated over the indoor-outdoor scenes of all images. The proposed system achieved informative results owing to robust object segmentation techniques (i.e., MFCS and MSS), which translate into better performance in scene classification.

4.2.1. Experiment 1: Using the MSRC Dataset

The proposed system was first evaluated for scene classification accuracy on the MSRC dataset. Table 4 shows the accuracy obtained for the major scene classes of the MSRC dataset. Table 5 compares the classification accuracy of the proposed method with other state-of-the-art methods; the proposed method achieves significantly better results (88.75%) than all of them.

4.2.2. Experiment 2: Using the Corel-10k Dataset

In the experiments on the Corel-10k dataset, the proposed method was applied to 20 different scenes and obtained the highest classification accuracy score (85.75%), as shown in Table 6. Similarly, Table 7 shows that the proposed method has significantly higher recognition accuracy than other state-of-the-art methods such as VLAD, TNNV and LLC.

4.2.3. Experiment 3: Using the CVPR 67 Indoor Scene Dataset

In the experimental evaluation on the CVPR 67 indoor scene dataset, the proposed method achieved a scene classification accuracy of 80.02% over 10 different classes. The accuracy on the CVPR 67 dataset is lower than on the MSRC and Corel-10k datasets because of the multiple occluded objects in the real-world scenes of this dataset. When an object is hidden behind other objects, it is difficult to recognize due to this occlusion effect. Table 8 shows the confusion matrix of classification on the CVPR 67 dataset.

5. Conclusions

In this work, we proposed a new effective scene classification system that segments single/multiple objects and classifies complex indoor-outdoor scenes. With the proposed system, object segmentation problems were explored using two robust algorithms, MFCS and MSS. In addition, object similarity was examined by multiple kernel learning, and logistic regression was used for complex scene classification. Experimental evaluations reveal that our proposed system consistently outperforms other state-of-the-art systems in terms of computation time, segmentation and accuracy.
In future work, we will analyze scenery images in greater depth and reduce the computational complexity of scene classification. We will also investigate deep learning for indoor-outdoor scene classification to further improve classification accuracy and to expand the applicability of our work.

Author Contributions

Conceptualization: A.A.; methodology: A.J.; software: A.A.; validation: A.A. and A.J.; formal analysis: K.K.; resources: A.J. and K.K.; writing—review and editing: A.J. and K.K.; funding acquisition: A.J. and K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (no. 2018R1D1A1A02085645). It was also supported by a grant (19CTAPC152247-01) from the Technology Advancement Research Program funded by the Ministry of Land, Infrastructure and Transport of the Korean government.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, Y.; Gu, Y.; Yan, F.; Zhuang, Y. Outdoor Scene Understanding Based on Multi-Scale PBA Image Features and Point Cloud Features. Sensors 2019, 19, 4546.
  2. Chen, C.; Li, S.; Fu, X.; Ren, Y.; Chen, Y.; Kuo, C.C.J. Exploring confusing scene classes for the places dataset: Insights and solutions. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kuala Lumpur, Malaysia, 12 December 2017; pp. 550–558.
  3. Chen, L.; Cui, X.; Li, Z.; Yuan, Z.; Xing, J.; Xing, X.; Jia, Z. A New Deep Learning Algorithm for SAR Scene Classification Based on Spatial Statistical Modeling and Features Re-Calibration. Sensors 2019, 19, 2479.
  4. Chen, C.; Ren, Y.; Kuo, C.C.J. Outdoor scene classification using labeled segments. In Big Visual Data Analysis; Springer: Singapore, 2016; pp. 65–92.
  5. Susan, S.; Agrawal, P.; Mittal, M.; Bansal, S. New shape descriptor in the context of edge continuity. CAAI Trans. Intell. Technol. 2019, 4, 101–109.
  6. Chen, C.; Ren, Y.; Kuo, C.C.J. Large-scale indoor/outdoor image classification via expert decision fusion (edf). In Proceedings of the Asian Conference on Computer Vision, Los Angeles, CA, USA, 1 November 2014; pp. 426–442.
  7. Zhang, C.; Cheng, J.; Li, L.; Li, C.; Tian, Q. Object categorization using class-specific representations. IEEE Trans. Neu. Net. Learn. Sys. 2017, 29, 4528–4534.
  8. Rafique, A.A.; Jalal, A.; Ahmed, A. Scene Understanding and Recognition: Statistical Segmented Model using Geometrical Features and Gaussian Naïve Bayes. In Proceedings of the IEEE conference on International Conference on Applied and Engineering Mathematics, Taxila, Pakistan, 27 August 2019; pp. 225–230.
  9. Jalal, A.; Kamal, S.; Kim, D. A depth video sensor-based life-logging human activity recognition system for elderly care in smart indoor environments. Sensors 2014, 14, 11735–11759.
  10. Shokri, M.; Tavakoli, K. A review on the artificial neural network approach to analysis and prediction of seismic damage in infrastructure. Int. J. Hydromechatron. 2019, 4, 178–196.
  11. Sezgin, M.; Sankur, B. Survey over image thresholding techniques and quantitative performance evaluation. J. Elect. Imaging 2004, 13, 46–166.
  12. Sujji, G.E.; Lakshmi, Y.V.S.; Jiji, G.W. MRI brain image segmentation based on thresholding. Int. J. Adv. Comput. Res. 2013, 3, 97.
  13. Bi, S.; Liang, D. Human segmentation in a complex situation based on properties of the human visual system. Intell. Control Autom. 2006, 2, 9587–9590.
  14. Yan, M.; Cai, J.; Gao, J.; Luo, L. K-means cluster algorithm based on color image enhancement for cell segmentation. In Proceedings of the 5th International Conference on BioMedical Engineering and Informatics, Chongqing, China, 16 October 2012; pp. 295–299.
  15. Kamdi, S.; Krishna, R.K. Image segmentation and region growing algorithm. Int. J. Comput. Technol. Elect. Eng. 2012, 2, 103–107.
  16. Wong, S.C.; Stamatescu, V.; Gatt, A.; Kearney, D.; Lee, I.; McDonnell, M.D. Track everything: Limiting prior knowledge in online multi-object recognition. IEEE Trans. Image Proc. 2017, 26, 4669–4683.
  17. Sumbul, G.; Cinbis, R.G.; Aksoy, S. Multisource Region Attention Network for Fine-Grained Object Recognition in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4929–4937.
  18. Martin, S. Sequential bayesian inference models for multiple object classification. In Proceedings of the 14th International Conference on Information Fusion, Chicago, IL, USA, 5 July 2011; pp. 1–6.
  19. Lecumberry, F.; Pardo, A.; Sapiro, G. Multiple shape models for simultaneous object classification and segmentation. In Proceedings of the 16th IEEE International Conference on Image Processing, Cairo, Egypt, 7–10 November 2009; pp. 3001–3004.
  20. Jalal, A.; Kim, Y.H.; Kim, Y.J.; Kamal, S.; Kim, D. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognit. 2017, 61, 295–308.
  21. Shi, J.; Zhu, H.; Yu, S.; Wu, W.; Shi, H. Scene Categorization Model Using Deep Visually Sensitive Features. IEEE Access 2019, 7, 45230–45239.
  22. Zhang, C.; Cheng, J.; Tian, Q. Multiview, Few-Labeled Object Categorization by Predicting Labels with View Consistency. IEEE Trans. 2019, 49, 3834–3843.
  23. Zhou, L.; Zhou, Z.; Hu, D. Scene classification using a multi-resolution bag-of-features model. Pattern Recognit. 2013, 46, 424–433.
  24. Hayat, M.; Khan, S.H.; Bennamoun, M.; An, S. A Spatial Layout and Scale Invariant Feature Representation for Indoor Scene Classification. IEEE Trans. Image Proc. 2016, 25, 4829–4841.
  25. Zou, J.; Li, W.; Chen, C.; Du, Q. Scene classification using local and global features with collaborative representation fusion. Inf. Sci. 2016, 348, 209–226.
  26. Ismail, A.S.; Seifelnasr, M.M.; Guo, H. Understanding Indoor Scene: Spatial Layout Estimation, Scene Classification, and Object Detection. In Proceedings of the 3rd International Conference on Multimedia Systems and Signal Processing, Shenzhen, China, 28 April 2018; pp. 64–70.
  27. Tingting, Y.; Junqian, W.; Lintai, W.; Yong, X. Three-stage network for age estimation. CAAI Trans. Intell. Technol. 2019, 4, 122–126.
  28. Mahajan, S.M.; Dubey, Y.K. Color image segmentation using kernalized fuzzy c-means clustering. In Proceedings of the 2015 Fifth International Conference on Communication Systems and Network Technologies, Gwalior, India, 4–6 April 2015; pp. 1142–1146.
  29. Zhu, C.; Miao, D. Influence of kernel clustering on an RBFN. CAAI Trans. Intell. Technol. 2019, 4, 255–260.
  30. Gandhi, N.J.; Shah, V.J.; Kshirsagar, R. Mean shift technique for image segmentation and Modified Canny Edge Detection Algorithm for circle detection. In Proceedings of the 2014 International Conference on Communication and Signal Processing, Melmaruvathur, India, 3–5 April 2014; pp. 246–250.
  31. Wiens, T. Engine speed reduction for hydraulic machinery using predictive algorithms. Int. J. Hydromechatron. 2019, 1, 16–31.
  32. Durand, T.; Picard, D.; Thome, N.; Cord, M. Semantic pooling for image categorization using multiple kernel learning. In Proceedings of the 2014 IEEE International Conference on Image Processing, Paris, France, 27–30 October 2014; pp. 170–174.
  33. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645.
  34. Osterland, S.; Weber, J. Analytical analysis of single-stage pressure relief valves. Int. J. Hydromechatron. 2019, 2, 32–53.
  35. Nowozin, S. Optimal decisions from probabilistic models: The intersection-over-union case. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 548–555.
  36. Behadada, O.; Trovati, M.; Chikh, M.A.; Bessis, N.; Korkontzelos, Y. Logistic regression multinomial for arrhythmia detection. In Proceedings of the 2016 IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W), Augsburg, Germany, 12–16 September 2016; pp. 133–137.
  37. Shotton, J.; Winn, J.; Rother, C.; Criminisi, A. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 1–15.
  38. Liu, G.H.; Yang, J.Y.; Lo, Z.Y. Content-based image retrieval using computational visual attention model. Pattern Rec. 2015, 48, 2554–2566.
  39. Quattoni, A.; Torralba, A. Recognizing indoor scenes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20 June 2009; pp. 413–420.
  40. Irie, G.; Liu, D.; Li, Z.; Chang, S.F. A bayesian approach to multimodal visual dictionary learning. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 329–336.
  41. Mottaghi, R.; Fidler, S.; Yuille, A.; Urtasun, R.; Parikh, D. Human-machine CRFs for identifying bottlenecks in scene understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 74–87.
  42. Liu, X.; Yang, W.; Lin, L.; Wang, Q.; Cai, Z.; Lai, J. Data-driven scene understanding with adaptively retrieved exemplars. IEEE Multidiscip. 2015, 22, 82–92.
  43. Jegou, H.; Perronnin, F.; Douze, M.; Sánchez, J.; Perez, P.; Schmid, C. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 1704–1716.
  44. Long, X.; Lu, H.; Peng, Y.; Wang, X.; Feng, S. Image classification based on improved VLAD. Multimed. Tools Appl. 2016, 75, 5533–5555.
  45. Cheng, C.; Long, X.; Li, Y. VLAD Encoding Based on LLC for Image Classification. In Proceedings of the 2019 11th International Conference on Machine Learning and Computing, Zhuhai, China, 22–24 February 2019; pp. 417–422.
Figure 1. Overview of the proposed scene classification system using Multi-class Logistic Regression.
Figure 2. Preprocessing steps: (a) noisy images and (b) noise-filtered images from the MSRC dataset.
Figure 3. A few examples of super pixel-based pre-segmentation.
Figure 4. Single and multiple objects segmentation using MFCS. The 1st row shows the original while the 2nd row shows the segmentation results.
Figure 5. Results of Mean shift-based segmentation. The 1st row shows the original while the 2nd row shows the segmentation results.
Figure 6. Comparison of object segmentation on example images: (a) original images, (b) ground truth, (c) MFCS results and (d) MSS results over the MSRC dataset.
Figure 7. Local feature descriptor, (a) original images, (b) HOG feature extraction and (c) SIFT feature extraction results over MSRC dataset. The 1st row shows the locations while the 2nd row shows the scale and orientation of key points.
Figure 8. Results of object categorization using multiple kernel learning.
Figure 9. Demonstration of EIOU score over multiple objects at MSRC dataset.
Figure 10. Flow architecture of multi-class logistic regression.
Figure 11. Some examples of object classification in outdoor scenes using the McLR algorithm over the MSRC object dataset.
Figure 12. Some examples of indoor scene classification using the McLR algorithm over the CVPR 67 indoor scene dataset.
Figure 13. Example images from the MSRC dataset.
Figure 14. Example images from the Corel-10k dataset.
Figure 15. Example images from the CVPR 67 indoor scene dataset.
Table 1. Objects Segmentation Accuracy of MFCS Algorithm over MSRC Dataset.
| Classes | fl | bo | sh | do | ca | co | bi |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 92.3 | 88.6 | 96.4 | 94.6 | 82.7 | 94 | 87 |
| Classes | ro | bd | gr | ch | du | bu | sk |
| Accuracy (%) | 83.3 | 86.8 | 89.3 | 79.9 | 88.4 | 84.8 | 87 |
| Classes | tr | si | ct | wt | bc | bk | |
| Accuracy (%) | 84.4 | 78.2 | 87.9 | 92 | 79.8 | 78 | |

Mean Segmentation Accuracy = 86.77%
fl = flower; bo = boat; sh = sheep; do = dog; ca = car; co = cow; bi = bird; ro = road; bd = body; gr = grass; ch = chair; du = duck; bu = building; sk = sky; tr = tree; si = sign; ct = cat; wt = water; bc = bicycle; bk = book.
Table 2. Comparison of Computation Time of Objects Segmentation Algorithms over MSRC Dataset.
| Class | MFCS (s) | MSS (s) | Class | MFCS (s) | MSS (s) |
|---|---|---|---|---|---|
| fl | 76.5 | 78.2 | ch | 84.6 | 98.7 |
| bo | 47.4 | 47.8 | bu | 92.3 | 93.4 |
| sh | 65.9 | 71.2 | sk | 32.7 | 35.5 |
| do | 35.2 | 43.5 | tr | 54.5 | 61.8 |
| ca | 45.8 | 46.1 | si | 46.7 | 47.0 |
| co | 97.5 | 101.5 | ct | 63.1 | 65.2 |
| bi | 41.1 | 43.7 | wt | 29.8 | 33.5 |
| ro | 52.6 | 53.1 | bc | 36.2 | 41.8 |
| bd | 39.2 | 42.9 | bk | 54.7 | 52.1 |
| gr | 51.4 | 52.2 | du | 172.9 | 201.5 |

Mean computational time of the MFCS algorithm = 61.00 s
Mean computational time of the MSS algorithm = 65.53 s
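The two mean times reported for Table 2 can be checked by averaging the per-class values. The sketch below is only an illustration of that arithmetic, not the authors' code: the dictionary names and the `mean` helper are hypothetical, and the numbers are copied from the table as extracted here. With these values, MSS is faster than MFCS for only one class (bk).

```python
# Per-class segmentation times in seconds, copied from Table 2 (MSRC dataset).
# The variable names are illustrative assumptions, not from the paper.
mfcs_seconds = {
    "fl": 76.5, "bo": 47.4, "sh": 65.9, "do": 35.2, "ca": 45.8,
    "co": 97.5, "bi": 41.1, "ro": 52.6, "bd": 39.2, "gr": 51.4,
    "ch": 84.6, "bu": 92.3, "sk": 32.7, "tr": 54.5, "si": 46.7,
    "ct": 63.1, "wt": 29.8, "bc": 36.2, "bk": 54.7, "du": 172.9,
}
mss_seconds = {
    "fl": 78.2, "bo": 47.8, "sh": 71.2, "do": 43.5, "ca": 46.1,
    "co": 101.5, "bi": 43.7, "ro": 53.1, "bd": 42.9, "gr": 52.2,
    "ch": 98.7, "bu": 93.4, "sk": 35.5, "tr": 61.8, "si": 47.0,
    "ct": 65.2, "wt": 33.5, "bc": 41.8, "bk": 52.1, "du": 201.5,
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


mfcs_mean = mean(mfcs_seconds.values())
mss_mean = mean(mss_seconds.values())

# Classes for which MSS needed less time than MFCS.
mss_faster = [c for c in mfcs_seconds if mss_seconds[c] < mfcs_seconds[c]]

print(f"MFCS mean: {mfcs_mean:.3f} s, MSS mean: {mss_mean:.3f} s")
print("MSS faster for:", mss_faster)
```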
Table 3. Comparison of Computation Time for Object Segmentation Algorithms over Corel-10k Dataset.
| Class | MFCS (s) | MSS (s) | Class | MFCS (s) | MSS (s) |
|---|---|---|---|---|---|
| rh | 112.0 | 131.2 | wo | 130.6 | 149.5 |
| dr | 130.1 | 143.5 | do | 129.1 | 148.2 |
| ca | 91.7 | 105.0 | bo | 150 | 168.9 |
| wa | 87.4 | 99.3 | fl | 114.5 | 126.1 |
| bu | 171.0 | 188.9 | be | 145.8 | 166.0 |
| el | 96.5 | 114.2 | sk | 89.0 | 104.5 |
| ai | 150.2 | 170.3 | la | 97.5 | 113.2 |
| tr | 94.1 | 105.9 | ct | 122.9 | 143.9 |
| ti | 133.2 | 156.3 | bd | 131.2 | 157.0 |
| bi | 170.9 | 199.2 | fi | 135.0 | 162.7 |

Mean computational time of the MFCS algorithm = 124.13 s
Mean computational time of the MSS algorithm = 142.69 s
rh = rhino; dr = deer; ca = car; wa = water; bu = building; el = elephant; ai = airplane; tr = tree; ti = tiger; bi = bike; wo = wolf; do = dog; bo = boat; fl = flower; be = beer; sk = sky; la = land; ct = cat; bd = bird; fi = fish.
Table 4. Confusion matrix of accuracy scores for object classification in outdoor scenes for the proposed approach and the MSRC dataset.
Table 4. Confusion matrix of accuracy scores for object classification in outdoor scenes for the proposed approach and the MSRC dataset.
flboshdocacobirobdgrchdubusktrsictwtbcBk
fl0.95000000.1000.400000.100000
bo00.8900000.100000.50000.100.400
sh000.920.200.500000000000.1000
do000.20.890000000000000.7000
ca00.2000.8400.70.500000.30000000
co000.7000.9300000000000000
bi0000000.900000000.90000.100
ro00000000.87000000000000
bd00000.20000.890.1000.3000.2000.10.2
gr0000000000.910000.20.900000
ch00000.1000.3000.8800.4000.20000.2
du000000000000.8500000000
bu0000000000.2000.8800000.900.1
sk0000000000000.30.870000.900.1
tr0000000000.900000.8800000
si00000000.300000.4000.890000.4
ct00000.900000000000.20.88000.1
wt00.2000000000000.80000.9000
bc00000.6000.100000.2000.2000.890
bk0.10000000000.500.2000.80000.84
fl = flower; bo = boat; sh = sheep; do = dog; ca = car; co = cow; bi = bird; ro = road; bd = body; gr = grass; ch = chair; du = duck; bu = building; sk = sky; tr = tree; si = sign; ct = cat; wt = water; bc = bicycle; bk = book.
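The overall MSRC accuracy quoted for the proposed method (88.75%) is consistent with the unweighted mean of the diagonal entries of Table 4. The following sketch only illustrates that check; it assumes each diagonal entry is the per-class accuracy, and the names `msrc_diagonal` and `mean_accuracy_percent` are hypothetical rather than taken from the paper.

```python
# Diagonal (per-class accuracy) entries of the Table 4 confusion matrix,
# in the table's class order: fl, bo, sh, do, ca, co, bi, ro, bd, gr,
# ch, du, bu, sk, tr, si, ct, wt, bc, bk.
msrc_diagonal = [
    0.95, 0.89, 0.92, 0.89, 0.84, 0.93, 0.90, 0.87, 0.89, 0.91,
    0.88, 0.85, 0.88, 0.87, 0.88, 0.89, 0.88, 0.90, 0.89, 0.84,
]


def mean_accuracy_percent(diagonal):
    """Unweighted mean of per-class accuracies, expressed in percent."""
    return 100.0 * sum(diagonal) / len(diagonal)


msrc_mean = mean_accuracy_percent(msrc_diagonal)
print(f"Mean MSRC accuracy: {msrc_mean:.2f}%")
```

Note that this is an unweighted (macro) average; if the classes had different numbers of test images, a sample-weighted accuracy could differ.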
Table 5. Comparison of the proposed method with other state-of-the-art methods using the MSRC dataset.
| Methods | Classification Accuracy (%) |
|---|---|
| Bayesian model [40] | 82.9 |
| Scene classification using machine performance [41] | 81.0 |
| Scene classification with weighted method [42] | 84.7 |
| Proposed Method | 88.75 |
Table 6. Confusion matrix of accuracy for object classification of outdoor scenes for the proposed approach using the Corel-10k dataset.
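| | rh | de | ca | wt | bu | el | ai | tr | ti | bi | wl | do | bo | fl | be | sk | la | ct | bd | fi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rh | 0.87 | 0 | 0 | 0 | 0 | 0.9 | 0 | 0 | 0.2 | 0 | 0.1 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| de | 0.3 | 0.76 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0.8 | 0 | 0.4 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 |
| ca | 0 | 0 | 0.83 | 0 | 0.7 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| wt | 0 | 0 | 0 | 0.91 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0.7 | 0 | 0 | 0 | 0 |
| bu | 0 | 0 | 0 | 0.3 | 0.84 | 0 | 0.7 | 0 | 0 | 0.3 | 0 | 0 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| el | 0.5 | 0 | 0 | 0 | 0 | 0.93 | 0 | 0 | 0.1 | 0 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ai | 0 | 0 | 0.3 | 0 | 0.5 | 0 | 0.90 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tr | 0 | 0 | 0 | 0.2 | 0 | 0 | 0 | 0.91 | 0 | 0 | 0 | 0 | 0 | 0.6 | 0 | 0.1 | 0 | 0 | 0 | 0 |
| ti | 0.1 | 0 | 0.2 | 0 | 0 | 0.3 | 0 | 0 | 0.89 | 0 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 |
| bi | 0 | 0 | 0.9 | 0 | 0.2 | 0 | 0.4 | 0 | 0 | 0.79 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| wl | 0 | 0.1 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0.6 | 0 | 0.88 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| do | 0 | 0.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0.6 | 0 | 0.2 | 0.87 | 0 | 0 | 0 | 0 | 0 | 0.3 | 0 | 0 |
| bo | 0 | 0 | 0 | 0.4 | 0.5 | 0 | 0.3 | 0 | 0 | 0.3 | 0 | 0 | 0.83 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2 |
| fl | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.9 | 0 | 0.2 | 0 | 0 | 0 | 0.84 | 0 | 0.2 | 0.3 | 0 | 0 | 0 |
| be | 0.2 | 0 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0.3 | 0 | 0 | 0.2 | 0 | 0 | 0.90 | 0 | 0 | 0 | 0 | 0 |
| si | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8 | 0 | 0.89 | 0.1 | 0 | 0 | 0.2 |
| sk | 0 | 0 | 0 | 0.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0 | 0 | 0 | 0.88 | 0 | 0 | 0.2 |
| la | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0.4 | 0 | 0 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0.4 | 0.83 | 0 | 0 |
| bd | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0.4 | 0 | 0 | 0.83 | 0.3 |
| fi | 0 | 0 | 0 | 0.9 | 0 | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 | 0.2 | 0.4 | 0 | 0 | 0.5 | 0.1 | 0 | 0.77 |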
rh = rhino; de = deer; ca = car; wt = water; bu = building; el = elephant; ai = airplane; tr = tree; ti = tiger; bi = bike; wl = wolf; do = dog; bo = boat; fl = flower; be = beer; sk = sky; la = land; ct = cat; bd = bird; fi = fish.
Table 7. Comparison of the proposed method with other state-of-the-art methods using the Corel-10k dataset.
| Methods | Classification Accuracy (%) |
|---|---|
| VLAD [43] | 80.0 |
| TNNVLAD [44] | 81.0 |
| VLAD + LLC [45] | 83.7 |
| Proposed Method | 85.75 |
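To make the comparisons in Tables 5 and 7 concrete, the margin of the proposed method over the strongest baseline on each dataset can be computed directly from the reported accuracies. This is a minimal sketch of that arithmetic only; the dictionary layout and the `margin_over_best` helper are hypothetical, and no statistical significance is implied by the resulting differences.

```python
# Reported classification accuracies (%) copied from Tables 5 and 7.
baselines = {
    "MSRC": {
        "Bayesian model [40]": 82.9,
        "Scene classification using machine performance [41]": 81.0,
        "Scene classification with weighted method [42]": 84.7,
    },
    "Corel-10k": {
        "VLAD [43]": 80.0,
        "TNNVLAD [44]": 81.0,
        "VLAD + LLC [45]": 83.7,
    },
}
proposed = {"MSRC": 88.75, "Corel-10k": 85.75}


def margin_over_best(dataset):
    """Return the best baseline name and the proposed method's margin
    over it, in percentage points."""
    best_name, best_acc = max(baselines[dataset].items(), key=lambda kv: kv[1])
    return best_name, proposed[dataset] - best_acc


for name in proposed:
    best, margin = margin_over_best(name)
    print(f"{name}: +{margin:.2f} points over {best}")
```

With the values above, the margins are about 4.05 percentage points on MSRC and about 2.05 percentage points on Corel-10k.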
Table 8. Confusion Matrix of scene classification accuracy for the proposed approach using the CVPR 67 indoor scene dataset.
| Class | Accuracy | Class | Accuracy | Class | Accuracy |
|---|---|---|---|---|---|
| kitchen | 0.89 | grocery store | 0.79 | nursery | 0.83 |
| bedroom | 0.85 | florist | 0.82 | train station | 0.82 |
| bathroom | 0.87 | church inside | 0.83 | laundromat | 0.79 |
| corridor | 0.76 | auditorium | 0.82 | stairs case | 0.81 |
| elevator | 0.80 | buffet | 0.77 | gym | 0.78 |
| locker room | 0.78 | class room | 0.81 | tv studio | 0.76 |
| waiting room | 0.81 | green house | 0.75 | pantry | 0.80 |
| dining room | 0.83 | bowling | 0.79 | pool inside | 0.77 |
| game room | 0.79 | cloister | 0.83 | inside subway | 0.79 |
| garage | 0.82 | concert hall | 0.81 | wine cellar | 0.77 |
| lobby | 0.77 | computer room | 0.80 | fast food restaurant | 0.76 |
| office | 0.79 | dental office | 0.84 | bar | 0.82 |
| mall | 0.81 | library | 0.79 | clothing store | 0.81 |
| laboratory wet | 0.77 | inside bus | 0.77 | casino | 0.83 |
| jewelry shop | 0.79 | closet | 0.81 | deli | 0.79 |
| museum | 0.82 | studio music | 0.79 | book store | 0.80 |
| living room | 0.77 | lobby | 0.80 | children room | 0.82 |
| movie theater | 0.83 | prison cell | 0.84 | hospital room | 0.79 |
| toy store | 0.80 | hair saloon | 0.80 | kinder garden | 0.77 |
| operating room | 0.82 | subway | 0.81 | shoe shop | 0.76 |
| airport inside | 0.79 | warehouse | 0.77 | restaurant kitchen | 0.78 |
| art studio | 0.80 | meeting room | 0.82 | bakery | 0.79 |
| video store | 0.76 | | | | |

Mean Scene Classification Accuracy = 80.02%