Mask-Aware Semi-Supervised Object Detection in Floor Plans

Abstract: Research on object detection using semi-supervised methods has been growing in the past few years. We examine the intersection of these two areas for floor-plan objects to promote the research objective of detecting objects more accurately with less labeled data. Floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student–teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network uses both the unlabeled data with these pseudo-boxes and the labeled data as ground truth for training. It learns representations of furniture items by combining labeled and unlabeled data. On the Mask R-CNN detector with a ResNet-101 backbone network, the proposed approach achieves a mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5%, and 10% labeled data, respectively. Our experiment affirms the efficiency of the proposed approach, as it outperforms the previous semi-supervised approaches using only 1% of the labels.


Introduction
Semi-supervised learning-based research has received growing attention in the past few years, as it can use unlabeled data to increase model performance when it is impossible to annotate large datasets. The first line of semi-supervised learning uses consistency-based self-learning approaches [1,2]. The main idea is to create artificial labels and then predict those self-generated labels by training the model on unlabeled data with stochastic augmentations. Those self-generated labels can be the network's predictive distribution or a one-hot prediction. The second improvement in semi-supervised learning is the variety of available data augmentation techniques. Data augmentation boosts the performance of the training network [3,4] and is also effective for consistency-based learning [2,5]. Augmentation approaches have progressed from image transformations such as cropping, flipping, scaling, brightness, colour, contrast, saturation, translation, and rotation to image generation [6][7][8] and reinforcement-learning-based training [9,10]. Previously, researchers applied supervised learning techniques for floor-plan object detection. We use a semi-supervised approach for floor-plan analysis, which outperforms the previous semi-supervised approaches using only 1% of the labeled data.
The floor-plan object detection problem has high value because of its use in numerous applications such as property value estimation and furniture placement and design. Floor-plan objects include furniture items, windows, doors, and walls. Humans can readily recognize floor-plan objects, but automatically recognizing and detecting them is challenging because of the similarity between room types and furniture items. For example, the drawing room contains a limited number of furniture items, and the furniture categories of the kitchen and dining room are almost similar. There are many applications of floor-plan object detection, such as 3D reconstruction of floor plans [11] and similarity search [12]. Floor-plan object detection is a prerequisite for floor-plan analysis applications. Figure 1 gives an overview of a floor-plan layout with different furniture items, illustrating room sizes and furniture categories. The top-left room is the dining room, where a single round table is present. The top-right room contains a kitchen with a bathroom. The next room is a living area with different sofa items. Thus, all other rooms are named according to their furniture items. This floor-plan categorization can help with furniture installation. Semi-supervised object detection needs a small amount of labeled data together with unlabeled data. There are some multi-stage approaches [13,14] that use labeled data for training in the first stage, followed by unlabeled data for generating pseudo-labels, and then retraining on the unannotated data. Model performance depends on the accuracy of the generated pseudo-labels, but the available training data is small, which reduces model efficiency. To make use of the unlabeled data, we generate pseudo-labels using a semi-supervised approach and then use these pseudo-labels together with a small portion, such as 1%, of labeled data to train the model.
We randomly sample unlabeled and labeled data such that both portions include all classes present in the available data. We used two models for our experiment on the floor-plan dataset: the first is the training detector, and the second generates pseudo-labels for unlabeled data. This approach simplifies multi-stage training. Further, it uses the flywheel effect [15], in which the pseudo-label generator and the training detector boost each other, improving model performance as training iterations increase. Another important benefit of this approach is that more weight is given to the pseudo-label generator model than to the training detector model, as it guides the training model instead of providing hard category labels, as in earlier techniques [13,14]. This approach is also proposed in the Soft Teacher model [16]. In this network, the teacher model acts as the pseudo-label generator, and the student model is the training detector.
Using a semi-supervised approach, we detected objects at the pixel level, such as different furniture items in the floor-plan data. We used Mask R-CNN [17] with a Feature Pyramid Network (FPN) [18] as the detection method, with ResNet-50 [19] and ResNet-101 backbone networks pre-trained on the ImageNet dataset [4], under the student-teacher approach. We used 1%, 5%, and 10% of the floor-plan images as labeled data for training and the rest of the images as unlabeled data. We used five data folds for each percentage level and calculated the final performance by taking the average of these five folds. Figure 2 compares the performance of both backbones under the different percentages of labeled data. The deepening colour of the bars indicates the increase in the percentage of labeled data. We obtain 98.8% mAP, 99.7% mAP, and 99.8% mAP on Mask R-CNN [17] with the ResNet-101 [19] backbone network for 1%, 5%, and 10% labeled floor-plan images, respectively. This paper provides an end-to-end semi-supervised approach for object detection in the floor-plan domain. The main contributions of this work are as follows:
• We present a Mask R-CNN [17]-based semi-supervised trainable network with ResNet-50 [19] and ResNet-101 backbone networks for object detection in the floor-plan domain.

• The Mask R-CNN [17]-based semi-supervised approach improves the state-of-the-art performance on the publicly available floor-plan dataset named SFPI [20], using only 1% of the labels.

Figure 2. Comparison of the performance of the ResNet-50 [19] and ResNet-101 backbone networks using the Mask R-CNN [17] framework under different labeled-data settings.
The remainder of the paper is arranged as follows. Section 2 reviews previous research on semi-supervised learning and floor-plan datasets. Section 3 explains the methodology, and Section 4 briefly discusses the dataset. Section 5 describes the experimental setup. Section 6 discusses the evaluation metrics. In Section 7, we analyze the experimental results. Finally, Section 8 summarizes the experimental work and outlines future directions.

Related Work
Object detection and semi-supervised learning are essential steps toward floor-plan image analysis. This section overviews previous work in these domains and contains three parts. The first part describes the literature on object detection. The second part explains previous semi-supervised approaches. Finally, we review the literature on the floor-plan domain.

Object Detection and Its Applications
Object detection is a major computer-vision domain in which extensive work has been conducted in the past few years, with two main types of detectors: single-stage [21][22][23] and two-stage [17,24,25]. Two-stage detectors extract object regions in the first stage and then classify and localize the objects in the second stage. These detectors, such as Faster R-CNN [25], first generate region proposals, making a separate prediction for every object in the image. In contrast, single-stage detectors perform classification and localization in one pass through the neural network. The basic difference between these detector types is the cascade filter for object proposals. These detectors provide good results given a large amount of labeled data and are used in many fields for instance segmentation [26] and object detection, such as face detection [27] and pedestrian detection [28]. They are also used in document analysis for formula detection [29], table detection [30,31], and other page-object detection [32].

Semi-Supervised Learning
Semi-supervised image classification has two types: pseudo-label-based learning and consistency-based learning. Consistency-based learning [33][34][35] examines the similarity between original and augmented images, encouraging perturbations of the same image to produce similar labels. There are different methods to apply perturbations, using noise [33], augmentation [35], and adversarial training [34]. In [36], the author ensembles predictions across training steps to assemble the training target. In [37], the author instead takes an ensembled weighted average of the model weights rather than the predictions, called the exponential moving average (EMA). In [5,38], the authors annotated the unlabeled images with pseudo-labels using the classification model and then retrained the detector on this pseudo-labeled data. The effect of data augmentation on semi-supervised learning is analyzed in [2,39].
Semi-supervised object detection also has two types: pseudo-label-based learning [14,40] and consistency-based learning [41,42]. In [14,40], labels generated from different augmented images are ensembled to predict labels for unlabeled images. In [43], pseudo-labels are generated by training SelectiveNet [44]. In [45], boxes detected on an unlabeled image are attached to a labeled image, and the author calculates a localization-consistency estimate for the attached unlabeled image; this needs a deep detection procedure [45], as the image itself is changed. Recently, intricate augmentation approaches, including CTAugment [46] and RandAugment [47], have proven very effective for semi-supervised object detection [1,2].

Floor-Plan Analysis
Research on object detection in floor-plan data is growing because of its use in numerous applications such as property value estimation and furniture placement and design. Ghorbel et al. [48] proposed a handwritten floor-plan recognition model. This network provides a CAD model for floor-plan data. In [49], the author proposed a room detection model for the floor-plan dataset. Moreover, [50] proposed a model for understanding floor plans using the Hough transform and subgraph isomorphism. Several graphics recognition methods are applied to identify the basic structure, also taking human feedback into account during the analysis phase.
In [51], the author used a deep learning network to parse floor-plan images. The author applied Cascade Mask R-CNN [52] to obtain floor-plan information and keypoint-CNN for segmentation to extract accurate corner locations and obtained the final segmentation results after post-processing. In [53], textural information is extracted from floor-plan images. This work is helpful for visually impaired people to analyze house design and for customers to buy a house online. The morphological closure is applied to detect the walls of the floor-plan image, the flood fill method to detect corners, and scale-invariant features for door identification. After extracting all this information, the author applied text synthesis techniques.
In [54], the author proposed an object recognition method for floor-plan images. The main target is to recognize floor-plan items such as windows, walls, rooms, doors, and furniture items. To extract features, the VGG network [55] is used. It recognizes room types based on the furniture items present in the room. However, room-type identification does not demonstrate good results, as the variation in furniture items is small. It also detects room boundaries for doors, windows, and walls, which gives good results.
Liu et al. [56] detected corner points in the floor-plan dataset using a deep network and then used integer programming to detect the walls of different rooms by combining those corner points. However, this approach can only recognize the walls of rectangular rooms with uniform thickness; it relies on the Manhattan assumption, which aligns the walls with the two main axes of a floor-plan image. Yamasaki et al. [57] applied a fully convolutional network (FCN) to label pixels for detecting similarly structured houses by forming a graph model of the floor-plan dataset with different classes. Their method ignores spatial relations between different classes, as it detects the pixels of each class separately with a simple segmentation network.
In [58], Faster R-CNN [25] is used to detect kitchen items such as stoves, sliding doors, simple doors, and bathtubs, and a fully convolutional network (FCN) is then adopted to detect wall pixels. The authors also estimated the sizes of the different rooms by recognizing text using a library tool. Macé et al. [49] used the Hough transform to identify doors and walls in floor-plan images. In [59], the author used a pixel-based segmentation approach to detect doors, walls, and windows, and a bag-of-words (BOW) network to classify image patches. They trained on these patches to generate graphs for detecting walls. In [11], the author detected walls by recognizing parallel lines, determined room size by calculating the distance between the parallel lines, and estimated wall thickness by clustering the distance values.

Method
The experiment is performed on Mask R-CNN [17] with ResNet-50 [19] and ResNet-101 backbones. We use this model with convolutional neural networks (CNNs) and a student-teacher network. In this section, we explain the individual modules of the experiment.

Mask R-CNN
Mask R-CNN [17] is an extended version of Faster R-CNN [25] with a new branch that provides masks for the detected objects, alongside the two existing branches for classification and regression. This branch is applied to RoIs (Regions of Interest) to handle detection at the pixel level and segment each instance accurately. The basic architecture of Mask R-CNN is identical to Faster R-CNN, as it uses a similar architecture to generate object proposals. The major difference is that Mask R-CNN uses an RoI-Align layer rather than an RoI-Pooling layer to reduce pixel-level misalignment caused by spatial quantization. Generally, the training of Mask R-CNN [17] and Faster R-CNN [25] is identical. For accuracy and speed, we prefer ResNet-101 [19] as the backbone with a Feature Pyramid Network (FPN) [60]. We create a mask for each class for pixel-level classification to reduce inter-class similarity. We create the ground truth for the mask using the object width, height, and bounding-box coordinates. The masks of all classes are square boxes with four corner points (x_min, y_min), (x_max, y_min), (x_max, y_max), and (x_min, y_max), where (x_min, y_min) is the first corner point of the mask and the other corner points are obtained by adding the width and height of the bounding box to the first corner point. The model learns the mask of each class separately with the average binary cross-entropy loss shown in the following Equation (1).
L_mask = −(1/M²) Σ_{1≤l,m≤M} [ y_lm log(ŷ^n_lm) + (1 − y_lm) log(1 − ŷ^n_lm) ]  (1)

where y_lm is the label of pixel (l, m) in the true mask of area M × M and ŷ^n_lm is the estimated value of the same pixel for the ground-truth class n. The loss function of Mask R-CNN [17] is the combination of localization, classification, and segmentation mask losses, where the classification and localization losses are the same as in Faster R-CNN [25].
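As a minimal sketch, assuming the ground-truth and predicted masks are M × M arrays (the prediction being the mask head's sigmoid output for the ground-truth class), the average binary cross-entropy of Equation (1) can be written as:

```python
import numpy as np

def mask_bce_loss(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over an M x M mask.

    y_true: (M, M) array of {0, 1} ground-truth pixel labels.
    y_pred: (M, M) array of predicted probabilities for the
            ground-truth class n.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # numerical stability
    bce = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return bce.mean()

# A confident, correct mask prediction yields a loss near zero,
# while an uninformative 0.5-everywhere prediction yields log(2).
gt = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_good = mask_bce_loss(gt, np.array([[0.99, 0.01], [0.01, 0.99]]))
loss_bad = mask_bce_loss(gt, np.full((2, 2), 0.5))
```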

Backbone Network
With increasing depth, plain networks show a performance drop on both training and test data. This degradation is not caused by overfitting; rather, network initialization and exploding or vanishing gradients can cause this problem. Residual networks are easier to optimize than plain networks, whose training error increases as more layers are added. The ResNet-50 [19] network is formed by replacing the 2-layer blocks of ResNet-34 with 3-layer blocks, and it has higher accuracy than ResNet-34. ResNet-101 deepens this design with additional 3-layer blocks. We use the ResNet-50 [19] and ResNet-101 backbone networks for this semi-supervised experiment. Figure 3 shows the Mask R-CNN [17] framework with the ResNet-101 backbone. The network obtains a convolutional feature map from the backbone, generates anchors with a sliding window, and predicts regions with the Region Proposal Network (RPN). Then, a pooling step resizes the regions, and fully connected layers produce three outputs: a mask, a softmax classification, and a bounding-box regression.
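The 3-layer ("bottleneck") block with an identity shortcut is the unit that ResNet-50 and ResNet-101 stack. The following toy sketch uses plain matrix multiplications as stand-ins for the 1×1 and 3×3 convolutions; it is only meant to illustrate why the shortcut keeps deep stacks optimizable, not to reproduce the actual backbone:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bottleneck_block(x, w1, w2, w3):
    """Toy 3-layer ResNet bottleneck block: 1x1 reduce -> 3x3 -> 1x1 expand
    (matmuls stand in for the convolutions), with the identity shortcut
    added before the final ReLU."""
    out = relu(x @ w1)    # 1x1 reduce
    out = relu(out @ w2)  # 3x3 conv stand-in
    out = out @ w3        # 1x1 expand
    return relu(out + x)  # identity shortcut: signal flows through `+ x`

# With all-zero weights the block reduces to the identity on
# non-negative inputs, so stacking many blocks cannot hurt optimization
# the way stacking plain layers can.
x = np.array([1.0, 2.0, 3.0])
w0 = np.zeros((3, 3))
y = bottleneck_block(x, w0, w0, w0)
```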

Semi-Supervised Model
Creating pseudo-labels for object detection is more challenging than for image classification, where a simple probability distribution is used for pseudo-label generation. To obtain high-quality pseudo-labels and avoid overfitting, strong augmentation is applied to the student model and weak augmentation to the teacher model. Model performance depends on the quality of the pseudo-labels. Setting a high threshold on the foreground score provides better results than a low threshold; we obtain the best results with a threshold value of 0.9. However, while a high threshold value gives good foreground precision, the recall of box candidates decreases quickly. Suppose we apply intersection over union (IoU) between teacher-created pseudo-boxes and student-created box candidates to assign background and foreground labels, as an ordinary object detection model does. In that case, some foreground boxes are incorrectly classified as negative, which reduces performance.
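The foreground-score filtering described above can be sketched as a one-line filter; the function name is illustrative, and the 0.9 default is the threshold the experiments found to work best:

```python
def filter_pseudo_boxes(boxes, scores, threshold=0.9):
    """Keep only teacher-predicted boxes whose foreground score clears
    the threshold. A higher threshold raises precision of the kept
    pseudo-boxes at the cost of recall."""
    return [b for b, s in zip(boxes, scores) if s >= threshold]

# Of two teacher detections, only the confident one survives.
kept = filter_pseudo_boxes([[0, 0, 1, 1], [2, 2, 3, 3]], [0.95, 0.40])
```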
To eliminate this problem, we use the student-teacher network to generate pseudo-labels via a semi-supervised approach based on Mask R-CNN, which provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity; we then use these pseudo-labels together with a small portion, such as 1%, of labeled data to train the model. This unlabeled and labeled data sampling includes all classes present in the available data. Random samples of labeled and unlabeled images are selected using a sampling ratio s_r to make training batches. The teacher model uses unlabeled data to form pseudo-boxes, and the student model uses both the unlabeled data with the pseudo-boxes and the labeled data as ground truth for training. We assess the reliability that a student-created box candidate is real background and use it to weight the background-class loss. Equation (2) gives the total loss, which is the combination of the unsupervised and supervised losses:

L = L_sup + α L_un  (2)

where L_sup represents the supervised loss on labeled data, L_un represents the unsupervised loss on unlabeled data, and α is the controlling factor of the unsupervised loss. We normalize these losses by their respective numbers of floor-plan images in the training batch. The supervised and unsupervised losses are each a combination of classification, localization, and segmentation mask losses, as shown in Equations (3) and (4), respectively. The mask loss is explained in Equation (1), while the classification and localization losses are the same as in Faster R-CNN [25].
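Under the assumption that the per-image losses have already been computed, the batch-normalized combination of Equation (2) can be sketched as:

```python
def total_loss(sup_losses, unsup_losses, alpha):
    """Equation (2): supervised plus alpha-weighted unsupervised loss,
    each normalized by its own number of floor-plan images in the batch.

    sup_losses:   per-image losses on the labeled portion of the batch.
    unsup_losses: per-image losses on the unlabeled (pseudo-labeled) portion.
    """
    l_sup = sum(sup_losses) / max(len(sup_losses), 1)
    l_un = sum(unsup_losses) / max(len(unsup_losses), 1)
    return l_sup + alpha * l_un
```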
L_sup = (1/N_b) Σ_n ( L_class(I_b^n) + L_rg(I_b^n) + L_mask(I_b^n) )  (3)

L_un = (1/N_u) Σ_n ( L_class(I_u^n) + L_rg(I_u^n) + L_mask(I_u^n) )  (4)

where I_b^n represents the n-th labeled image, I_u^n represents the n-th unlabeled image, N_b indicates the total number of labeled images, N_u indicates the total number of unlabeled images, and L_class, L_rg, and L_mask are the classification, regression, and mask losses, respectively. Figure 4 shows the overall architecture of the student-teacher approach. We initialize the teacher and student models randomly to start training; then, the student model updates the teacher model, just like [2,61], using the exponential moving average (EMA) approach. Generating pseudo-labels for detecting objects is more challenging than for classifying objects, as an image typically has multiple objects, and annotating those objects requires both location and category. The teacher model takes unlabeled images, detects objects, and generates many bounding boxes. Non-maximum suppression (NMS) is applied to remove redundant boxes generated on the image objects. Even after eliminating most repeated boxes, some non-foreground boxes remain. FixMatch [2], a semi-supervised image classification approach, is used to obtain better pseudo-boxes and speed up student network training: we apply weak augmentation when generating pseudo-labels with the teacher network and strong augmentation when training the student network. Calculating the reliability score directly is difficult, so we use the background score generated by the teacher model under weak augmentation as a signal for the student model. This approach resembles simple negative mining, not hard negative mining such as OHEM [62] or Focal Loss [63]. To measure the consistency of regression boxes, we use a box-jittering approach in which we sample around teacher-generated pseudo-boxes b_k and refine the jittered boxes by feeding them into the teacher model to obtain refined boxes b̂_k, as follows:

b̂_k = filtered(jitter(b_k))  (5)

We repeat this process several times to obtain N_jitter filtered jittered boxes.
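The EMA update by which the student refreshes the teacher can be sketched per parameter as follows; the decay value here is illustrative, not the one used in the experiments:

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Exponential moving average update of the teacher from the student:
    teacher <- decay * teacher + (1 - decay) * student.
    A decay close to 1 makes the teacher a slowly moving, smoothed
    ensemble of past student weights."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```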
The localization reliability of a pseudo-box is determined from the regression variance of its refined jittered boxes as follows:

σ̄_n = σ_n / ( 0.5 (w(b_k) + h(b_k)) )  (6)

r(b_k) = 1 − (1/4) Σ_{n=1}^{4} σ̄_n  (7)

where σ̄_n is the normalization of σ_n, σ_n is the standard deviation of the n-th coordinate of the filtered jittered boxes, and w(b_k) and h(b_k) are the width and height of the jittered box b_k. The localization accuracy is higher when the regression variance of the box is smaller. However, it is not feasible to assess the regression variance of all box candidates during training. Thus, we compute reliability only for boxes whose foreground score is above 0.5, reducing the number of boxes from hundreds to around 16 per image and minimizing the computational cost.
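A sketch of this reliability computation, assuming the jittered boxes are given as (x_min, y_min, x_max, y_max) rows and that each coordinate's standard deviation is normalized by half the average box size (the exact normalization is an assumption of this sketch):

```python
import numpy as np

def box_reliability(jittered_boxes):
    """Localization reliability of a pseudo-box from its N_jitter refined
    jittered copies. Smaller regression variance across the jittered
    copies means a more reliable box location."""
    boxes = np.asarray(jittered_boxes, dtype=float)
    sigma = boxes.std(axis=0)            # std of each of the 4 coordinates
    mean_box = boxes.mean(axis=0)
    w = mean_box[2] - mean_box[0]
    h = mean_box[3] - mean_box[1]
    sigma_bar = sigma / (0.5 * (w + h))  # normalized per-coordinate std
    return 1.0 - sigma_bar.mean()        # higher value = more reliable
```

Identical jittered copies give a reliability of exactly 1.0, and any spread among them pushes the value below 1.0.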

Dataset
Deep neural network training needs a large dataset with various floor-plan layouts and enough classes to analyze variation in furniture items. The dataset, named SFPI (Synthetic Floor-Plan Images) [20], is created from SESYD [64]. It contains 16 furniture classes — window1, window2, door1, door2, sofa1, sofa2, sink1, sink2, sink3, sink4, table1, table2, table3, tub, armchair, and bed — placed in various rooms, which helps in generating more realistic results. The dataset has 10,000 floor-plan images built from 1,000 floor-plan layouts and contains around 300,000 furniture items across the 16 classes. Different types of augmentation create variation in the dataset. The first is rotation, with a random angle from [0, 30, 60, 90, 120, 180, 210, 270, 330]. Figure 5 shows that tub and sink furniture items have different orientations depending on the applied angle. The second is scaling, with a random scaling factor from [20, 40, 75, 100, 145, 185, 200]; during scaling, the aspect ratio is kept the same for all furniture classes. Figure 5 shows a sample image of this floor-plan dataset in which all furniture items are nearly the same size. The red and blue rectangular boxes demonstrate that the sink and tub classes can vary in orientation. Further, some furniture items appear only in particular rooms, which helps recognize room categories from the furniture items. The SFPI dataset is publicly available and can be downloaded from https://cloud.dfki.de/owncloud/index.php/s/mkg5HBBntRbNo8X, accessed on 29 August 2022.

Implementation Details
We used a Mask R-CNN [17]-based semi-supervised approach with ResNet-50 [19] and ResNet-101 backbones pre-trained on ImageNet [4] as the detection method. The training data contains 1%, 5%, and 10% of the floor-plan images as labeled data and the remaining images as unlabeled training data. We have five data folds for each setting and calculate the final performance as the average over all folds. Our methodology and hyper-parameters follow MMDetection [65]. For training, we used anchors with three aspect ratios and five scales, and formed 1k and 2k region proposals with a 0.7 non-maximum suppression threshold. We selected a total of 512 proposals from the 2k as box candidates for training the R-CNN head. The IoU threshold value is set to 0.5 for mask bounding boxes.

Partially Labeled Data
We performed training for 80k iterations on 8 GPUs (A100) with eight images per GPU. The initial learning rate is 0.01, reduced to 0.001 at 30k iterations and 0.0001 at 40k iterations. The momentum and weight-decay values are 0.9 and 0.0001, respectively. The data sampling ratio starts at 0.2 and then decreases to 0 over the last 5k iterations, and the foreground threshold is 0.9.
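The step learning-rate schedule described above can be written as a simple function of the iteration count:

```python
def learning_rate(iteration):
    """Step schedule from the partially-labeled setup: 0.01 initially,
    reduced to 0.001 at 30k iterations and 0.0001 at 40k iterations."""
    if iteration < 30_000:
        return 0.01
    if iteration < 40_000:
        return 0.001
    return 0.0001
```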
For selecting box-regression pseudo-labels, we set a threshold value of 0.02, and N_jitter is set to 10 to calculate the reliability of box localization. The jittered boxes are sampled by setting offset values for all coordinates, selecting the offsets from −6% to 6% of the width or height of the pseudo-box candidates. Moreover, different augmentations, following FixMatch [2], are used to generate pseudo-labels and train on the labeled and unlabeled data.

Fully Labeled Data
We performed 150k training iterations on 4 GPUs (A100) with eight images per GPU. The initial learning rate is 0.01, reduced to 0.001 at 30k iterations and 0.0001 at 40k iterations. The momentum is 0.9 and the weight decay is 0.0001. The data sampling ratio starts at 0.2 and then decreases to 0 over the last 15k iterations, and the foreground threshold is 0.9. We assigned N_jitter a value of 10 to estimate the box-localization reliability and a threshold value of 0.02 for selecting box-regression pseudo-labels.

Evaluation Criteria
We used several detection evaluation metrics to evaluate the performance of the semi-supervised floor-plan object detection approach. This section explains the evaluation metrics used.

Intersection over Union
We calculate the intersection over union (IoU) in Equation (8) as the area of intersection divided by the area of union of the ground-truth box A_g and the generated bounding box A_p:

IoU = (A_p ∩ A_g) / (A_p ∪ A_g)  (8)
IoU is used to estimate whether a detected object is false positive or true positive.
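Equation (8) for axis-aligned (x_min, y_min, x_max, y_max) boxes can be implemented directly:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max), as in Equation (8)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # 0 if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection is then counted as a true positive when its IoU with a ground-truth box clears the chosen threshold (0.5 or 0.75 in our evaluation).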

Average Precision
We calculate the average precision (AP) using the precision-recall curve. It is the area under the precision-recall curve and can be determined using the following Equation (9):

AP = Σ_k (R_k − R_{k−1}) · P_k  (9)

where R_1, R_2, . . ., R_k are the values of the recall parameter (with R_0 = 0) and P_k is the precision at recall R_k.
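The summation in Equation (9) translates directly into code, assuming the recall values are given in increasing order with their matching precisions:

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve, summed as
    AP = sum_k (R_k - R_{k-1}) * P_k, with R_0 = 0 (Equation (9))."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p  # width of the recall step times its precision
        prev_r = r
    return ap
```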

Mean Average Precision
The mean average precision (mAP) is the most common metric for evaluating the performance of object detection methods. We calculate it by taking the mean of the average precision over all classes of the dataset; for the floor-plan dataset, the mAP is computed over its 16 classes. The overall mAP depends on the per-class values, where a slight change in the performance of one class can affect the overall mAP; that is the main drawback of mAP. We set IoU threshold values of 0.5 and 0.75 to calculate the mAP, as shown in Equation (10):

mAP = (1/S) Σ_{s=1}^{S} AP_s  (10)

where S is the total number of classes; for our floor-plan dataset, S = 16.

Results and Discussion
We use a Mask R-CNN [17]-based semi-supervised network on the floor-plan dataset. This section explains the qualitative as well as quantitative results of the student-teacher network. For our experiment, we take 1%, 5%, and 10% of the floor-plan images as labeled data and the rest of the floor-plan images as unlabeled data. We have five data folds for each setting and calculate the final performance as the average over all folds. We train and evaluate the approach on Faster R-CNN [25] and Mask R-CNN [17]. Furthermore, we also compare the algorithm's performance with ResNet-50 [19] and ResNet-101 backbones on Mask R-CNN [17]. Figure 6 shows the average precision of every class separately. Some classes, such as armchair, door2, table3, bed, table1, tub, door1, sink2, and table2, reach an average precision of 1.0, while all other classes except window1 show an average precision above 0.95. From this, we can observe for which classes our model performs well and where further improvement is needed. Figure 7 shows furniture-item detection and localization on the floor-plan test set; the final result, with furniture items detected and labeled in different colours, accurately covers all 16 classes. Using different backbone networks, we determine the relative error between the Mask R-CNN [17] and Faster R-CNN [25] detectors. Table 1 compares these detectors with ResNet-50 [19] and ResNet-101 backbones on the floor-plan dataset under the semi-supervised setting. Mask R-CNN decreases the error by 8.94%, 16%, and 37.5% with the ResNet-50 backbone and by 29.4%, 50%, and 50% with the ResNet-101 backbone for 1%, 5%, and 10% labeled data, respectively. Using 1% labeled data, we obtain 98.8% mAP on Mask R-CNN with the ResNet-101 backbone, which demonstrates that this approach provides the best results with a small amount of labeled data.
This comparison also demonstrates that the ResNet-101 backbone provides better results than the ResNet-50 [19] backbone for both detectors. We also study the effect of hyper-parameters on model performance. The first hyper-parameter is the jittered-box count used to calculate the localization reliability of pseudo-boxes. Table 2 compares the performance under different numbers of jittered boxes. Setting the jittered-box value to 10 gives a mAP of 99.6%, with AP_0.5 and AP_0.75 of 99.8% and 99.7%, respectively. Table 2 shows (in bold) that the model gives the highest accuracy when N_jitter is 10. We apply intersection over union (IoU) between the teacher-created pseudo-boxes and the student-created box candidates to assign background and foreground labels, as an ordinary object detection model does. In that case, some foreground boxes are incorrectly classified as negative, reducing performance. Table 3 studies the box regression-variance threshold. We obtain the best results (shown in bold) by setting the threshold value to 0.02. A higher threshold value provides good foreground precision, but the recall of box candidates decreases quickly. Figure 8 shows a test image where some furniture items are misclassified. The network confuses window1 with window2: the green box wrongly detects two windows, as one window is labeled window2. The window1 and window2 objects are small compared to all other floor-plan objects, and the detection performance for such small objects, where the background occupies 95% of the image area, can be improved further. Table 4 compares our semi-supervised network's performance with previously presented semi-supervised approaches, averaged over five data folds with 1%, 5%, and 10% labeled floor-plan data. For supervised training on Mask R-CNN, we used just 1%, 5%, and 10% labeled data for training.
This mask-aware semi-supervised training gives 98.8% mAP with just 1% of the labels, as this dataset is formed by applying the different augmentation approaches explained in Section 4. This behavior can also be observed in other semi-supervised approaches, as they also give high mAP with just 1% labeled data. Table 4 shows (in bold) that our Mask R-CNN-based semi-supervised approach outperforms the previous semi-supervised approaches. Table 5 compares our semi-supervised network's performance over five data folds with 10% labeled data on Faster R-CNN [25] and Mask R-CNN [17] against previously presented supervised approaches. We cannot directly compare the results of Ziran et al. [68] because of the different datasets. Table 5 shows that the semi-supervised approach outperforms the previous supervised approaches using just 10% of the labeled data. Table 5. Previously presented supervised detectors compared with our semi-supervised approach, trained on the floor-plan dataset using Mask R-CNN [17] and Faster R-CNN [25] with the ResNet-101 [19] backbone on 10% labeled data. * We cannot directly compare the results of Ziran et al. [68] because of the different datasets.

Conclusions and Future Work
We examine the capabilities of the semi-supervised approach for detecting objects in floor-plan data. It pulls information from the teacher network and feeds it to the student network. The teacher model uses unlabeled data to form pseudo-boxes, and the student model uses both the unlabeled data (with the pseudo-boxes) and the labeled data as ground truth for training. On the Mask R-CNN [17] detector with the ResNet-101 backbone, the proposed approach achieves 98.8% mAP, 99.7% mAP, and 99.8% mAP with 1%, 5%, and 10% labeled data, respectively. The results show that strong performance can be obtained using just 1% labeled data. Furthermore, this experiment can be applied in various floor-plan applications such as floor-plan text generation and furniture fitting, helping visually impaired people to analyze house designs and customers to buy a house online. Earlier, all these applications used supervised learning approaches [68,69] for floor-plan object detection. With our experiment, it is clear that a semi-supervised [16] approach gives better results for these applications.
In the future, the Mask R-CNN [17]-based semi-supervised floor-plan detection system can be improved in different ways. We can add text information to detect room types, especially rooms that are not physically separated, like a dining hall attached to the kitchen. We can also label rooms according to their functionality. Further research on training with noisy labels and on uncertainty estimation are also important topics for boosting the efficiency of semi-supervised object detection.