A Hybrid Image Segmentation Method for Accurate Measurement of Urban Environments

Kim, Hyungjoon; Lee, Jae Ho; Lee, Suan

doi:10.3390/electronics12081845


by Hyungjoon Kim ¹, Jae Ho Lee ² and Suan Lee ¹,*

¹ School of Computer Science, Semyung University, Jecheon 27136, Republic of Korea

² Department of Landscape Architecture, University of Seoul, Seoul 02504, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2023, 12(8), 1845; https://doi.org/10.3390/electronics12081845

Submission received: 22 March 2023 / Revised: 10 April 2023 / Accepted: 11 April 2023 / Published: 13 April 2023

(This article belongs to the Section Artificial Intelligence)


Abstract: In urban environment analysis research, image segmentation technology that groups important objects in an urban landscape image at the pixel level has attracted increasing attention. However, this technology requires a dataset consisting of a very large number of image and label pairs. In most cases, a model trained on a different dataset with similar characteristics is therefore used for analysis, and as a result, the quality of segmentation is poor. To overcome this limitation, we propose a hybrid model that leverages the strengths of each model in predicting specific classes. In particular, we first introduce a pre-processing operation to reduce the differences between the collected urban dataset and the public dataset. Subsequently, we train several segmentation models with the pre-processed dataset and then fuse their segmentation results, based on a weight rule, into one segmentation map. To evaluate our proposal, we collected Google Street View (GSV) images that do not have any labels and trained models using the cityscapes dataset, which contains foregrounds similar to those of the collected images. We quantitatively assessed performance using the cityscapes dataset with ground truths and qualitatively evaluated the results of GSV data segmentation through a user study. Our approach outperformed existing methods and demonstrated the potential for accurate and efficient urban environment analysis using computer vision technology.

Keywords: urban environment analysis; streetscapes; image segmentation; hybrid model; deep learning

1. Introduction

Semantic segmentation is a major task in computer vision whose purpose is to group similar regions of an image. Since digital images are composed of pixels, image segmentation techniques segment an image by predicting the class of every pixel. As segmentation technologies based on convolutional neural networks have been actively researched in recent years, they have been adopted in fields such as autonomous driving and medical imaging. These technologies must predict the shape of objects while also detecting their location in an image. Because image segmentation can extract information that is important for urban environment analysis from landscape images, such as the amount of green space, foreground openness, and the proportion of roads, buildings, and sidewalks in a scene, urban environment research is also paying attention to this technology [1,2]. The technique is extensively employed in research that aims to enhance human activities in urban settings, such as assessing the walking environment on streets, gauging the greenery of trees, or identifying areas with high crime rates [2,3]. Although no algorithm yet recognizes objects and partitions regions with complete accuracy, the technique is widely used because it can quickly and effectively process large-scale data that would be challenging and expensive to examine individually in the field [4,5].

Recent studies have also utilized semantic segmentation to analyze images viewed from above, such as aerial or satellite images [6]. However, more recent studies aim to identify visual characteristics from the human point of view, such as Google Street View (GSV) [7,8]. As a result, efforts are being made to classify pedestrian-friendly streets by measuring the amount of greenery on the roadside in pixel units, following recent research that suggests greenery in street environments positively affects pedestrian experience [9,10]. Furthermore, researchers are attempting to prove the effect of greenery by comparing the results with actual field survey data [11,12].

Most prior studies on semantic segmentation of collected GSV data employ pre-trained models using cityscapes datasets [13]. However, the accuracy of these models is suboptimal due to the differences in image characteristics between the data used for analysis and the data used for training [11]. Firstly, the captured image’s location differs, resulting in variations in the overall scene, such as the building’s surrounding shape and the road structure, compared to the trained data. Secondly, the raw data of different image sizes and aspect ratios from the existing dataset are subjected to cropping or reshaping, reducing accuracy. Furthermore, accuracy is low since the class required for calculating the green area ratio is different from the class provided in cityscapes [12]. In other words, the gap between the published dataset and the GSV data leads to low accuracy. Therefore, more accurate techniques are needed to segment unlabeled GSV images and identify pedestrian-friendly streets.

Semantic segmentation widely uses Intersection over Union (IoU) as an evaluation metric. For a given class, IoU is the ratio between the number of pixels assigned to that class in both the prediction and the ground truth (the intersection) and the number of pixels assigned to that class in either of them (the union). In general, the higher the value, the more accurate the segmentation is considered. Each image segmentation model has distinct characteristics, and its accuracy may vary for a specific class. Therefore, even if a model's average IoU is high, its IoU for certain classes may be lower than that of other models. In research fields dealing with time series data, there are cases of building hybrid models that actively adopt the advantages of several models [14,15]. Based on these cases, in this paper, we propose a hybrid segmentation method to accurately predict unlabeled GSV images by exploiting the unique strengths of different segmentation models. Specifically, we built a hybrid model using the SegNet [16,17] and DeepLabv3+ [18] models. In general, DeepLab has a higher IoU than SegNet, but SegNet's results are often more accurate than DeepLab's for the contours of relatively small and complex objects such as people and cars. On the other hand, DeepLab has high segmentation accuracy for large objects such as roads, buildings, and the sky. Our hybrid model leverages these characteristics by adopting SegNet's results when predicting cars and people and DeepLab's results when predicting roads, buildings, and the sky. We also employ a weighted sum method to improve the accuracy of classes for which the two models show similar performance. Moreover, instead of using a publicly available pre-trained model, we pre-processed the cityscapes dataset to closely resemble the GSV dataset and re-trained the models for analysis.
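As a concrete illustration of the IoU metric, the following minimal NumPy sketch computes, for each class, the intersection of predicted and ground-truth pixels divided by their union. The function name and the toy label maps are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class; NaN where a class is absent from both maps."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, target_c).sum() / union
    return ious

# Toy 2 x 3 label maps with three classes (made-up values).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])

ious = per_class_iou(pred, target, num_classes=3)  # [0.5, 0.75, 1.0]
mean_iou = np.nanmean(ious)                        # 0.75
```

In this toy case the mean IoU is 0.75 even though class 0 reaches only 0.5, which is the kind of per-class gap that a hybrid of several models is meant to address.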

The main contributions of this paper are as follows:

(1) We propose a hybrid model that combines the results of multiple segmentation models to accurately segment unlabeled GSV (Google Street View) images.
(2) We adopt a novel approach to training the segmentation model by completely re-training it after pre-processing the cityscapes dataset in a manner that closely resembles GSV data. This approach differs from the conventional method of using a pre-trained model.
(3) We enhance the accuracy of the segmentation model by using a weighted sum approach for classes that exhibit similar performance in the two models.

These contributions enable the development of more effective and efficient techniques for analyzing urban environments using GSV images.

The remainder of this paper is organized as follows: Section 2 provides a review of related studies, while Section 3 provides a detailed description of our proposal. Section 4 presents the experiments and results used to evaluate our proposed approach, and Section 5 concludes this paper.

2. Related Works

2.1. Green Area Measuring

Recent studies have increasingly applied deep learning techniques to analyze green spaces in urban areas, predominantly utilizing Google Street View (GSV) imagery to determine the amount of green space in each image. Researchers such as Li et al. [19], Lu et al. [20], Seiferling et al. [21], Wang et al. [22], and Yin and Wang [3] have demonstrated the potential of GSV for evaluating the tree cover of streets in Manhattan by computing the Green View Index (GVI) of each image from the pixels occupied by plants. However, most studies using GSV have only measured GVI without distinguishing among trees, shrubs, and lawns.

Some recent studies have attempted to overcome these limitations by classifying different tree species. Zarrin developed a new strategy for detecting various tree species based on their leaves [23]. Furthermore, Sun et al. proposed a method for classifying the type of vegetation (i.e., tree, low-lying vegetation, grass) in street view images [24]. More recently, Choi et al. utilized a semantic segmentation algorithm and graphical analysis to estimate tree profile parameters by determining the relative location of the interface of trees and the ground surface [12]. However, these studies still face limitations as they fail to properly consider the morphological and phenological characteristics of each tree species.

To compute the amount of green vegetation in a given area, researchers have employed two primary methods: (1) color band methods, which extract information based on pixel color, and (2) semantic segmentation techniques, which distinguish between natural greenery and non-vegetated surfaces, such as buildings and roads. The resulting GVI is typically expressed as a percentage or numerical score that reflects the proportion of green vegetation in a specific area.
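For the segmentation-based route, the percentage form of GVI can be sketched as the share of pixels labeled as vegetation in a segmentation map. The class id and the label map below are hypothetical placeholders, and individual studies may define GVI with additional steps.

```python
import numpy as np

def green_view_index(label_map, vegetation_ids):
    """Percentage of pixels whose class id is one of vegetation_ids."""
    green = np.isin(label_map, list(vegetation_ids))
    return 100.0 * green.mean()

PLANT = 5  # hypothetical class id for vegetation

# Toy 2 x 4 segmentation map: 3 of the 8 pixels are vegetation.
label_map = np.array([[5, 5, 1, 1],
                      [5, 0, 1, 1]])

gvi = green_view_index(label_map, {PLANT})  # 37.5
```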

2.2. Image Segmentation

Long et al. [25] proposed a method for semantic segmentation using fully convolutional networks, which replaced the fully connected layer used in general CNN models for image classification with a convolutional layer for pixel-level classification. They also introduced a skip layer to improve accuracy during the up-sampling process. Similarly, Badrinarayanan et al. [16,17] developed SegNet, which combines the advantages of DeconvNet [26] and U-Net [27]. SegNet uses pooling indices instead of copying and cropping the entire feature to improve memory efficiency, and removes the fully connected layer used in DeconvNet to reduce parameterization.
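The idea of pooling indices can be illustrated with a small NumPy sketch: the encoder records which position in each 2 by 2 window held the maximum, and the decoder places each pooled value back at that position instead of storing the full feature map. This is a simplified single-channel illustration with hypothetical function names, not the SegNet implementation itself.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling with stride 2 on an (H, W) map with even H and W."""
    h, w = x.shape
    # Gather each 2x2 window into the last axis: shape (H/2, W/2, 4).
    windows = (x.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    indices = windows.argmax(axis=-1)  # position of the maximum in each window
    pooled = windows.max(axis=-1)
    return pooled, indices

def max_unpool(pooled, indices):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    np.put_along_axis(windows, indices[..., None], pooled[..., None], axis=-1)
    return (windows.reshape(ph, pw, 2, 2)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * 2, pw * 2))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [9, 0, 1, 1],
              [0, 0, 1, 2]])

pooled, indices = max_pool_with_indices(x)  # pooled == [[4, 5], [9, 2]]
restored = max_unpool(pooled, indices)      # maxima return to their original positions
```

Only one small integer per window has to be kept for the decoder, which is where the memory saving comes from.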

Another notable method is DeepLab [28], proposed by Chen et al., which uses atrous convolution-based semantic segmentation architecture and atrous spatial pyramid pooling (ASPP) [18,29,30] to improve the architecture. The authors proposed Panoptic DeepLab [31], which transforms the ASPP and decoder of DeepLabv3+ [18] into a dual form, with each decoder producing semantic and instance information as outputs for panoptic segmentation.
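To make the effect of atrous (dilated) convolution concrete, the sketch below applies the same three-tap kernel to a one-dimensional signal with two different rates. It is a simplified 1-D illustration with a hypothetical helper name; DeepLab applies the same principle to 2-D feature maps.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D correlation whose kernel taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # receptive field of one output value
    n_out = len(signal) - span + 1
    return np.array([
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(n_out)
    ])

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])

dense = atrous_conv1d(signal, kernel, rate=1)    # each output covers 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # each output covers 5 samples
```

With rate 2, the same three weights cover five input samples, so the receptive field grows without adding parameters.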

Cheng et al. proposed Mask2Former [32] for universal image segmentation, which modifies the architecture of vision transformer [33] by applying transformer [34] and BERT [35] to computer vision. Mask2Former extracts localized features by constraining cross-attention within predicted mask regions. Kabilan et al. [36] improved segmentation accuracy and reduced complexity by using a three-step segmentation process involving the analysis of key components, mapping of similar objects in a faster way, and segmenting similar areas through color mapping.

Semi-supervised learning is often used to solve the problem of insufficient labeled data, but it has a lazy mimicking problem. To address this issue, Huo et al. [37] proposed ATSO, a model that partitions unlabeled training data into two subsets and alternately uses one subset to fine-tune the model, updating labels on the other subset.

In this study, we aim to accurately segment unlabeled images using SegNet and DeepLabv3+ due to their relatively low parameterization and widespread use in image segmentation applications.

2.3. Hybrid and Fusion Scheme

There have been numerous studies on applying hybrid and fusion methods for the segmentation of urban and satellite images. Li et al. proposed a hybrid convolutional network (HCN) comprising U-Net and VGG sub-networks, which was applied for road segmentation [38]. Wang et al. developed a remote sensing image segmentation method using a hybrid method (division and merge) [39]. Khoshboresh et al. proposed a novel hybrid method that combines deep convolutional neural networks and a restricted Boltzmann machine (RBM) to take advantage of the semantic segmentation of high-resolution airborne imagery for automatic building detection [40]. Sun et al. proposed a novel RGB and thermal data fusion network called FuseSeg, which achieved superior performance in the semantic segmentation of urban scenes [41]. Khan et al. developed a hybrid deep learning model that combines the benefits of two deep models, i.e., DenseNet and U-Net [42]. Niu et al. proposed a novel attention-based framework named hybrid multiple attention network (HMANet) that adaptively captures global correlations from the perspective of space, channel, and category in a more effective and efficient manner [43]. Abdollahi et al. introduced two new deep convolutional models, the multilevel context gate UNet (MCg-UNet) and the bidirectional ConvLSTM UNet model (BCL-UNet), based on the UNet family for multi-object segmentation such as roads and buildings in aerial images [44]. Chen et al. presented a pipeline of hybrid supervision that designs auxiliary segmentation models using boundary box attention modules and boundary box filter modules [45].

Various deep learning models have been proposed to address the problem of semantic image segmentation, leveraging multiple information sources to achieve improved performance. For instance, Zhang et al. presented a hybrid deep neural network that combines a transformer and CNN for the semantic segmentation of very high-resolution remote sensing imagery [46]. Another study by Luo et al. introduced a hybrid convolutional neural network (H-ConvNet) to improve urban land cover mapping with MSR Sentinel-2 images [47]. Li et al. proposed a novel hybrid contrastive regularization (HybridCR) framework in a weakly supervised setting, which obtained competitive performance compared to its fully supervised counterpart [48]. Hossain et al. proposed a hybrid segmentation method with modifications such as using the reference polygon to identify optimal parameters and a donut-filling technique to reduce over-segmentation caused by roof elements and illumination differences [49].

Other models have leveraged multimodal fusion to achieve optimal joint predictions. For example, Valdez-Rodríguez et al. proposed a hybrid 2D-3D CNN architecture capable of obtaining semantic segmentation and depth estimation simultaneously [50]. Wang et al. presented a Bilateral Awareness Network that fully captures long-range relationships and fine-grained details in Very Fine Resolution (VFR) images using a dependency path and a texture path [51]. Men et al. proposed a novel model called Concatenated Residual Attention UNet (CRAUNet), which combines the residual structure and channel attention mechanism [52]. Another study by Wang et al. introduced a Transformer-based decoder and constructed a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation [53]. Finally, to take advantage of both CNN and Transformer, a novel Adaptive Enhanced Swin Transformer with U-Net (AESwin-UNet) was proposed for remote sensing segmentation [53,54].

3. Proposed Method

This paper presents a novel hybrid model designed to accurately segment unlabeled Google Street View (GSV) image data for measuring the green area ratio. To achieve this, the classes necessary for green area measurement were defined as cars, people, sky, buildings, roads, sidewalks, and plants, while all other classes were treated as background. Figure 1 provides an overview of the proposed technique, which begins with pre-processing to minimize the differences between the cityscapes dataset and the collected GSV image dataset. In this work, image size and aspect ratio, which are the biggest differences between the two datasets, were unified through image cropping. Next, independent SegNet and DeepLabv3+ models were trained using the pre-processed data. Input images were then given to each of the two models, and the car, people, and plant classes were extracted from the SegNet branch, while the road, sidewalk, plant, sky, and building classes were extracted from the DeepLabv3+ branch. The plant class was extracted from both models, and the final class was obtained using a weighted sum approach. In the following sections, each method is described in detail, including the structural features of SegNet and DeepLabv3+ in relation to our dataset.

3.1. Pre-Processing

In this study, Yongsan-gu, Seoul, South Korea, was chosen as the area of interest due to its numerous alleys and the widely distributed street environment around Yongsan Park. To identify pedestrian-friendly roads in this area, we collected 21,100 GSV images of Yongsan-gu. The collected data consisted of pure images without labels, and each image had a size of 256 by 256. To train and test models suited to this data, we performed pre-processing. We utilized the cityscapes dataset, an image segmentation dataset with a total of 5000 image-label pairs and an image size of 2048 by 1024, to identify the classes required for green area measurement, such as plants, sky, and buildings. We also modified the size and aspect ratio of the cityscapes images to match the collected GSV images. To achieve this, we used the sliding window method, as shown in Figure 2a, to crop the images with a stride of 32 pixels. By applying this method, we obtained a total of 2048 cropped images from one original image, as shown in Figure 2b. However, since the window size was much smaller than the original image, there were cases where only a specific object was included in a scene, as shown in Figure 2c. To address this issue, we resized the original image to ½ and ¼ sizes and repeated the cropping process. In the case of ½ size, the image is 1024 by 512 in width and height. If the above operation is repeated on this image under the same conditions, 512 additional images can be obtained. Similarly, in the case of ¼ size, 128 additional images can be obtained. As a result, a total of 2688 cropped images are created from one cityscapes image. After obtaining the cropped images, we filtered out images that contained meaningless information. As shown in Figure 2c, if the proportion of the background was over 80% or the proportion of one of the sky, road, sidewalk, or building classes was over 80%, we excluded the image from the dataset. Following the filtering process, we constructed a new dataset consisting of a total of 2,066,043 images, which were split into 80% for training and 20% for testing. Figure 3 shows some examples from the new dataset. Figure 3a shows example images, and Figure 3b shows the corresponding labels.
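The crop counts quoted above can be reproduced with a short sketch. It assumes a 256 by 256 window (the size of the GSV images) and counts one crop per stride position along each axis; both are assumptions about how the figures were obtained rather than details stated in the text, and the helper names are hypothetical.

```python
import numpy as np

def crops_per_image(width, height, stride=32):
    """Number of stride positions along each axis, multiplied together."""
    return (width // stride) * (height // stride)

def sliding_window_crops(image, window=256, stride=32):
    """Yield window x window crops that lie entirely inside the image."""
    h, w = image.shape[:2]
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            yield image[top:top + window, left:left + window]

# Full, 1/2, and 1/4 resolution of one cityscapes image (width, height).
scales = [(2048, 1024), (1024, 512), (512, 256)]
counts = [crops_per_image(w, h) for w, h in scales]  # [2048, 512, 128]
total = sum(counts)                                   # 2688
```

If only windows lying entirely inside the image are kept, as in `sliding_window_crops`, the full-resolution image yields fewer crops (57 × 25), so the quoted figures presumably rely on some form of border handling such as padding.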

In summary, this study utilized GSV data from Yongsan-gu, Seoul, South Korea, and pre-processed the data using the sliding window method and filtering to construct a new dataset for the purposes of training and testing our models.
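The filtering step described in this section can be sketched as follows. The sketch reads the rule as discarding a crop when any single one of the background, sky, road, sidewalk, or building classes covers more than 80% of its pixels; this reading, the class ids, and the function name are assumptions made for illustration.

```python
import numpy as np

# Hypothetical class ids, for illustration only.
BACKGROUND, SKY, ROAD, SIDEWALK, BUILDING, PLANT = 0, 1, 2, 3, 4, 5
DOMINANT_CLASSES = (BACKGROUND, SKY, ROAD, SIDEWALK, BUILDING)

def keep_crop(label_crop, threshold=0.8):
    """Return False when one of the listed classes covers more than `threshold`."""
    for class_id in DOMINANT_CLASSES:
        if (label_crop == class_id).mean() > threshold:
            return False
    return True

# A crop that is 95% sky is discarded.
mostly_sky = np.full((10, 10), SKY)
mostly_sky[0, :5] = ROAD

# A crop split evenly between road and building is kept.
mixed = np.full((10, 10), ROAD)
mixed[5:] = BUILDING

kept = [keep_crop(mostly_sky), keep_crop(mixed)]  # [False, True]
```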

3.2. Base Models

This section discusses the SegNet and DeepLabv3+ models used in the hybrid model constructed for image segmentation. SegNet is a symmetric image segmentation model with equal depth in both the encoder and decoder. It reduces the number of parameters while maintaining accuracy through the use of pooling indices. Typically, VGG16 is used as the backbone for SegNet, and this paper also employs VGG16 as the backbone for the SegNet model. On the other hand, DeepLabv3+ is an encoder-decoder model that expands the receptive field by using atrous convolution, and reduces computation by using depthwise and pointwise convolutions. It is also known for its effectiveness in capturing global features using atrous spatial pyramid pooling (ASPP). In this study, we utilized Inception ResNetv2 as the backbone for the DeepLabv3+ model.

We trained the initialized SegNet and DeepLabv3+ models on the pre-processed cityscapes dataset, treating all classes except for the 7 specified classes as background. The hyperparameters used for training the model were as follows: a mini-batch size of 128, a maximum epoch of 300, and an initial learning rate of 0.001, which was reduced by a factor of 0.3 every 30 epochs. Both models used weighted pixel-wise cross-entropy loss, with weights calculated using median frequency balancing.
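The learning rate schedule and the class weighting can be sketched as below. The step-decay function follows the hyperparameters listed above. The weighting function uses the commonly cited formulation of median frequency balancing, in which the weight of a class is the median class frequency divided by that class's frequency and each frequency is computed over the images that contain the class. Whether the authors used exactly this formulation is an assumption, and the function names and toy label maps are hypothetical.

```python
import numpy as np

def learning_rate(epoch, initial=0.001, factor=0.3, step=30):
    """Step decay: multiply the rate by `factor` once every `step` epochs."""
    return initial * factor ** (epoch // step)

def median_frequency_weights(label_maps, num_classes):
    """weight_c = median(freq) / freq_c, with freq_c taken over images containing c."""
    pixel_counts = np.zeros(num_classes)
    image_pixels = np.zeros(num_classes)
    for labels in label_maps:
        counts = np.bincount(labels.ravel(), minlength=num_classes)
        pixel_counts += counts
        image_pixels[counts > 0] += labels.size
    freq = np.divide(pixel_counts, image_pixels,
                     out=np.zeros(num_classes), where=image_pixels > 0)
    median = np.median(freq[freq > 0])
    # Classes that never occur keep a weight of zero.
    return np.divide(median, freq, out=np.zeros(num_classes), where=freq > 0)

# Toy label maps with three classes (made-up values).
label_maps = [np.array([[0, 0, 0, 1]]), np.array([[0, 0, 2, 2]])]
weights = median_frequency_weights(label_maps, num_classes=3)  # [0.8, 2.0, 1.0]

rates = [learning_rate(e) for e in (0, 29, 30, 60)]  # 1e-3, 1e-3, 3e-4, 9e-5
```

Rare classes receive weights above 1, so their pixels contribute more to the cross-entropy loss than pixels of dominant classes.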

Although DeepLabv3+ is known to have better overall segmentation accuracy, in the case of GSV image segmentation, SegNet performed better for specific classes. Specifically, the SegNet model outperformed DeepLabv3+ for the human and car classes. Table 1 provides the segmentation results of SegNet and DeepLabv3+ on the cityscapes dataset.

As presented in the table, DeepLabv3+ exhibits superior segmentation accuracy overall, with the exception of the human and car classes where SegNet performs better. This outcome is likely due to the size of the input image. DeepLabv3+ prioritizes expanding the receptive field and capturing global features using techniques such as atrous convolution and ASPP. While these methods substantially enhance overall scene comprehension and increase average IoU, they tend to lose shape information for small-sized objects. In contrast, SegNet’s pooling indices deliver information corresponding to all output strides to the decoder, which is beneficial for detecting precise object shapes. Both models yield similar segmentation accuracy for the plant class. Accordingly, a post-processing technique is employed to ensure accurate classification for this class, as elaborated in the subsequent section.

3.3. Hybrid Model

In an image segmentation model, the class of a specific pixel is defined as follows:

$$\mathrm{class}(i, j) = \underset{c}{\arg\max}\; \mathrm{outputs}(i, j, c)$$

Here, $\mathrm{outputs} \in \mathbb{R}^{H \times W \times C}$ represents the last layer of the model, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. For each channel, a value is assigned to each coordinate $(i, j)$ of the image, and the class of the corresponding pixel is the class of the channel with the highest value at that coordinate.
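In code, this per-pixel rule amounts to an argmax over the channel axis of the output tensor. The tiny tensor below uses made-up values purely for illustration.

```python
import numpy as np

# outputs has shape (H, W, C) = (1, 2, 3); the values are made up.
outputs = np.array([[[0.1, 0.7, 0.2],
                     [0.6, 0.3, 0.1]]])

class_map = outputs.argmax(axis=-1)  # shape (H, W); here [[1, 0]]
```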

In the previous section, we observed that SegNet and DeepLabv3+ exhibit varying segmentation accuracies across different classes. To address this issue, we proposed a hybrid model that leverages the strengths of both models to improve overall segmentation accuracy. The final segmentation map is generated using the following approach. Firstly, the pixels corresponding to building, road, sidewalk, and sky classes are filled based on the results of DeepLabv3+. Subsequently, the pixels corresponding to people and car classes are filled based on the results of SegNet. However, conflicts may arise between the two model outputs. For instance, SegNet may predict a specific pixel as a car class while DeepLabv3+ may predict it as a road class. In such cases, we use the following equation to select the pixel class.

$$
\begin{aligned}
w_{seg} &= \alpha_{11} \times \mathrm{outputs}_{seg}(i, j, c_{seg}) + \alpha_{12} \times \mathrm{outputs}_{DL}(i, j, c_{seg}) \\
w_{DL} &= \alpha_{21} \times \mathrm{outputs}_{seg}(i, j, c_{DL}) + \alpha_{22} \times \mathrm{outputs}_{DL}(i, j, c_{DL}) \\
\alpha_{11} &= \frac{IoU_{seg}(c_{seg})}{IoU_{seg}(c_{seg}) + IoU_{DL}(c_{DL})}, \quad
\alpha_{12} = \frac{IoU_{DL}(c_{seg})}{IoU_{seg}(c_{seg}) + IoU_{DL}(c_{DL})} \\
\alpha_{21} &= \frac{IoU_{seg}(c_{DL})}{IoU_{seg}(c_{seg}) + IoU_{DL}(c_{DL})}, \quad
\alpha_{22} = \frac{IoU_{DL}(c_{DL})}{IoU_{seg}(c_{seg}) + IoU_{DL}(c_{DL})}
\end{aligned}
\tag{1}
$$

Here, $c_{seg}$ and $c_{DL}$ are the classes predicted by SegNet and DeepLabv3+, respectively, $\mathrm{outputs}_{seg}$ and $\mathrm{outputs}_{DL}$ are the output feature maps of SegNet and DeepLabv3+, respectively, and $IoU_{seg}(c)$ and $IoU_{DL}(c)$ are the IoU scores of the two models for class $c$. Each $\alpha$ is the normalized weight of the corresponding class. The class corresponding to the larger of the two values $w_{seg}$ and $w_{DL}$ computed with this formula is selected as the final class. For the plant class, where both models showed similar accuracy, pixels predicted as plant by both models are designated as plant, and if one of the two models predicts a class other than plant, the final class is selected according to the above rule.
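The conflict rule of Equation (1) can be sketched in vectorized NumPy as follows. The sketch is a simplification under stated assumptions: it applies the rule at every pixel (where the two models agree, both candidates coincide, so the agreed class is kept), ties go to DeepLabv3+, and the per-class IoU values and output scores are made-up numbers rather than results from this paper. The function and variable names are hypothetical.

```python
import numpy as np

def fuse(outputs_seg, outputs_dl, iou_seg, iou_dl):
    """Select, per pixel, the candidate class with the larger weighted score."""
    c_seg = outputs_seg.argmax(axis=-1)  # class predicted by SegNet
    c_dl = outputs_dl.argmax(axis=-1)    # class predicted by DeepLabv3+

    def pick(outputs, classes):
        # Score that `outputs` assigns to `classes` at every pixel.
        return np.take_along_axis(outputs, classes[..., None], axis=-1)[..., 0]

    # Shared denominator of the four weights in Equation (1).
    denom = iou_seg[c_seg] + iou_dl[c_dl]
    a11, a12 = iou_seg[c_seg] / denom, iou_dl[c_seg] / denom
    a21, a22 = iou_seg[c_dl] / denom, iou_dl[c_dl] / denom

    w_seg = a11 * pick(outputs_seg, c_seg) + a12 * pick(outputs_dl, c_seg)
    w_dl = a21 * pick(outputs_seg, c_dl) + a22 * pick(outputs_dl, c_dl)
    return np.where(w_seg > w_dl, c_seg, c_dl)

# Hypothetical classes: 0 = road, 1 = car, 2 = plant; IoU values are made up.
iou_seg = np.array([0.80, 0.70, 0.75])
iou_dl = np.array([0.95, 0.60, 0.76])

# Two pixels where SegNet predicts car and DeepLabv3+ predicts road.
outputs_seg = np.array([[[0.30, 0.60, 0.10], [0.10, 0.80, 0.10]]])
outputs_dl = np.array([[[0.50, 0.40, 0.10], [0.45, 0.40, 0.15]]])

fused = fuse(outputs_seg, outputs_dl, iou_seg, iou_dl)  # [[0, 1]]
```

In the first pixel the weighted road score is larger, so the DeepLabv3+ prediction is kept; in the second pixel SegNet is confident enough that car is selected.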

4. Experiments

The main objective of this paper is to develop an accurate segmentation method for Google Street View (GSV) images without ground truths and to analyze the results for green area measurement. To assess the performance of the proposed method, we conducted two experiments.

First, we compared the performance of the proposed hybrid model with other existing models using the cityscapes test dataset. In this experiment, we evaluated the Intersection over Union (IoU) scores of the segmentation results and compared them with those of SegNet and DeepLabv3+. This experiment aimed to demonstrate the superiority of our proposed method over the existing models in terms of segmentation accuracy.

In the second experiment, we visually compared the segmentation results of the proposed method with those of the other models using the GSV dataset. To conduct this comparison, we used a user study in which several people evaluated and compared the segmentation results. This experiment aimed to provide a qualitative assessment of the performance of the proposed method.

4.1. Evaluation Results for Cityscapes Dataset

In Section 3.2, we compared the segmentation results of SegNet and DeepLabv3+. In this section, we focus on the segmentation results of our proposed hybrid model on the cityscapes dataset. We evaluate the segmentation results based on Intersection over Union (IoU) scores, and the results are summarized in Table 2.

The Hybrid column in Table 2 shows the IoU scores of our proposed hybrid model, while the Compare to SegNet and Compare to DeepLabv3+ columns indicate how much the IoU has changed when the hybrid model is compared to SegNet or DeepLabv3+, respectively. Here, the + and − signs indicate that IoU has increased or decreased compared to the SegNet or DeepLabv3+ results. The results demonstrate that our proposed hybrid model achieved improved IoU scores for most classes compared to SegNet and DeepLabv3+. However, for DeepLabv3+’s Background class, there was a slight decrease in IoU.

Our ultimate goal is to use these segmentation results to calculate the green coverage rate, and the improved IoU scores across classes make the hybrid model more suitable than the existing models for green coverage rate analysis. To provide a visual comparison of the segmentation results, we present some examples from the cityscapes dataset in Figure 4. The first row shows the original images, the second and third rows show the segmentation results of SegNet and DeepLabv3+, respectively, the fourth row presents the segmentation results of our proposed hybrid model, and the last row displays the ground truths. Note that ground truth contains all the classes provided by the cityscapes dataset.

Visually, we can observe that the segmentation results of our proposed hybrid model are most similar to the ground truths. The result of the hybrid model is similar to DeepLabv3+ with a high IoU overall, but it can be seen that the accuracy of some areas is noticeably improved. As a representative example, the accuracy of the person class in the second example has been greatly improved. Therefore, the improved segmentation performance of our hybrid model over the existing models, along with its suitability for green coverage rate analysis, demonstrates its potential for practical applications.

4.2. Evaluation Results for GSV Images

In this section, we present the results of our segmentation performance analysis for three methods (SegNet, DeepLabv3+, and our proposed hybrid technique) on Google Street View (GSV) images. As the GSV dataset lacks ground truth labels, we conducted a user study with 20 participants holding a bachelor's degree or higher. We selected 30 images for evaluation, and the participants were provided with the segmentation results of the three methods for each image.

The participants were asked to select the image with the best segmentation and the image with the second-best segmentation. We assigned five points to the best segmentation result, three points to the second-best result, and one point to the remaining result. Table 3 and Table 4 present the results of the user study.
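The scoring scheme can be expressed as a small tally. The responses below are made-up examples, not data from the user study, and the function name is hypothetical.

```python
from collections import Counter

# Points by rank, following the scheme described above.
POINTS = {1: 5, 2: 3, 3: 1}

def tally(rankings):
    """Sum the points of each method over rankings ordered from best to worst."""
    totals = Counter()
    for ranking in rankings:
        for rank, method in enumerate(ranking, start=1):
            totals[method] += POINTS[rank]
    return totals

# Made-up responses for three images.
rankings = [
    ("Hybrid", "DeepLabv3+", "SegNet"),
    ("Hybrid", "SegNet", "DeepLabv3+"),
    ("DeepLabv3+", "Hybrid", "SegNet"),
]

totals = tally(rankings)  # Hybrid: 13, DeepLabv3+: 9, SegNet: 5
```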

Table 3 shows the scores obtained by each method. Our proposed hybrid method achieved the highest score, followed by DeepLabv3+, while SegNet scored the lowest. The higher score of DeepLabv3+ compared to SegNet is due to its superior accuracy in analyzing classes such as roads, buildings, and sky.

Table 4 shows the proportion of the data for which each method obtained five points, three points, and one point. Our proposed hybrid method was selected as the most accurate segmentation (a score of 5) for approximately 77% of the data, and it was rated the least accurate (a score of 1) in only 3.3% of the cases. These results demonstrate that our proposed method outperforms the other two methods in terms of segmentation performance. Figure 5 shows some of the GSV image segmentation results. The resulting images contain some inaccurately predicted regions, such as walls and roads, but visually plausible segmentation was achieved for the important classes.

When combined with the quantitative evaluation, the user study results confirm the superiority of our proposed hybrid method in improving the segmentation performance both quantitatively and qualitatively. Therefore, we can conclude that our hybrid method is a promising approach for the segmentation of GSV images.

5. Conclusions

This paper presents a novel approach for accurately segmenting unlabeled Google Street View (GSV) image data in order to analyze green coverage in urban areas. The proposed method utilizes the cityscapes dataset for model training and applies a pre-processing step that makes the training data resemble the GSV image data. The segmentation accuracy is enhanced by training several models and combining their results in a hybrid form. The proposed method is quantitatively evaluated using the cityscapes test dataset, and the accuracy of GSV segmentation is assessed through a user study. The evaluation results demonstrate that the Intersection over Union (IoU) improves in most classes, and that our method achieves the most accurate segmentation according to human observers. The hybrid technique employed in our approach outperforms the single-model methods it was compared with. In future research, we will construct a hybrid model using more diverse and up-to-date models and use it to segment unlabeled images in finer detail. Additionally, based on this technique, we will analyze the green area ratio in order to construct and recommend optimized walking routes.

Author Contributions

Conceptualization, S.L.; Data curation, H.K.; Formal analysis, J.H.L.; Funding acquisition, H.K.; Investigation, J.H.L.; Methodology, S.L.; Project administration, S.L.; Resources, H.K.; Software, H.K.; Supervision, S.L.; Validation, J.H.L.; Visualization, H.K.; Writing—original draft, H.K.; Writing—review and editing, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the Semyung University Research Grant of 2022.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Overall flowchart.

Figure 2. Example of data pre-processing.

Figure 3. Sample of constructed new dataset.

Figure 4. Semantic segmentation results for the Cityscapes dataset.

Figure 5. Segmentation results for GSV images.

Table 1. Segmentation results for the Cityscapes dataset.

Evaluation metric: Intersection over Union (IoU)
Class	SegNet	DeepLabv3+
People	0.699	0.657
Car	0.681	0.658
Plant	0.702	0.706
Sky	0.705	0.792
Building	0.691	0.813
Road	0.705	0.796
Sidewalk	0.721	0.809
Background	0.760	0.814
Average	0.708	0.756
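For reference, per-class IoU is the number of pixels that both the prediction and the ground truth assign to a class, divided by the number of pixels that either assigns to it. The following is a minimal sketch on a toy label map; the function name `class_iou` and the example data are illustrative assumptions, not the evaluation code used for Table 1.

```python
def class_iou(pred, truth, cls):
    """Intersection over Union for one class on same-shaped 2D label maps."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred = p == cls
            in_truth = t == cls
            intersection += in_pred and in_truth
            union += in_pred or in_truth
    # The class is absent from both maps: IoU is undefined.
    return intersection / union if union else float("nan")


# Toy 2x3 example: one Sky pixel is mispredicted as Plant.
truth = [["Sky", "Sky", "Sky"], ["Plant", "Plant", "Road"]]
pred = [["Sky", "Sky", "Plant"], ["Plant", "Plant", "Road"]]

ious = {cls: class_iou(pred, truth, cls) for cls in ("Sky", "Plant", "Road")}
mean_iou = sum(ious.values()) / len(ious)
print(ious, mean_iou)
```

In this toy case the single error lowers the IoU of both Sky and Plant to 2/3, because the pixel counts against the class it was taken from and the class it was assigned to.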

Table 2. Segmentation results for the Cityscapes test dataset.

Evaluation metric: Intersection over Union (IoU)
Class	Hybrid	Change vs. SegNet	Change vs. DeepLabv3+
People	0.706	+0.007	+0.049
Car	0.717	+0.036	+0.059
Plant	0.715	+0.013	+0.009
Sky	0.813	+0.108	+0.021
Building	0.815	+0.124	+0.002
Road	0.801	+0.096	+0.005
Sidewalk	0.812	+0.091	+0.003
Background	0.808	+0.048	−0.006
Average	0.773	+0.065	+0.018
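The comparison columns of Table 2 are the hybrid IoU minus the corresponding single-model IoU from Table 1. The following is a short sketch that recomputes the per-class differences from the published three-decimal values; the variable names are illustrative.

```python
# Per-class IoU from Table 1: (SegNet, DeepLabv3+).
BASELINES = {
    "People": (0.699, 0.657),
    "Car": (0.681, 0.658),
    "Plant": (0.702, 0.706),
    "Sky": (0.705, 0.792),
    "Building": (0.691, 0.813),
    "Road": (0.705, 0.796),
    "Sidewalk": (0.721, 0.809),
    "Background": (0.760, 0.814),
}

# Per-class IoU of the hybrid model from Table 2.
HYBRID = {
    "People": 0.706, "Car": 0.717, "Plant": 0.715, "Sky": 0.813,
    "Building": 0.815, "Road": 0.801, "Sidewalk": 0.812, "Background": 0.808,
}

# Difference to each baseline, rounded to the precision of the tables.
deltas = {
    cls: (round(HYBRID[cls] - segnet, 3), round(HYBRID[cls] - deeplab, 3))
    for cls, (segnet, deeplab) in BASELINES.items()
}

improved_vs_segnet = sum(d[0] > 0 for d in deltas.values())
improved_vs_deeplab = sum(d[1] > 0 for d in deltas.values())
hybrid_mean = round(sum(HYBRID.values()) / len(HYBRID), 3)

print(deltas)
print(improved_vs_segnet, improved_vs_deeplab, hybrid_mean)
```

The per-class differences match Table 2: the hybrid model improves on SegNet in all eight classes and on DeepLabv3+ in seven, with Background as the exception. Subtracting the rounded averages gives 0.017 rather than the reported +0.018 for DeepLabv3+, which is consistent with the difference having been rounded after subtraction.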

Table 3. User study results.

Model	Total score
SegNet	58
DeepLabv3+	78
Ensemble (Ours)	134

Table 4. Percentage of images with each score for each model.

Model	Score 5	Score 3	Score 1
SegNet	10.00%	26.67%	63.33%
DeepLabv3+	13.33%	53.33%	33.33%
Ensemble (Ours)	76.67%	20.00%	3.33%
Each percentage is the number of images that received the given score divided by the total number of images.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
