Identifying a Slums’ Degree of Deprivation from VHR Images Using Convolutional Neural Networks

Abstract: In the cities of the Global South, slum settlements are growing in size and number, but their locations and characteristics are often missing in official statistics and maps. Although several studies have focused on detecting slums from satellite images, only a few captured their variations. This study addresses this gap using an integrated approach that can identify a slums' degree of deprivation in terms of socio-economic variability in Bangalore, India, using image features derived from very high resolution (VHR) satellite images. To characterize deprivation, we use multiple correspondence analysis (MCA) and quantify deprivation with a data-driven index of multiple deprivation (DIMD). We take advantage of spatial features learned by a convolutional neural network (CNN) from VHR satellite images to predict the DIMD. To deal with a small training dataset of only 121 samples with known DIMD values, insufficient to train a deep CNN, we conduct a two-step transfer learning approach using 1461 delineated slum boundaries as follows. First, a CNN is trained using these samples to classify slums and formal areas. The trained network is then fine-tuned using the 121 samples to directly predict the DIMD. The best prediction is obtained by using an ensemble non-linear regression model, combining the results of the CNN and models based on hand-crafted and geographic information system (GIS) features, with an R² of 0.75. Our findings show that using the proposed two-step transfer learning approach, a deep CNN can be trained with a limited number of samples to predict the slums' degree of deprivation. This demonstrates that the CNN-based approach can capture variations of deprivation in VHR images, providing a comprehensive understanding of the socio-economic situation of slums in Bangalore.


Introduction
Presently, the majority of people live in urban areas, and the UN estimates that the proportion of urban dwellers will increase from 54% in 2014 to 66% by 2050 [1]. Most of this urban growth is expected to happen in developing countries, in particular in Asia and Africa [1]. This challenges governments and planners in these countries, who often have insufficient resources to provide adequate housing and basic services to all inhabitants [2]. Urban poverty largely drives the emergence and expansion of slum areas, which offer sub-standard shelter for the growing urban population [3].
Slum dwellers constitute approximately one-quarter of the total urban population [4]. UN-Habitat [5] defines such areas as places deprived of at least one of the following five elements: (1) safe water, (2) proper sanitation, (3) durable housing, (4) tenure security, and (5) sufficient living space. However, the diversity of slums limits the codification of general particularities to characterize them. In contrast, [19] predicted deprivation using machine learning, but only focused on the living environment as one of the seven deprivation domains of the English index of deprivation [33]. To predict a slum index with four indicators, [7] used a regression model, but the index only covered the physical and financial domains of deprivation. Furthermore, these studies used administrative boundaries to set analytical units, so they analyzed a mixture of the poor and the wealthy in each unit and did not focus on the diversity of deprived areas. To capture the diversity of deprived areas, [10] identified four sub-categories using image-based features. However, these classes were broad, qualitative, and did not include details on socio-economic variations, which leaves room for further investigation.
This study aims to map variations of multi-dimensional deprivation among slum settlements, providing novel solutions for working in a data-poor environment. The first research question is: "How can heterogeneous surveyed data be meaningfully quantified and aggregated into the slums' degree of deprivation?" To answer this question, this study uses multiple correspondence analysis (MCA) and builds an index based on the Index of Multiple Deprivation (IMD) [34]. We refer to the index as the data-driven IMD (DIMD), since the indicators are selected based on the IMD and indicator values are aggregated based on data patterns and MCA. As an advantage over similar studies (e.g., [35]), our method has the potential to be more transferable and less prone to subjectivity. Moreover, collecting data from households is a resource-consuming process, which results in limited socio-economic data about slums. This causes a problem for deep learning models, as they typically need large datasets to be trained. Therefore, the second question is: "How can a deep CNN be trained to predict slums' deprivation from VHR satellite images based on limited training samples?" To address this question, a two-step CNN-based approach is performed: (I) a CNN is trained for a binary classification problem of "slum" and "formal areas" using the log-likelihood loss function; (II) the learned spatial features are used to predict the DIMD by fine-tuning the CNN with a small training set using the Euclidean loss function. Unlike other studies (e.g., [19]), we use the CNN for DIMD predictions within a unique framework trained end-to-end. Although this method is not a standard way of training and using a CNN, it can take advantage of the feature learning capability of deep learning models for prediction using very few samples. The few available studies on CNNs and slums (e.g., [29]) use small subsets and train on the majority of the area to predict a small testing area, which is unrealistic for real-world applications.
The following section explains the theoretical framework, available data to this study, and the methods used to analyze the data. Section 3 provides the results and Section 4 discusses the main lessons learned. Section 5 concludes on the utility of CNN-based models to predict the DIMD and suggests possible directions for further studies.

Materials and Methods
This study consists of two main steps ( Figure 1). First, it analyzes slums based on the concept of deprivation; it characterizes deprivation and processes available socio-economic data from a household survey (HH) and an in-situ quick scan (QS) survey. Second, it builds models based on image features to predict the DIMD. The final result is obtained using an ensemble regression model that combines the result of the CNN with principal component regression (PCR) models based on hand-crafted and GIS features.
The methodology is applied to the Indian city of Bangalore, with a population of more than 10 million and facing rapid slum growth [35,36]. Bangalore is known as the Silicon Valley of India and it is attracting a considerable amount of investment in the ICT sector [37]. However, citizens do not equally benefit from such investments and the growth of wealth has been accompanied by the growth of poverty and consequently the increase of slums [35].
In Bangalore, a wide range of slum settlements exists, from very temporary and worse-off to more permanent and formal-like. All these settlements can be grouped into two administrative categories: notified slums and non-notified slums. Non-notified slums are mostly worse-off, newer settlements, and they are not officially recognized by the government. The government provides basic services to notified slums as well as upgrading programs, which in some cases made them indistinguishable from formal areas [38]. This, together with the availability of remote sensing and reference data, makes it a suitable case for this study.

Figure 1. Methodology to predict slum data-driven index of multiple deprivation (DIMD) values from very high resolution (VHR) images. The first step starts with conceptualizing deprivation, followed by analyzing the household data (HH) and data from the quick scan (QS) survey using multiple correspondence analysis (MCA). The second step predicts the QS DIMD values from VHR images using a convolutional neural network (CNN)-based model and principal component regression (PCR) models. The ensemble model is built using the CNN-based model and the PCR models.

Conceptualizing Deprivation
Remote Sens. 2019, 11, 1282

Slums are settlements which are deprived in multiple dimensions, such as poor basic service provision and inadequate housing. To measure the degree of deprivation of such settlements, this study sees deprivation as a multi-dimensional phenomenon [34] covering a wide range of socio-economic and other aspects, which are essential to understanding variations of slum settlements [34]. In the literature, poverty and deprivation have been conceptualized in different ways, with a clear shift from one-dimensional approaches looking only at financial aspects to multi-dimensional approaches [39]. The multi-dimensional poverty index (MPI), introduced by [40], considers health, education, and living standard as the three relevant dimensions of poverty. In turn, [41] conceptualized multiple deprivations and defined poverty as the financial aspect of deprivation besides social, environmental, and institutional components. Therefore, regardless of using the term "poverty" or "deprivation", the importance of looking beyond the financial aspects has been widely emphasized. Aspiring to a broader understanding of deprivation, we adapt the deprivation framework from the IMD developed for Indian cities, which is based on the livelihoods approach [42] and covers four main domains of deprivation: financial, human, social, and physical capitals [34]. This framework focuses on households and the dwelling they live in but does not involve the context of the dwellings. Based on related studies and constructed indices (e.g., [2,10,14,43,44]), the contextual domain was added to the IMD-based deprivation framework to create a holistic picture of deprivation levels. The contextual domain involves indicators which look at the spatial neighborhood characteristics of the geographic area in which the dwelling is located, like accessibility to services or environmental characteristics. One reason that such a comprehensive framework is not used in related studies is the data availability needed to support it (e.g., [32]). Figure 2 shows the five domains of deprivation.
Figure 2. The five domains of deprivation, based on the IMD framework [34]. The contextual domain is added by this study to bring information about the spatial context. Remote sensing images and GIS layers can capture information about the physical and contextual domains only. Example indicators in each domain are as follows: income (financial domain), education and health (human domain), caste (social domain), construction materials (physical domain), accessibility to services (contextual domain).


Available Data
This study uses three main types of data: two sets of socio-economic data, a set of satellite images, and a set of GIS layers. The socio-economic data consists of a set of secondary and a set of primary data. A detailed survey from 1114 households living in 37 notified slums from 2010 (HH data; [45]) is provided by the DynaSlum project [46]. Based on the literature and experts' knowledge, this study selects 16 indicators (mostly categorical; each indicator contains a number of categories, see Supplementary Materials Section S1 for more details), measuring the five domains of deprivation (Table 1). In addition, the study uses delineated boundaries of 1461 slums from 2017, also provided by DynaSlum. Considering time and resource limitations as well as spatial coverage, primary data about 121 slums were collected. The study calls this primary data collection quick scan (QS) as it is designed in such a way that the surveyor goes to each of the 121 selected slums, observes and documents the surroundings from one location. In this way, the fieldwork covers physical and contextual domains of deprivation for 121 locations, collected within three weeks in August 2017. The dimensions of the QS survey are based on 35 categorical deprivation-related indicators extracted from the literature besides experts' consultation (Table 1) (see Supplementary Materials Section S2 for more details). The HH and QS data have 26 samples in common (almost 70% of the HH samples) with no significant physical change during the period of 2010 to 2017 (checked on Google Earth). The HH data, which includes indicators from all domains of deprivation, are essential to understand all deprivation components and their variations, while the QS data, which is an up-to-date survey, covers more slum settlements, which is required to build CNN-based models to predict deprivation. 
In addition to socio-economic data, the study uses four Pleiades pansharpened satellite images with a spatial resolution of 0.5 m, containing B, G, R, NIR bands, and zero percent cloud coverage, three from March 2016 and one from March 2015, also acquired within the DynaSlum project. Although one of the images was captured on a different date, it helps to have almost a full coverage of the city and is, therefore, used for the analysis. Figure 3 shows the location of slums and the coverage of satellite images.
Furthermore, the study obtains freely available GIS layers using open street map (OSM) data, to extract layers of land use and urban services. These data are not officially validated, though they provide extensive contextual information. Moreover, the study uses world elevation data deriving from multiple sources [47] and having a resolution of 11 m in Bangalore. The elevation data is publicly provided by the ESRI (Environmental Systems Research Institute; [48]).
The distance between two individuals i and i′ in the space of indicator categories is computed as:

d²(i, i′) = (1/Q) Σ_k (y_ik − y_i′k)² / p_k

where d(i, i′) is the distance between individuals i and i′, Q is the number of variables, p_k is the proportion of individuals having the category k; y_ik and y_i′k are 1 if the category k belongs to the individual i or i′ and 0 otherwise. Therefore, two individuals with exactly the same categories have a distance of zero and two individuals sharing many categories have a small distance. In other words, individuals with common categories are located around the origin of the point cloud and individuals with rare categories are located at the periphery. Thus, rarer categories are located farther away, and more common categories are gathered closer to the origin. Finally, this high-dimensional space is projected to a low-dimensional space, keeping as much variance as possible and the most important variables (which are called dimensions). This study refers to the first dimension created by MCA as the data-driven index of multiple deprivation (DIMD), which delivers a single deprivation value for each individual (i.e., a household in the case of the HH DIMD and a slum in the case of the QS DIMD). We only use the first dimension created by MCA as it represents the indicators with the highest variability among individuals. Furthermore, using a single value makes the result of the analysis more comprehensible.
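As an illustration of how a DIMD-like score can be derived, the sketch below computes the first MCA dimension as a correspondence analysis of the one-hot indicator matrix. This is a minimal NumPy sketch of the standard MCA procedure under our own assumptions, not the exact implementation used in the study, and the data are toy values.

```python
import numpy as np

def mca_first_dimension(categories):
    """First MCA dimension (a DIMD-like score) for a list of individuals,
    each given as a tuple of categorical answers. Implemented as
    correspondence analysis of the indicator (one-hot) matrix."""
    categories = [tuple(row) for row in categories]
    n = len(categories)
    q = len(categories[0])                      # number of variables
    # One column per (variable, category) pair.
    cols = sorted({(j, row[j]) for row in categories for j in range(q)})
    col_index = {c: k for k, c in enumerate(cols)}
    Z = np.zeros((n, len(cols)))
    for i, row in enumerate(categories):
        for j in range(q):
            Z[i, col_index[(j, row[j])]] = 1.0
    # Correspondence analysis of Z.
    P = Z / Z.sum()
    r = P.sum(axis=1)                           # row masses
    c = P.sum(axis=0)                           # column masses (category frequencies)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates on the first (highest-variance) dimension.
    return (U[:, 0] * s[0]) / np.sqrt(r)

# Individuals with exactly the same categories receive identical scores
# (distance zero), as described above.
data = [("poor", "low"), ("poor", "low"), ("good", "high"), ("poor", "high")]
scores = mca_first_dimension(data)
```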
To analyze the socio-economic data, three main steps are followed. First, we use the HH data (a total of 1114 households living in 37 slums) to build the HH DIMD, to identify the deprivation domains which play the most crucial role in differentiating households, and to analyze to what extent households belonging to one slum are homogeneous. Second, we use the QS data (a total of 121 slums), which focus only on the physical and contextual domains of deprivation, to build the QS DIMD. Third, we explore the correlation between the HH and QS DIMDs to assess how meaningful it is to rely only on physical and contextual information to analyze the slums' degree of deprivation in Bangalore. To do this, we use the 26 common samples and compute the Pearson correlation. Samples are bootstrapped 1000 times to derive confidence intervals.
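The bootstrapped Pearson correlation can be sketched as follows. This is a percentile-bootstrap illustration with names of our choosing; the exact resampling details used in the study may differ.

```python
import numpy as np

def bootstrap_pearson_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap confidence interval for the
    Pearson correlation of paired samples (e.g., the 26 common HH/QS DIMDs)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)             # resample pairs with replacement
        xs, ys = x[idx], y[idx]
        if xs.std() == 0 or ys.std() == 0:      # degenerate resample, skip
            continue
        rs.append(np.corrcoef(xs, ys)[0, 1])
    lo, hi = np.percentile(rs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    point = np.corrcoef(x, y)[0, 1]
    return point, (lo, hi)
```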
Satellite images are used to predict the QS DIMD solely (and not the HH DIMD) based on two reasons: (1) There are very few HH samples available (i.e., 26 samples), which is an insufficient number to train and validate a CNN, and (2) the available samples were surveyed in 2010, so they are not representative of slums in 2017. Figure 3 shows that HH samples are mostly concentrated in the city center but most of the slums in 2017 are located at the periphery.

Building Image-Based Models to Predict the DIMD
A deep learning approach is used to analyze satellite images based on a CNN. To train a CNN, one of the most important issues is the "number of samples" [50]. In fact, studies usually use tens of thousands of samples to train CNNs (e.g., [51]). As there are only 121 samples with known DIMD values, we take advantage of 1461 delineated slum boundaries to develop a two-step transfer learning approach. We initially train a CNN with the ability to classify "slums" from "formal areas" using slum boundaries. By training such a network, we learn discriminative spatial features to separate slums from formal residential areas (and consequently, to separate more deprived areas from less deprived areas). Next, we transform the trained network to a regression model by changing its objective function from log-likelihood to Euclidean loss function, which changes the behavior of the network to work as a least square regression model. Based on transfer learning, we use the limited number of samples to fine-tune the new CNN parameters and predict the DIMD. Thus, we use our pre-trained network and its learned features to deal with the few samples available for our study. The process of training CNNs is elaborated in the following sections.
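The two-step idea can be illustrated with a deliberately simplified sketch: a frozen random ReLU projection stands in for the pre-trained convolutional layers, step 1 trains a classification head with a log-likelihood (cross-entropy) objective, and step 2 swaps the objective for a Euclidean (least-squares) loss and fits a regression head on few labelled samples. Everything here is synthetic and illustrative, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 8, 16
W_frozen = rng.normal(size=(d_in, d_feat))
extract = lambda X: np.maximum(X @ W_frozen, 0.0)   # frozen "feature extractor"

# Step 1: binary classification head ("slum" vs. "formal"),
# trained with a cross-entropy objective by gradient descent.
X_cls = rng.normal(size=(200, d_in))
y_cls = (X_cls[:, 0] > 0).astype(float)
F = extract(X_cls)
w_cls = np.zeros(d_feat)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-F @ w_cls))            # sigmoid probabilities
    w_cls -= 0.01 * F.T @ (p - y_cls) / len(y_cls)  # logistic gradient step
acc = float(np.mean((F @ w_cls > 0) == (y_cls > 0.5)))

# Step 2: swap the objective to a Euclidean loss, i.e., fit a least-squares
# regression head on few labelled samples (121 DIMD samples in the paper),
# reusing the same frozen feature extractor.
X_reg = rng.normal(size=(121, d_in))
y_dimd = X_reg[:, 0] + 0.1 * rng.normal(size=121)   # synthetic "DIMD"
F_reg = extract(X_reg)
w_reg, *_ = np.linalg.lstsq(F_reg, y_dimd, rcond=None)
mse = float(np.mean((F_reg @ w_reg - y_dimd) ** 2))
```

The point of the sketch is that the expensive part (the feature extractor) is trained once on the plentiful classification labels, while the small DIMD set only has to fit a new output head.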

Sample and Image Preparation
We initially train a CNN to classify slums from formal areas. Therefore, 1461 delineated slums are checked one by one on top of the images and slum boundaries are corrected where necessary.
We develop the following strategy to introduce samples of formal areas to the model. A 250 × 250 m tessellation is generated over the whole area and sampled using stratified random sampling, i.e., the area is divided into squares of 4 × 4 km and an equal number of tessellation cells is randomly selected within each square. This helps to reduce the effect of spatial autocorrelation by generating samples throughout the area, while keeping the samples representative by selecting them randomly within each square [52]. Using OSM, commercial and industrial areas are erased from the delineated formal areas. Thus, 611 polygons are prepared as formal samples. A buffer of 150 m is generated around slum samples and is removed from formal areas to avoid confusion when generating patches on polygons as inputs to the CNN (Figure 4a). This allows us to generate patches up to 200 × 200 pixels (100 × 100 m) on slums and formal areas with no overlap (see orange and red patches illustrated in Figure 4a). Figure 4b shows the final slum and formal samples prepared for this study.
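The stratified sampling strategy above can be sketched as follows; the study extent, the number of cells drawn per stratum, and the function name are illustrative assumptions, not the study's actual values.

```python
import numpy as np

def stratified_cell_sample(extent_m=(16000, 16000), cell=250, stratum=4000,
                           per_stratum=5, seed=0):
    """Tile the study area with `cell` x `cell` m cells, group them into
    `stratum` x `stratum` m squares, and draw an equal number of cells at
    random (without replacement) from each square. Returns cell origins in
    metres."""
    rng = np.random.default_rng(seed)
    cells_per_side = stratum // cell            # 16 cells per stratum side
    strata_x = extent_m[0] // stratum
    strata_y = extent_m[1] // stratum
    chosen = []
    for sx in range(strata_x):
        for sy in range(strata_y):
            # All cell origins inside this stratum.
            xs = sx * stratum + cell * np.arange(cells_per_side)
            ys = sy * stratum + cell * np.arange(cells_per_side)
            grid = [(x, y) for x in xs for y in ys]
            idx = rng.choice(len(grid), size=per_stratum, replace=False)
            chosen.extend(grid[i] for i in idx)
    return chosen

cells = stratified_cell_sample()
```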

We organize the satellite images for extracting CNN patches. The CNN uses a fixed square patch as input, so one cannot include samples of different sizes as inputs for the same network. However, since slums' sizes vary significantly, we develop our method in such a way that we can keep all slums of different sizes in our analysis. Based on [10], we generate a 20-m buffer around each sample and change all pixel values outside this buffer to zero for two reasons: (1) Many slums are located between formal areas, so extracted features would not exclusively belong to slums. Consequently, this mixture might bring confusion to the classification accuracy and the predictive model. As an example, the orange patch in Figure 4a is a slum patch but can contain a large number of formal areas depending on where the center point of the patch is located.
(2) The same patches can be used to build models based on hand-crafted and GIS features, so the outputs of the two models are more comparable (Figure 12 shows that some patches have zero values around slums).
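The buffer masking step can be sketched as below. This is a brute-force raster illustration (40 pixels ≈ 20 m at 0.5 m resolution) with an assumed function name; in practice, the study applied a GIS buffer to the sample polygons.

```python
import numpy as np

def mask_outside_buffer(patch, slum_mask, buffer_px=40):
    """Zero out all pixels farther than `buffer_px` from the slum polygon,
    mimicking the 20-m buffer around each sample. Brute-force sketch for
    small single-band rasters."""
    ys, xs = np.nonzero(slum_mask)
    out = np.zeros_like(patch)
    if len(ys) == 0:                            # no slum pixels: all zeroed
        return out
    h, w = slum_mask.shape
    for i in range(h):
        for j in range(w):
            d2 = (ys - i) ** 2 + (xs - j) ** 2  # squared distances to slum pixels
            if d2.min() <= buffer_px ** 2:
                out[i, j] = patch[i, j]         # inside the buffer: keep value
    return out
```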
Before generating patches from polygons (Figure 4), we randomly select two-thirds of our polygon samples for training/validation and one-third for testing. Therefore, these two sets are completely independent. Furthermore, for training and validation, we generate two independent point sets as patch centers, then extract patches accordingly.
To generate patches as input to the CNN, we first generate patch center points on the sample polygons, then extract each patch accordingly. A shallow CNN is trained using 1000 training and 1000 validation patches with patch sizes of 99, 129, and 165 pixels to find the optimal patch size. The shallow network [23] contains two convolutional layers followed by a fully-connected layer and a softmax classifier with a log-likelihood objective function (Figure 6).
In each convolutional layer, a 2D convolution is performed with shared weights and biases within a kernel as follows:

σ( b + Σ_{l=0}^{f−1} Σ_{m=0}^{f−1} w_{l,m} a_{j+l, k+m} )

where b is the shared bias, w_{l,m} is an f × f array with shared weights, a denotes the activation at each position within a kernel with the origin of (j, k), and σ is the rectified linear unit (ReLU) activation function. The process is followed by a max pooling layer with the size of 2 × 2.
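The convolution and pooling operations above can be transcribed directly (single channel, valid padding, stride 1); this is a didactic sketch with function names of our choosing, not the MatConvNet implementation.

```python
import numpy as np

def conv_relu(a, w, b):
    """2D convolution with shared f x f weights w and shared bias b,
    followed by ReLU -- a direct transcription of the equation above."""
    f = w.shape[0]
    H, W = a.shape
    out = np.zeros((H - f + 1, W - f + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = b + np.sum(w * a[j:j + f, k:k + f])
    return np.maximum(out, 0.0)                 # ReLU activation

def max_pool(x, size=2):
    """2 x 2 max pooling, as applied after each convolutional layer."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size).max(axis=(1, 3))
```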
Extracted features feed a one-dimensional fully-connected layer followed by a softmax classifier. Using the softmax classifier, in addition to the classification result, the network returns the probability of each patch belonging to each class as follows:

p_j = e^{z_j} / Σ_k e^{z_k}

where p_j is the probability of class j, z_j is the activation value of the corresponding output class, and the denominator sums the exponentiated activations over all classes k. The network is trained using the log-likelihood loss function as follows:

L = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} y_{ik} ln ŷ_{ik}

where n is the number of samples, K is the total number of classes, y_{ik} is the true vector, and ŷ_{ik} is the predicted vector by the network. Network parameters are optimized using the stochastic gradient descent method and the backpropagation algorithm [53].
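The softmax and log-likelihood loss can be written out as a short NumPy transcription of the two equations above (function names are ours):

```python
import numpy as np

def softmax(z):
    """Softmax over class activations z, returning class probabilities p_j."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def log_likelihood_loss(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood over n samples and K classes,
    with one-hot true vectors y_true and predicted probabilities y_pred."""
    n = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + eps)) / n
```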
To regularize the network, we use drop-out layers with a rate of 0.5 after each pooling layer [54]. We keep the drop-out rate of 0.5 throughout the analysis as our deep network is inspired by VGG [51], which uses this rate [55]. We initialize weights as sqrt(2/number of input neurons) based on [56] to prevent saturation in the network and to increase the learning pace. Moreover, we use higher learning rates for the first epochs and gradually decrease the rate as the learning curve converges to speed up training. The network is allowed to train for a maximum of 700 epochs to make sure that the loss function is minimized. We prevent overfitting by using drop-out and stochastic gradient descent with mini-batches. We also monitored both the training and validation loss functions after each epoch to make sure they follow a similar decreasing pattern. Figure 6 shows the architecture of the shallow CNN and Table 2 summarizes the network's hyper-parameters. Training is carried out with MATLAB and the MatConvNet library [57]. We compiled the networks on the GPU, which significantly improves the learning speed [57]. This study trains networks on an NVIDIA QUADRO 1000M GPU with the CUDA toolkit and cuDNN library.
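The weight initialization and step-wise learning-rate decay described above can be sketched as follows; the kernel shape, base rate, and decay schedule are illustrative assumptions, not the study's exact hyper-parameters:

```python
import numpy as np

def he_init(fan_in, shape, seed=0):
    """Draw weights with standard deviation sqrt(2 / fan_in), which keeps
    ReLU activations from saturating (He et al., as cited in [56])."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

def lr_schedule(epoch, base_lr=0.01, decay_every=100, factor=0.5):
    """Step decay: a larger rate in the first epochs, gradually reduced.
    The base rate and decay step here are illustrative assumptions."""
    return base_lr * factor ** (epoch // decay_every)

# 5x5 kernels over 4 input bands, 32 filters (hypothetical layer shape).
w = he_init(fan_in=5 * 5 * 4, shape=(5, 5, 4, 32))
print(w.shape)
print(lr_schedule(0), lr_schedule(350))
```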
We also take advantage of popular networks in the field of image recognition to train a deeper CNN. Since these networks only accept patches with three channels as input and our images have four channels, we train a network from scratch inspired by the visual geometry group network (VGG-F) [51] to solve our classification problem. The network is deep enough to solve the ImageNet large-scale visual recognition challenge (ILSVRC), but it is computationally not too expensive, so we can train it with four-channel inputs on a GPU with 2 GB of RAM. The original VGG networks use local response normalization (LRN) [51], but we use batch normalization (BNorm) instead, since it is more effective [58] (Figure 7). Both the shallow and deep CNNs are trained using 2000 training and 2000 validation patches to compare the performance of the two networks.
Using image augmentation, we increase the number of training patches [59]. Based on [26], each patch is rotated in seven directions: 7, 90, 97, 180, 187, 270, and 277 degrees, with linear interpolation. The deep network is therefore trained again using 16,000 training patches to explore any improvement from image augmentation. The accuracy of the best-performing network is assessed using 2000 independent test patches.
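The rotation-based augmentation can be sketched with a small bilinear (linear-interpolation) rotation; this NumPy version is illustrative and fills out-of-range samples with zeros, whereas the study's exact resampling details may differ:

```python
import numpy as np

def rotate_bilinear(img, angle_deg):
    """Rotate a single-band patch about its center with bilinear
    (linear) interpolation; out-of-range samples are filled with 0."""
    t = np.deg2rad(angle_deg)
    H, W = img.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ys, xs = np.mgrid[0:H, 0:W]
    # inverse mapping: output pixel -> source coordinate
    sy = cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)
    sx = cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    wy, wx = sy - y0, sx - x0
    out = np.zeros_like(img, dtype=float)
    valid = (y0 >= 0) & (y0 < H - 1) & (x0 >= 0) & (x0 < W - 1)
    y0v, x0v, wyv, wxv = y0[valid], x0[valid], wy[valid], wx[valid]
    out[valid] = (img[y0v, x0v] * (1 - wyv) * (1 - wxv)
                  + img[y0v + 1, x0v] * wyv * (1 - wxv)
                  + img[y0v, x0v + 1] * (1 - wyv) * wxv
                  + img[y0v + 1, x0v + 1] * wyv * wxv)
    return out

angles = [7, 90, 97, 180, 187, 270, 277]  # the seven rotations used
patch = np.random.default_rng(1).random((33, 33))
augmented = [patch] + [rotate_bilinear(patch, a) for a in angles]
print(len(augmented))  # 8 patches per original, hence 2000 -> 16,000
```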
We use the best-performing network to solve the regression problem and predict the DIMD by transfer learning. The loss function is changed to the Euclidean loss as follows:

\[ E = \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

where y_i is the observed DIMD value of sample i and ŷ_i is the network's prediction. As there are only 121 samples to fine-tune the network, we use 10-fold cross-validation to assess the performance. To evaluate the overall predictive power of the model, we calculate the coefficient of determination (R²) on the validation samples. We use R² to evaluate our models as it is a common way to assess a model's performance in this field (e.g., [19,31,32]), and the measure is unitless, which makes the results more comparable across study areas and with the results of similar studies.
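The Euclidean loss, R² evaluation, and 10-fold split can be sketched as follows; a NumPy version with the reported sample count of 121 (the loss normalization by 1/2n is a common convention and an assumption here):

```python
import numpy as np

def euclidean_loss(y_true, y_pred):
    """E = 1/(2n) * sum_i (y_i - yhat_i)^2, the regression objective
    that replaces the softmax/log-likelihood head when fine-tuning."""
    return 0.5 * np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out samples."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n, k=10, seed=0):
    """Shuffled 10-fold split, e.g., for the 121 DIMD samples."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(121)
print([len(f) for f in folds])  # 121 samples split into 10 folds
```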

PCR Models Using Hand-Crafted and GIS Features
This study employs principal component regression (PCR) models using hand-crafted image features and GIS features extracted from the slum patches to (1) compare the result of building models based on these features with the CNN results, and (2) explore possible improvements these features can bring to the CNN results. Training PCR models enables the use of many features, reducing them to a few components, and building regression models with those components to predict the DIMD. Table 3 lists the extracted features, covering three groups:

• Spectral information (Table 3; Spectral info.).

• Two sets of the most common texture features: the grey level co-occurrence matrix (GLCM) and local binary patterns (LBP). We generate GLCM features in four directions and four lags (i.e., 1 to 4 pixels) and, based on [17], calculate three properties (entropy, variance, and contrast) on each feature. We calculate the GLCM properties on each band of a patch and consider the mean value as the property value (Table 3; GLCM).

• To include LBP features in the model, we extract only uniform patterns (with a maximum of two transitions), which provide the most important textural information about an image [60]. Based on [18], we calculate LBP riu2 8,1 (i.e., rotation-invariant uniform patterns with a radius of 1 and eight neighbors), LBP riu2 16,2, and LBP riu2 24,3 with linear interpolation. We average the extracted LBP of each band to obtain the value for a patch, considering the whole patch as one cell (Table 3; LBP).

• GIS features: as the road data are not consistent enough to perform a network analysis, we calculate the minimum Euclidean distance from each public service/land use (Table 3; GIS) to a patch's center point. Distances to different land uses and public services have been used to calculate the degree of deprivation of settlements, especially in UK deprivation indices (e.g., [44]). We consider the town hall as the center of the city, which is very close to the geographic center of the city. Using the elevation layer, we calculate the mean elevation and mean slope within each patch. The GIS group covers accessibility (e.g., distance to schools and leisure activities), centrality (distance to the town hall), and environment (distance to waterbodies, mean elevation, and mean slope) (Table 3; GIS).

We use the extracted features to feed stepwise PCR models. Different combinations of features are trained, with different numbers of components and different model complexities (pure linear; linear allowing interaction, i.e., the multiplication of two components as a new variable; and quadratic allowing interaction) (Figure 8). For the evaluation, 10-fold cross-validation is used.
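The stepwise PCR idea (standardize, project onto the leading principal components, regress) can be sketched as follows; the feature matrix is a random stand-in, and the component count of 11 echoes the best model reported later:

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_components):
    """Principal component regression: standardize the features,
    project onto the leading principal components (via SVD), then
    fit ordinary least squares on the component scores."""
    mu, sd = X_train.mean(0), X_train.std(0) + 1e-12
    Xs = (X_train - mu) / sd
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    comps = Vt[:n_components].T
    Z_train = Xs @ comps
    Z_test = ((X_test - mu) / sd) @ comps
    A = np.column_stack([np.ones(len(Z_train)), Z_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(Z_test)), Z_test]) @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(121, 30))  # toy stand-in for hand-crafted + GIS features
y = X[:, 0] * 0.8 + rng.normal(scale=0.1, size=121)
pred = pcr_fit_predict(X[:100], y[:100], X[100:], n_components=11)
print(pred.shape)  # predictions for the held-out samples
```

Interaction terms (the "linear allowing interaction" variant) would simply add products of pairs of component scores as extra columns before the least-squares step.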

Ensemble Regression Models
We build ensemble regression models using the outputs of the best-performing CNN and PCR models. These ensemble models are trained, varying the complexity from linear to polynomial.

DIMDs
Figure 9a shows the squared correlation (R²) between the HH indicators and the HH DIMD. It shows that electricity, floor material, wall material, toilet, roof material, and water sources have the largest contribution to the HH DIMD and explain most of the variation across the 1114 households. Interpreting households along the HH DIMD gives an overview of the variations of households within a slum and confirms the meaningfulness of aggregating DIMD values for a slum settlement. Figure 9b plots the aggregated HH DIMD value of the households within each slum by calculating the mean and standard deviation. Slums with a value around the origin (i.e., zero) have the most common (or average) pattern of categories, and slums with high and low values are clearly distinct. Considering the slums along the HH DIMD and comparing the values with the ground situation and HH data, we find that negative HH DIMD values represent worse-off slums and positive HH DIMD values represent better-off slums in terms of deprivation. Regarding the range of values, worse-off slums differ markedly from the common pattern, but this is not the case for better-off slums. Figure 9b also shows that the internal variation in the better-off slums is less than in the worse-off slums. Although these variations are quite high in some cases, considering the standard deviations, it is meaningful to measure one single value as the DIMD for each slum because households living in one slum mostly have close DIMD values, meaning they have a similar situation in terms of basic services like electricity and the construction materials of the dwellings (Figure 9a).

By performing MCA on the QS samples, which are more diverse and have larger spatial coverage than the HH samples, we create a more comprehensive pattern of the deprivation of slums. Figure 10 shows the QS samples with their QS DIMD values on a map. It also shows four sample photos (taken during the QS fieldwork), ranging from the smallest to the largest DIMD values. Considering the value range of the DIMD, better-off slums are significantly different from the common pattern, but the worse-off slums are less different (see Figure 10, sample number 2, with a value around zero, and compare it to the high-end and low-end values).
This is also shown by the photos displaying the ground situation. This means we are 95% confident that the two DIMDs are positively correlated across all slums in Bangalore, with a coefficient in the range of [0.28, 0.82]. In this sense, the two DIMDs both describe deprivation (as they are correlated), and it is meaningful to use the QS DIMD as a measure of deprivation. However, they look at the deprivation concept from different perspectives, meaning one cannot fully explain the variations of the other. This is indicated by an R² of 0.40 [0.08, 0.67].
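A confidence interval of this kind is commonly obtained with the Fisher z-transform; a sketch assuming n = 26 overlapping samples and r = sqrt(0.40), which yields an interval broadly consistent with (though not identical to) the reported [0.28, 0.82]:

```python
import math

def pearson_ci(r, n, z_crit=1.96):
    """95% confidence interval for a Pearson correlation via the
    Fisher z-transform: z = atanh(r), se = 1 / sqrt(n - 3)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Assumed inputs: r from the reported R² of 0.40, n = 26 checked samples.
lo, hi = pearson_ci(math.sqrt(0.40), 26)
print(round(lo, 2), round(hi, 2))
```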
We should also consider the temporal gap between the HH and QS data. Although the 26 samples are checked using Google Earth to ensure they have not significantly changed since 2010, there is a possibility that some of them experienced changes which are not visible in satellite images.

CNN-Based Model Performance
The results of training the shallow and deep CNNs are shown in Figure 11. Figure 11a shows the results of the shallow CNNs with different patch sizes. The patch size of 129 results in the highest accuracy on the validation set, and thus we use this patch size to train the shallow and deep CNNs again with 2000 training/validation samples. Figure 11b shows the accuracy obtained using the shallow network with 2000 training samples, the deep network with 2000 training samples, and the deep network with image augmentation and 16,000 training samples. Comparing the performance of the shallow network with the deep one using the same number of samples, the classification error drops by almost 50% (from 7.00% to 3.50%). This shows the advantage of using deeper networks and extracting more abstract features. Taking advantage of image augmentation, the classification error drops by almost 40% (from 3.50% to 1.44%), and we reach an overall accuracy of 98.56% on the validation set.

Using the 2000 test patches on the best-performing CNN, we reach an accuracy of 98.40%. Figure 12 shows some slum patches (test set) classified by this network. All these patches are slums, but some are incorrectly classified as formal. The percentages below the patches show the confidence of the network in classifying these patches as slums (derived from the softmax layer). Scores of less than 50% result in classifying patches as formal. Patches like number 1 are clearly classified as slums. They have very distinct characteristics, with small dwellings and irregular patterns, easily distinguishable from formal areas. Slums like number 2, with some regular patterns, are classified as slums with less confidence. Patch 3 is challenging, containing small slums between formal areas. Although it is not easy to identify the slum area between the formal areas, it is also correctly classified.
Patches 4 and 5 show almost the same situation, but the dwellings in patch 5 are tiny and cannot even be confidently recognized by sight. Patches like number 6 completely confuse the network, as they have larger dwellings with some regular patterns. Overall, only 1.92% of the slum patches (19 out of 1000 patches) are classified incorrectly (Figure 12).
We fine-tune the pre-trained CNN, with its learned distinctive features, to directly predict the QS DIMD values, and train each network for 100 epochs to ensure convergence. The model predicts the QS DIMD with an R² of 0.67.

PCR Model Performance
As a supplementary step, we train PCR models based on the manually extracted hand-crafted and GIS features. We develop three categories of models: using only hand-crafted features, using only GIS features, and using both hand-crafted and GIS features (Figure 13).
Using only hand-crafted features, an R² of 0.38 is obtained. Relying only on GIS features, the model reaches an R² of 0.25. The best model is trained using both hand-crafted and GIS features, involving 11 components in a linear regression that delivers an R² of 0.52. R² values below −1 are plotted as −1 for better visualization.



Figure 13. Performance of the PCR models. Interaction means allowing the multiplication of the components. The best result is obtained by combining both GIS and hand-crafted image features in a linear regression using 11 components.

Ensemble Models
We build ensemble models with different complexities in three combinations: CNN + hand-crafted + GIS, CNN + hand-crafted, and CNN + GIS (Table 4). The best result is obtained using the combination of all three categories of features in a 3rd-degree polynomial regression, with an R² of 0.75. To explore our final ensemble model (with an R² of 0.75), we visualize a scatter plot of the QS DIMD and predicted values in Figure 14, as well as the six worst-off and six best-off slums predicted by the model in Figure 15. The model has an RMSE of 0.53 and a bias of 0.20. The bias is calculated by averaging the signed errors of all predictions and shows that, on average, the model tends to predict values 0.2 units higher than the observed values. The RMSE shows that the average error in each prediction is 0.53 units. Because there is less variation on the negative side of the QS DIMD, the model performs better there, which can be confirmed by comparing the QS DIMD and predicted values. On the positive side, although the DIMD is mostly predicted well, there is still some confusion in the model (see best-off slums number 2 and 3 in Figure 15). Comparing numbers 2 and 4 of the best-off slums in Figure 15, the patches are very similar, and the indicators that make number 2 worse than number 4 in the DIMD might not be visible in remote sensing images. The error can also come from the fieldwork, as in the QS fieldwork the surveyor was standing at one point and reported what was visible; this point may differ from the typical structure of the slum, especially in large settlements. Another source of error can be the use of a fixed square patch to extract features from all the slums. Some settlements are large, and a patch covers only a small part of them; others are tiny, and even when we consider their context (i.e., a 20-m buffer), a patch is bigger than the area of analysis.
Thus, we ignore all of a large settlement's area except the patch located at its center, and the patch may not represent the whole area of the settlement.
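The two summary errors (RMSE and bias) can be computed as follows; the values here are toy numbers, not the model's predictions:

```python
import numpy as np

def rmse_and_bias(y_true, y_pred):
    """RMSE: average magnitude of each prediction's error.
    Bias: mean signed error; positive means over-prediction on average."""
    err = y_pred - y_true
    return np.sqrt(np.mean(err ** 2)), np.mean(err)

y_true = np.array([-1.0, 0.0, 0.5, 1.2])  # toy DIMD values
y_pred = np.array([-0.6, 0.3, 0.6, 1.5])
rmse, bias = rmse_and_bias(y_true, y_pred)
print(round(rmse, 3), round(bias, 3))
```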
To have a more in-depth look at the performance of the models and to assess their generalization capability, the prediction errors are plotted and explored. We focus on the model created by the CNN with an R² of 0.67, the model created with hand-crafted and GIS features with an R² of 0.52, and the ensemble model created by combining the CNN, hand-crafted, and GIS results with an R² of 0.75. One crucial assumption to consider when one wants to generalize a model is homoscedasticity, i.e., expecting the same variance of residuals across the whole range of predicted values [61].
Figure 16 plots the standardized residuals over the predicted values of the three models. Optimally, the points should be scattered without any systematic pattern; however, our plots violate this assumption, and we find different patterns for negative and positive predicted values. The errors are less evenly distributed on the negative side for the models based on the CNN (Figure 16a) and on hand-crafted + GIS features (Figure 16b). Nevertheless, in all plots, the residuals show different patterns on the negative and positive sides. Comparing Figures 10 and 16, less difference (more homogeneity) results in less variance in the residuals. The worse-off slums are more similar to each other, so their predictions also have smaller errors. However, the better-off slums are very different from each other. Comparing photos 2 and 4 in Figure 10, there is a wide difference between the average situation (i.e., a value of 0) and the best-off slum. Therefore, the residuals have more variance, and the predictions are less accurate; for instance, for photos 2 and 3 of the best-off slums in Figure 15, the model has larger errors.
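The homoscedasticity check behind Figure 16 can be sketched by comparing the residual variance on the two sides of the predicted DIMD; the data below are synthetic, constructed to mimic the heteroscedastic pattern described:

```python
import numpy as np

def standardized_residuals(y_true, y_pred):
    """Center the residuals and scale them by their sample standard deviation."""
    res = y_true - y_pred
    return (res - res.mean()) / res.std(ddof=1)

def side_variances(y_pred, std_res):
    """Compare the residual variance on the negative vs. positive side of
    the predicted DIMD; equal variances would support homoscedasticity."""
    neg = std_res[y_pred < 0]
    pos = std_res[y_pred >= 0]
    return neg.var(ddof=1), pos.var(ddof=1)

rng = np.random.default_rng(0)
y_pred = rng.uniform(-2, 2, 121)
# Toy residuals whose spread grows on the positive side, as in Figure 16.
y_true = y_pred + rng.normal(scale=np.where(y_pred < 0, 0.2, 0.6), size=121)
v_neg, v_pos = side_variances(y_pred, standardized_residuals(y_true, y_pred))
print(v_neg < v_pos)  # heteroscedastic by construction
```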

Discussion
This study proposes data-driven methods for creating a deprivation index as an alternative to classical deprivation indices, many of which suffer from subjective weight assignment, as well as for predicting the index from VHR satellite images using a CNN. The aim is to build a comprehensive understanding of the socio-economic variations of slums, covering multiple domains of deprivation, and to provide information for designing targeted policies to support, upgrade, and monitor such settlements. We show the ability of VHR satellite images to predict the degree of deprivation even for tiny slums with few dwellings. Our method can capture the deprivation of any type of slum in Bangalore, regardless of its size. The proposed two-step transfer learning process makes it possible to train a CNN for predicting the DIMD with only a small number of training samples, a clear innovation compared to other studies that use most of an area for training in order to predict only a small remaining part. This helps to take advantage of deeply learned features in solving problems related to slum studies, which always suffer from limited data availability. Most studies focus on distinguishing slums from formal settlements as a binary classification and do not offer the possibility of estimating a continuous index to characterize the deprivation level using CNNs (e.g., [29,31]).
We use the MCA method to build deprivation indices by a data-driven approach with few assumptions. Categorical indicators are used without manipulation, and index values are assigned to individuals based on the patterns of categories. We find that relying only on the pattern of data, without pre-assumptions like ordering categories and assigning pre-defined weights, it is possible to distinguish the better-off, the worse-off, and the average situation of slum settlements with their relative differences. For instance, [35] ordered categorical data, transferred them to ordinal data, and aggregated them using descriptive statistics to build a slums index. Similarly, [43] aggregated indicators of a deprivation index with equal weights. Although these methods are very common and based on experts' knowledge, they might include many assumptions and bias the result. Based on our data-driven approach, the importance of deprivation dimensions is not the same. Comparing the DIMD values with the ground situation, we can empirically prove that the first dimension of deprivation provides a meaningful measure, showing the variability of deprived areas across a quantitative range (see supplementary materials Section S3 for ground photos of slums with respective QS DIMD values).
We show that the indicators related to the physical capital play the most crucial role in distinguishing slum types. These two domains also shape the visible features from the satellite image.

Discussion
This study proposes data-driven methods for creating a deprivation index as an alternative to classical deprivation indices, many of which suffer from subjective weight assignment, and for predicting the index from VHR satellite images using a CNN. The aim is to build a comprehensive understanding of the socio-economic variations of slums, covering multiple domains of deprivation, and to provide information for designing targeted policies to support, upgrade, and monitor such settlements. We show the ability of VHR satellite images to predict the degree of deprivation even for tiny slums with few dwellings. Our method can capture the deprivation of any type of slum in Bangalore, regardless of its size. The proposed method handles the small number of available training samples for training a CNN to predict the DIMD through a two-step transfer learning process. This is a clear innovation compared to other studies, which typically use most of a small dataset for training and predict only the small remainder. It helps to take advantage of deeply learned features in slum studies, which routinely suffer from limited data availability. Most studies focus on distinguishing slums from formal settlements as a binary classification and do not offer the possibility of estimating a continuous index to characterize the deprivation level using CNNs (e.g., [29,31]).
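The two-step transfer learning idea can be sketched as follows. The architecture below is a deliberately tiny placeholder, not the paper's actual network; the point is the mechanism: train a CNN with a 2-class head on the 1461 slum/formal samples, then replace the head with a single regression output and fine-tune on the 121 DIMD samples.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Minimal illustrative CNN for 4-band VHR patches."""
    def __init__(self, out_dim):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, out_dim)

    def forward(self, x):
        return self.head(self.features(x))

model = PatchCNN(out_dim=2)      # step 1: slum vs. formal classification
# ... train on the 1461 delineated slum/formal samples (omitted) ...
model.head = nn.Linear(16, 1)    # step 2: swap in a regression head for the DIMD
# ... fine-tune on the 121 samples with known DIMD values (omitted) ...
pred = model(torch.randn(8, 4, 64, 64))   # 8 four-band patches (shape illustrative)
```

The convolutional features learned in step 1 are reused, so step 2 only has to adapt them to the regression target rather than learn them from scratch.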
We use the MCA method to build deprivation indices with a data-driven approach and few assumptions. Categorical indicators are used without manipulation, and index values are assigned to individuals based on the patterns of categories. We find that, relying only on the pattern of the data, without pre-assumptions such as ordering categories or assigning pre-defined weights, it is possible to distinguish the better-off, the worse-off, and the average situation of slum settlements along with their relative differences. For instance, [35] ordered categorical data, transformed them into ordinal data, and aggregated them using descriptive statistics to build a slum index. Similarly, [43] aggregated the indicators of a deprivation index with equal weights. Although these methods are very common and based on expert knowledge, they may embed many assumptions and bias the result. Our data-driven approach shows that the importance of the deprivation dimensions is not the same. Comparing the DIMD values with the ground situation, we can empirically show that the first dimension of deprivation provides a meaningful measure, capturing the variability of deprived areas across a quantitative range (see Supplementary Materials Section S3 for ground photos of slums with their respective QS DIMD values).
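The first MCA dimension used for the index can be computed as correspondence analysis of the one-hot indicator matrix. The sketch below is a minimal implementation of that idea, not the authors' code; the indicator table is a hypothetical toy example.

```python
import numpy as np
import pandas as pd

def mca_first_dimension(df):
    """First-dimension row scores of MCA, via correspondence analysis
    (SVD of the standardized residual matrix) of the indicator matrix."""
    Z = pd.get_dummies(df).to_numpy(dtype=float)   # one-hot indicator matrix
    P = Z / Z.sum()                                # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)            # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, 0] * sv[0]) / np.sqrt(r)          # principal row coordinates

# Hypothetical categorical indicators for a handful of settlements:
df = pd.DataFrame({
    "roof":  ["tin", "tin", "concrete", "concrete", "tin"],
    "water": ["shared", "none", "piped", "piped", "shared"],
})
scores = mca_first_dimension(df)
```

Note that no ordering or weighting of the categories is supplied: settlements with identical category patterns receive identical scores, and the spread of the scores emerges purely from the data, which is exactly the property the data-driven index relies on.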
We show that the indicators related to physical capital play the most crucial role in distinguishing slum types. These two domains also shape the features visible in the satellite image. Therefore, the method rests on the assumption that these two domains contribute to building the DIMD.
There are also some limitations to using MCA. Our approach is meaningful only when a set of representative samples is available. Non-representative samples, such as the HH samples, which do not represent slums in 2017 well, result in less meaningful DIMD values. Considering Figures 9b and 10 and comparing the ranges of the two DIMDs, almost opposite trends of skewness appear. The average situation of the HH DIMD is closer to the better-off slums, whereas for the QS DIMD the better-off slums differ significantly from the average situation. This shows that the HH samples mostly cover the better-off slums and do not sufficiently represent all slums in Bangalore in 2017. In fact, the HH samples cover only the notified slums, which have often received upgrading from the government, whereas the QS samples also include non-notified slums that are absent from official slum data in India. It is also important that the data contain only deprivation-related indicators. If unrelated indicators are included and contain information that differs strongly across slums, they will bias the result significantly. For non-representative or poorly distributed samples, less data-dependent options to create an index, which incorporate expert knowledge, can be used instead, for example, the analytic hierarchy process (AHP) [62].
Using all hand-crafted + GIS features, 11 components, and a linear PCR model, we obtain an R² of 0.52 when predicting the DIMD. The CNN, with an R² of 0.67, outperforms our complex PCR models. For a fairer comparison, the CNN result should be compared with the regression using only hand-crafted features, as the CNN also does not consider where the patches are located. In this case, we obtain an R² of 0.38 from hand-crafted features using a quadratic model, compared with an R² of 0.67 from the CNN. This shows the advantage of the deeply learned CNN features over hand-crafted features and the PCR model. More sophisticated hand-crafted features, as extracted in studies such as [10] and [20], might improve the result. The features we use to train the PCR models can be generated for any study area, but the importance of the features might differ. For the transferability of the method, future studies can focus on the overall methodology instead of single features. We show that adding GIS layers to hand-crafted features can boost the result significantly. Besides regression models of different complexities, it is worth exploring the ability of other machine learning methods to predict the DIMD. For example, [19] showed that GBR and RF could outperform a linear regression model, so such algorithms are worth exploring in further studies. Furthermore, future studies can train CNNs while optimizing more parameters (and hyperparameters). For instance, we used drop-out layers with a rate of 0.5, but lower values are also worth exploring.
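The PCR baseline described above (standardized features, 11 principal components, linear regression) can be set up as a standard scikit-learn pipeline. The feature matrix and target below are synthetic placeholders, not the study's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 121 samples (as in the study) with 30 illustrative
# hand-crafted + GIS features; the target mimics a continuous index.
rng = np.random.default_rng(42)
X = rng.normal(size=(121, 30))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=121)

# Principal component regression: scale, reduce to 11 components, regress.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=11)),
    ("reg", LinearRegression()),
])
pcr.fit(X, y)
r2 = pcr.score(X, y)   # in-sample R²; the study reports cross-validated values
```

Keeping the dimensionality reduction inside the pipeline ensures that the scaling and PCA are refit on each training fold during cross-validation, avoiding information leakage into the held-out samples.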
Using the combination of hand-crafted + CNN features, the R² remains 0.67. This means the hand-crafted features cannot improve on the CNN-extracted features. However, the GIS layers improve the CNN result to 0.71, meaning that they add information the CNN does not capture. The best result, an R² of 0.75, is obtained using CNN + hand-crafted + GIS in a third-degree polynomial model that allows interactions between variables. This shows that, although hand-crafted features cannot improve the result of the CNN on their own, their interactions with GIS features can improve the model. We conclude that using GIS layers alongside the CNN can improve the model, as a CNN does not by itself consider the spatial location of the patches. We also show that the CNN-based model has a greater generalization capability than the PCR model, as its error values are more normally distributed. Although the BIAS value of the best-performing model is low (0.20), the RMSE value (0.53) shows that there is a risk of treating worse-off slums as more common settlements, as the value range of the worse-off slums is very narrow (i.e., between −1.10 and 0). This reveals a potential to overestimate the model performance if only the R² value is considered, especially for samples on the negative side. Figure 16 shows that slums with positive DIMD values have distinct error patterns compared with those on the negative side. This confirms that slums form groups/types that are very different from each other and that the label "slum" covers very diverse areas. In the case of Bangalore, our analysis shows two main groups of slums, but research on other cities might find more groups, leading to a more diverse typology. Spatial independence of samples is a crucial assumption in regression models, and it also affects the result of the accuracy assessment.
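The ensemble step, a third-degree polynomial regression over the stacked base-model predictions with interaction terms, can be sketched as below. All inputs are synthetic placeholders standing in for the CNN, hand-crafted, and GIS predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins: a "true" index and three noisy base-model predictions.
rng = np.random.default_rng(1)
y = rng.normal(size=121)                       # illustrative DIMD values
preds = np.column_stack([
    y + rng.normal(scale=0.4, size=121),       # stand-in for CNN predictions
    y + rng.normal(scale=0.6, size=121),       # stand-in for hand-crafted model
    y + rng.normal(scale=0.6, size=121),       # stand-in for GIS model
])

# Degree-3 polynomial expansion includes all pairwise and three-way
# interaction terms between the base predictions.
ensemble = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
ensemble.fit(preds, y)
r2 = ensemble.score(preds, y)
```

The interaction terms are what allow, e.g., the hand-crafted predictions to contribute jointly with the GIS predictions even when they add nothing on their own, which is the behavior observed in the ensemble results.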
Looking at Figure 10 (QS samples), we can see spatial clusters of slums, which indicates spatial dependency among the samples. The problem has two sources. First, slums are not evenly distributed across the city (which is very common in cities with slum areas). Second, to maximize the number of collected samples within the available time, we collected samples from selected clusters. To deal with the problem of autocorrelation, we sampled dispersed clusters and selected random samples within clusters. To create folds, we selected slums randomly from all clusters. We also created each CNN patch from a different slum sample to ensure that there is no overlap, i.e., that patches are not created from the same slum (avoiding a violation of the independence requirement). Therefore, our folds are less spatially correlated. Furthermore, we added the GIS layers, which provide insights into the spatial organization of the samples; in effect, we used the GIS layers as spatial components of the model. Further studies might introduce spatial components using other methods, such as the Lagrange multiplier [7].
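A common way to build such cluster-aware folds is to give all samples from the same spatial cluster a shared group id and split by group, so no cluster is divided between training and test. This is a generic sketch of that idea (scikit-learn's `GroupKFold`, with illustrative cluster ids), not the study's exact fold-construction code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Illustrative setup: 121 samples drawn from 10 hypothetical spatial clusters.
rng = np.random.default_rng(7)
cluster_id = rng.integers(0, 10, size=121)
X = rng.normal(size=(121, 5))

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, groups=cluster_id):
    # No cluster may appear on both sides of a split.
    overlap = set(cluster_id[train_idx]) & set(cluster_id[test_idx])
    assert not overlap
```

Splitting by cluster rather than by sample keeps spatially correlated neighbors out of the test fold, giving a less optimistic accuracy estimate.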
Relying on data-driven approaches, the method is transferable to other contexts. Both MCA and the CNN analyze data without the need for manipulation, so they can be fed with new samples. Methods relying on hand-crafted features are always context-specific and rarely transferable to another context (e.g., [10,31,63]). The CNN, in contrast, automatically extracts features based on the training samples and thus determines the most important features itself. The CNN hyperparameters and patch size need tuning, but the overall approach remains the same. With the same approach, it is also possible to measure the deprivation/socio-economic characteristics of areas other than slums (where socio-economic data are available). In some contexts, there are areas that are deprived but not administratively labeled as slums [34]. Thus, it is relevant to apply this method to the whole range of existing settlements in a specific context and explore their relative differences.

Conclusions
This study analyzed the relationship between slum variations, seen from the perspective of deprivation, and image-based features. The use of two main data-driven approaches (i.e., MCA and CNN) to analyze socio-economic data and VHR satellite images distinguishes this research from related previous studies. The combination of the two methods yields a holistic methodological framework that uses survey data and satellite images as inputs and performs the analyses without the need for data manipulation. This, coupled with our two-step transfer learning approach to train a deep CNN with a limited number of training samples, resulted in an R² of 0.75 when predicting the slums' degree of deprivation in Bangalore from VHR imagery. The high diversity of slum settlements makes it challenging to build a single unbiased model to predict the degree of deprivation. Although the model behaved differently when predicting better-off and worse-off slums, the proposed method opens the door to building models with a better generalization capability. To deal with the issue of heteroscedasticity, further studies can follow two possible solutions: first, using common statistical methods such as transforming the DIMD values (e.g., a log transformation) or switching from ordinary to weighted least squares regression, which could be integrated into the CNN and PCR models; second, dividing slums into worse-off and better-off groups (i.e., slums with negative and positive DIMD values, respectively), making it more likely that models with less biased predictions can be created. However, this second solution needs more samples, especially samples with positive DIMD values. The study simplified the slum samples by creating a 20 m buffer and setting the pixel values outside the buffer to zero. Although we added this step to avoid confusing the CNN models, further studies can train CNN models that work without this pre-processing step.
This study combined GIS layers with the CNN in an ensemble model; however, the possibility of integrating the GIS layers within the CNN framework is worth exploring. One option for further studies is to create GIS features and add them to the original patches as new image channels: instead of a 4-channel image, the CNN would receive additional GIS-based channels as input and solve the regression problem directly. This requires high computational capacity, as adding new channels increases the number of CNN parameters significantly. Furthermore, the transferability to other contexts, or a focus on settlement types other than slums, needs further exploration. With respect to the creation of the DIMD, we used the first dimension produced by MCA to analyze deprivation along two sides (i.e., positive and negative). Further studies can add the second dimension, analyzing individuals in a two-dimensional space along four sides, and predict both dimensions using transfer learning. Ultimately, the developed model enables a deeper understanding of slums in Bangalore and can help policymakers prioritize and establish pro-poor, people-based policies that address people's needs. It characterizes spatial patterns of deprivation from satellite images and helps to understand where deprivation is located and what the targets of upgrading programs should be. Furthermore, the output of this model can feed other models, such as agent-based models, to simulate and predict the dynamics of such settlements. The results of this work can also be connected to studies on health, well-being, and urban land-use modeling, creating a basis that helps policymakers establish effective policies for enhancing the quality of life.
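The channel-stacking extension suggested above amounts to concatenating rasterized GIS layers onto the multispectral patch along the channel axis. The sketch below illustrates only the data preparation; the shapes, patch size, and layer choices are illustrative assumptions.

```python
import numpy as np

# A 4-band VHR image patch in channels-first layout (bands, rows, cols).
patch = np.zeros((4, 64, 64), dtype=np.float32)

# Hypothetical GIS layers rasterized to the same grid, e.g., distance-to-road
# and distance-to-water rasters.
gis_layers = np.zeros((2, 64, 64), dtype=np.float32)

# Stack along the channel axis: the CNN would then take 6-channel inputs,
# at the cost of a larger first convolutional layer.
augmented = np.concatenate([patch, gis_layers], axis=0)
```

The first convolutional layer of the network would then need its input-channel count raised from 4 to 6, which is the source of the additional parameters mentioned above.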
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Section S1: HH data; Section S2: QS data; Section S3: Ground photos from QS samples.
Author Contributions: A.A. performed the data analyses and wrote the majority of the paper. M.K., C.P., and K.P. supported in developing the structure of the paper, supervision, and revising the paper.

Funding: We would like to acknowledge the support of the SimCity project (contract number: C.2324.0293) for data collection and the support from the NWO/Netherlands eScience Center funded project DynaSlum-Data Driven Modelling and Decision Support for Slums-under the contract number 27015G05.