Upscaling Household Survey Data Using Remote Sensing to Map Socioeconomic Groups in Kampala, Uganda

Abstract: Sub-Saharan African cities are expanding horizontally, demonstrating spatial patterns of urban sprawl and socioeconomic segregation. An important research gap around the geographies of urban populations is that city-wide analyses mask local socioeconomic inequalities. This research focuses on those inequalities by identifying the spatial settlement patterns of socioeconomic groups within the Greater Kampala Metropolitan Area (Uganda). Findings are based on a novel dataset, an extensive household survey with 541 households, conducted in Kampala in 2019. To identify different socioeconomic groups, a k-prototypes clustering method was applied to the survey data. A maximum likelihood classification method was applied on a recent Landsat-8 image of the city and compared to the socioeconomic clustering through a fuzzy error matrix. The resulting maps show how different socioeconomic clusters are located around the city. We propose a simple method to upscale household survey responses to a larger study area, to use these data as a base map for further analysis or urban planning purposes. Obtaining a better understanding of the spatial variability in socioeconomic dynamics can aid urban policy-makers to target their decision-making processes towards a more favorable and sustainable future.


Introduction
Over half the world's population lives in cities, and urban areas are currently expanding much faster in developing countries than in developed countries. The population of African (mega-) cities is expected to triple by 2050 [1]. In Sub-Saharan Africa (SSA), cities are expanding horizontally, demonstrating spatial patterns of urban sprawl [2]. Sprawl can be defined as growth due to the emergence of new low-density suburbs with (semi-)detached housing [3]. A lack of city management and zoning is often regarded as the cause of urban sprawl in SSA. In turn this leads to a lack of compactness and service efficiency in the affected areas [3,4]. This has a significant impact on the livelihoods of residents both in the inner city and the greater metropolitan area.
There have been several efforts to identify spatial patterns of (socioeconomic) residential differences in developing countries using remotely sensed imagery. Some recent examples are presented in Table 1. All studies consider the plot size or the associated housing density [2,8,14,22–25]. The size [14,22,24,25] and height [8,25] of the residential buildings and garden space [8,14,24] are popular indicators as well. Gardens include vegetated areas used for urban agricultural activities, a land use type common in SSA [26]. Several studies [2,22,24] rely on a pixel-by-pixel manual classification which, although usually highly accurate, is tedious and subject to visual interpretation. When classifying urban residential areas, a drawback of advanced remote sensing methods using high-resolution imagery (e.g., [8,23,25]) is that such methods are less popular within the social sciences community [27]. Especially in the context of developing cities, the cost of high-resolution imagery can be a limiting factor for local institutions [22]. Even though remote sensing methods are geographically adequate for city-wide mapping of residential typologies, they mask underlying social contexts.
Traditionally, socioeconomic data and remotely sensed classifications have been analyzed separately and on different spatial scales. Economic segregation is obvious in many cities when taking dwelling exterior and neighborhood layout into account, e.g., via remote sensing of residential land use (Table 1). This strategy adopts the simple principle that wealthier households reside in larger homes on bigger plots [2,14,22,25]. However, differences between households go beyond these externally visible characteristics. For instance, a dissimilarity may occur between the location and home exterior of newly migrated families compared to those who are settled, even if their income is similar. More established or educated households have had the time to gain location-specific knowledge [28] and adapt (e.g., new roofing on current house, relocate to larger plot) and/or enlarge their dwelling exterior over the years. Thus, the discrepancy between the residential exterior and the household living inside parallels the distinction between land cover and land use [12].
Hence, a knowledge gap exists regarding the socioeconomic dynamics underlying the observed differences in residential land use in developing cities. To find answers to the issues associated with segregation in rapidly growing SSA cities, we must consider the full set of socioeconomic parameters. The objective of this study is to develop an intuitive method to upscale socioeconomic survey information through a remote sensing approach. This way, we aim to obtain a better understanding of the socioeconomic layout in SSA cities, taking Kampala (Uganda) as a case study. The research questions that will be answered are:
• Which socioeconomic groups are present in the city?
• How can household surveys be upscaled using remote sensing to locate where socioeconomic groups are residing in the greater metropolitan area?
The hypothesis of this study is that socioeconomic household characteristics can be important predictors of residential choice [14,25]. For this reason, socioeconomic segregation needs to be analyzed based on a combination of: (i) detailed socioeconomic information, via spatially distributed household surveys, and (ii) neighborhood and dwelling characteristics, via remote sensing [12]. Through the proposed methodology, we test the often-made assumption that visible patterns of segregation on remotely sensed imagery reflect the socioeconomic groups living there [25]. This research shows the complementarity in city-wide analyses of different levels of spatial information such as census data, household surveys and remotely sensed classifications [29]. By mapping socioeconomic dynamics in the city, we hope to provide information to urban planners and policy-makers who are dealing with the difficulties of equitably adapting to these dynamics and the associated problems [7].

Study Area: The Greater Kampala Metropolitan Area
Aiming at analyzing socioeconomic dynamics, the Greater Kampala Metropolitan Area (GKMA) in Uganda has been selected as a case study (Figure 1). The study area is positioned at the northern shore of Lake Victoria and encompasses an area of about 1026 km². The smallest administrative unit (SAU) for which census data are available in Kampala is the parish. The GKMA as presented in Figure 1, comprises 171 parishes. Kampala is a representative case for many SSA cities, because it is characterized by a recent and very rapid population increase of over 5% per year, resulting in social segregation and an inefficient city layout [2,30]. In 2015, the inner city was inhabited by over 1.9 million people, which is more than two-fold the population in 1995 [1]. Recent population estimates for the entire GKMA depend on how the area is demarcated, and range from 3.13 million [31] to over 4 million [32]. Due to the rapid horizontal expansion of the capital, the urban agglomeration of Kampala now includes former satellite towns such as Mukono and Entebbe [33] (Figure 1). The exponential growth of Kampala's population is likely to continue in the future for two reasons. First, many rural dwellers are attracted to employment opportunities in the capital city and decide to migrate [34,35]. Second, with Uganda's total fertility rate at 5.91 children per woman [36], natural population growth will add to the population increase of the city. Kampala is situated in a hilly area. Wealthy inhabitants generally choose to build their homes on hilltops while slums have developed in low-lying, flood prone wetlands [2,7,9]. The informal inhabitants of these wetland areas often depend on urban farming practices for their livelihood. It is often assumed these agricultural activities are a remnant of the "rural lifestyle" new migrants to the city used to have [37].
Even though Kampala's inhabitants and their livelihoods have been studied from multiple perspectives, household typologies have traditionally focused on income and ownership, as well as the involvement in urban agricultural activities [2,14,37]. In this paper, we argue that while these factors are imperative, it is essential to include a broad spectrum of socioeconomic variables to create household typologies.

Household Surveys
To obtain a better understanding of socioeconomic patterns and their location characteristics, 541 households were interviewed based on a convenience sample in the GKMA in 2019. Information was gathered on a total of 2487 individuals within the households. A mixed team of six interviewers from KU Leuven and Makerere University carried out the surveys. The households were approached at their homes in 15 contrasting parishes (the SAU) of the GKMA (Figure 1). We aimed to survey households in SAUs that contrast both in socioeconomic dynamics and in geographic location within the GKMA. For more information on the sampling strategy, please refer to Appendix B. Informed verbal consent of the interviewee was registered, after which a 50-min survey was conducted. Survey responses and point locations were collected digitally on a mobile phone device or tablet, using the Open Data Kit Collect application (version 1.24.1). The present study uses three subsections of the full household survey: household characteristics, neighborhood characteristics, and income and ownership. The survey protocol was approved on 19 June 2019 (approval number G-2019 06 1664) by the KU Leuven Social and Societal Ethics Committee (SMEC).

Socioeconomic Survey Data Clustering
To convert the sampled household survey data into socioeconomic representations for the whole population of the GKMA, a disproportional upscaling method is applied. Households are clustered into distinct, more homogeneous groups rather than considering the sampled averages representative for all inhabitants [38]. This clustering of the data was carried out in R (version 3.6.2) using the k-prototypes algorithm, as this was developed specifically for clustering of mixed numerical and categorical data [39]. The "clustMixType" package for R [40] was used to create socioeconomic clusters (SEC) based on the survey data. Since this clustering tool does not tolerate missing values, empty values were imputed through a multiple imputation method using the "missForest" package for R [41]. Sixteen surveys were excluded from clustering, as these were inadequately sampled. These cases were not systematically different from the retained cases in demographics. Appendix C contains a detailed missing data analysis, concluding that data are missing at random. This way, 525 households were included in the k-prototypes clustering. In total, 71 socioeconomic variables (summarized in Table 2) were included that can be linked to residential choice in Kampala [14], and, thus, the dwelling exterior. These variables can be subdivided into three collections: household characteristics, neighborhood characteristics and variables related to income and ownership [13]. In the absence of detailed census information, four clusters were chosen because this enables comparison and validation with recent research on the socioeconomic segregation of Kampala [2,14,16]. The output of the clustering method consequently assigns a SEC to each of the interviewed households. To avoid the result of a small, non-representative cluster (e.g., of a wealthy elite), the k-prototypes method parameters were calibrated to produce four similarly sized clusters.
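To make the clustering step concrete, the following is a minimal Python sketch of the k-prototypes idea. It is not the clustMixType implementation used in this study. It assumes that numeric variables are already standardized, that categorical mismatches carry a fixed weight gamma, and that the starting prototypes are given as row indices. The function names, the toy households and the value of gamma are illustrative assumptions.

```python
from collections import Counter


def mixed_distance(row, proto, num_idx, cat_idx, gamma):
    """Squared Euclidean distance on numeric variables plus gamma times
    the number of mismatching categorical variables."""
    numeric = sum((row[i] - proto[i]) ** 2 for i in num_idx)
    mismatches = sum(1 for i in cat_idx if row[i] != proto[i])
    return numeric + gamma * mismatches


def k_prototypes(rows, init, num_idx, cat_idx, gamma=1.0, max_iter=50):
    """Toy k-prototypes; `init` lists the row indices used as starting prototypes."""
    prototypes = [list(rows[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each household joins its nearest prototype.
        new_labels = [
            min(range(len(prototypes)),
                key=lambda c: mixed_distance(row, prototypes[c], num_idx, cat_idx, gamma))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: mean for numeric, mode for categorical variables.
        for c, proto in enumerate(prototypes):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if not members:
                continue  # keep the old prototype for an empty cluster
            for i in num_idx:
                proto[i] = sum(m[i] for m in members) / len(members)
            for i in cat_idx:
                proto[i] = Counter(m[i] for m in members).most_common(1)[0][0]
    return labels, prototypes


# Invented households: (income z-score, years-in-city z-score,
# owns smartphone, engages in urban agriculture).
households = [
    (1.2, 1.0, "yes", "yes"),
    (1.0, 1.3, "yes", "yes"),
    (0.9, 0.8, "yes", "no"),
    (-1.0, -1.1, "no", "no"),
    (-1.2, -0.9, "no", "no"),
    (-0.8, -1.2, "no", "yes"),
]
labels, prototypes = k_prototypes(households, init=[0, 3],
                                  num_idx=[0, 1], cat_idx=[2, 3])
print(labels)
```

In practice the weight gamma and the initialization strongly influence cluster sizes, which is one plausible place where calibration towards similarly sized clusters could happen.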
A Welch two-sample t-test is carried out in R for each cluster compared to the rest of the dataset to visualize cluster differences. Categorical variables were transformed to numerical for t-value calculations: e.g., yes (1), no (0); or agree (1), neutral (0.5) and disagree (0).
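As an illustration of this comparison, the sketch below computes the Welch t statistic and its approximate degrees of freedom for one cluster against the rest of the sample, after coding a categorical variable numerically as described above. The study ran this test in R; the Python function name and the response values here are invented for the example.

```python
import math
from statistics import mean, variance

# Numeric coding of a three-level agreement variable, as described in the text.
LIKERT = {"agree": 1.0, "neutral": 0.5, "disagree": 0.0}


def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Invented responses for one cluster and for all remaining households.
cluster = [LIKERT[a] for a in ["agree", "agree", "neutral", "agree"]]
rest = [LIKERT[a] for a in ["disagree", "neutral", "disagree", "neutral", "disagree"]]

t_value, dof = welch_t(cluster, rest)
print(round(t_value, 3), round(dof, 2))
```

A positive t-value here means the cluster scores higher on the coded variable than the rest of the dataset.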

Remote Sensing Classification of Residential BUA
A remote sensing classification was carried out for the entire study area to upscale the socioeconomic data gathered at household level. Therefore, unlike conventional remote sensing analyses, we aimed at distinguishing four land use classes within the residential built-up area (BUA): villa housing, large housing, small housing and slum housing (Figure 2). The villa housing is characterized by large, clearly demarcated plots of over 1500 m² on a regular road grid. The villas have a BUA of over 250 m². All villa housing plots have a garden with vegetation cover, and some have a swimming pool on site. This housing typology may be confused with some of the larger hotels and resorts in touristic areas. The large housing consists of residences of about 150 m² with a semi-regular street layout. These homes usually have a garden, with average plot sizes over 400 m². The small housing residential class consists of smaller dwellings of under 100 m². Nonetheless, these residences still have access to a small garden area with plots of about 200 m². The slum housing typology can be considered to be non-permanent structures of about 50 m² or less, with a highly irregular street layout. Slum housing rarely has access to a nearby garden. These four residential BUA classes respectively correspond to housing typology types A, B, C and D as recently defined by [14]. Alongside these four categories, we defined "industrial", "water" and "other". The latter mainly comprises (non-garden) vegetation including grass, forest, and swamps.
We conducted a comparative study of various classification methods that are readily available in popular GIS software packages. The main criterion in this comparative study was the classifier accuracy for the residential built-up classes. Examined methods include principal components analysis, ISO cluster unsupervised classification, and maximum likelihood supervised classification. The satellite imagery should be appropriate for the desired purpose and study area. Although higher-resolution satellite imagery typically results in better classifications, the image resolution should fit the average plot size in the study area [12], to include the possible influence of garden vegetation and swimming pools. Open access imagery is recommended for analyses targeting urban development issues. In addition, the time at which the image was taken should closely correspond to the time at which the household surveys and most recent national census were carried out. For these reasons, all methods were tested on recent Sentinel-2 and Landsat-8 imagery. The pixel-based, maximum likelihood (ML) supervised classifier performed most satisfactorily for the purposes of this research. The ML method is based on Bayesian probability theory, using the variance and covariance data of the training pixel signatures to estimate the probability that a pixel belongs to a certain class [42]. The prior probability used corresponds to the number of training pixels selected for each class (sample probability). ML is a relatively simple and well-known method in remote sensing, and was applied to a Landsat-8 image in ArcGIS (version 10.7.1). The Landsat-8 satellite image covering the GKMA (path 171, row 60) was chosen based on its recent date (February 2020), minimal cloud cover (<3%) and suitable resolution (30 m). In addition, the Landsat-8 images are openly available and atmospherically corrected via USGS.
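The decision rule behind the ML classifier can be sketched in a few lines. The Python example below is a simplified illustration, not the ArcGIS tool used in this study. It assumes only two spectral bands so that the covariance matrix can be inverted by hand, it sets the priors proportional to the number of training pixels per class as described above, and all reflectance values are invented.

```python
import math


def fit_class(pixels):
    """Mean vector and 2x2 covariance (s00, s01, s11) of two-band training pixels."""
    n = len(pixels)
    m0 = sum(p[0] for p in pixels) / n
    m1 = sum(p[1] for p in pixels) / n
    s00 = sum((p[0] - m0) ** 2 for p in pixels) / (n - 1)
    s11 = sum((p[1] - m1) ** 2 for p in pixels) / (n - 1)
    s01 = sum((p[0] - m0) * (p[1] - m1) for p in pixels) / (n - 1)
    return (m0, m1), (s00, s01, s11)


def log_discriminant(pixel, class_mean, cov, prior):
    """Log of prior times Gaussian likelihood, dropping the constant term."""
    s00, s01, s11 = cov
    det = s00 * s11 - s01 ** 2
    d0, d1 = pixel[0] - class_mean[0], pixel[1] - class_mean[1]
    # Mahalanobis distance using the closed-form inverse of a 2x2 matrix.
    maha = (s11 * d0 * d0 - 2 * s01 * d0 * d1 + s00 * d1 * d1) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha


def classify(pixel, training):
    """Assign the class with the highest discriminant value."""
    total = sum(len(pixels) for pixels in training.values())
    scores = {}
    for name, pixels in training.items():
        class_mean, cov = fit_class(pixels)
        prior = len(pixels) / total  # sample probability
        scores[name] = log_discriminant(pixel, class_mean, cov, prior)
    return max(scores, key=scores.get)


# Invented two-band training signatures for two of the housing classes.
training = {
    "slum": [(0.30, 0.20), (0.32, 0.22), (0.28, 0.21), (0.31, 0.18)],
    "villa": [(0.15, 0.40), (0.18, 0.45), (0.12, 0.42), (0.16, 0.38), (0.14, 0.44)],
}
print(classify((0.29, 0.20), training))
print(classify((0.15, 0.41), training))
```

With real multispectral imagery the same rule applies with a full band-by-band covariance matrix per class.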
The imagery used for visual selection of training and validation pixels is Maxar imagery of 0.3 m resolution, available via Google Earth or the ArcGIS imagery base layer (Figure 2). In addition, Google Streetview and field observations were consulted to ensure the quality of the training and validation data. The training data are small polygons of approximately 85 pixels each, for which the land cover class is known. Figure 2 shows how typical variations in residential BUA can be detected and selected as training or validation areas with the aid of the Maxar high-resolution imagery. Adhering to the guidelines by [42], 1829, 4894, 4307, and 3905 training pixels were selected on the Landsat-8 image for the slum, small, large, and villa residential housing classes, respectively. For validation, 120 pixels per land use class outside of the training sites were selected. Based on the validation data, we generated an error matrix. Classification accuracy was assessed using the percentage correctly classified (PCC) and the Kappa Index of Agreement (KIA) [43]. In addition, we generated a confidence raster showing the spatial distribution of the ML classification certainty.
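Both accuracy measures follow directly from the error matrix. The sketch below computes the PCC as the share of counts on the diagonal and the KIA as Cohen's kappa, which corrects the observed agreement for the agreement expected by chance from the row and column totals. The matrix values are invented and the function name is an assumption for this example.

```python
def pcc_and_kappa(matrix):
    """Percentage correctly classified and kappa from a square error matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement expected from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Invented two-class error matrix (rows: classified, columns: validation).
example = [
    [40, 10],
    [20, 30],
]
pcc, kia = pcc_and_kappa(example)
print(round(pcc, 3), round(kia, 3))
```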

Upscaling Socioeconomic Clustered Data Using the Remote Sensing Classification
Finally, a fuzzy error matrix (Table 3) was generated to evaluate to what extent the ML classification correlates to the locations and SEC of the surveyed households. In a traditional error matrix, classified pixels are compared to validation data on the diagonal only. However, the land use classes described in this study are not clear-cut. The spectral signatures of these classes will not be clearly distinguishable as these are all built-up, residential land use. Therefore, for certain residential groups, we consider one adjacent class to be "correctly classified" or (potentially) matching in the fuzzy matrix [43]. For example, without any knowledge of their dwelling exterior, a household in a high income cluster would likely reside in either a villa (i = 1) or in large housing (i = 2). In this case, we assume more established households have more location-specific knowledge [28] and have had the time to adapt or enlarge their home exterior. Depending on the geographical context, the proposed (potentially) matching pixels in the fuzzy error matrix in Table 3 can be adapted to represent the level of visual housing segregation.
Table 3. Schematic representation of the fuzzy error matrix comparing clustered survey findings and remotely sensed classification (adapted from [43]). *: Matching pixels, **: Potentially matching pixels.
i \ j                 1        2         3        k         Row total (n_i+)
1                     n_11*    n_12      n_13     n_1k      n_1+
2                     n_21*    n_22*     n_23**   n_2k      n_2+
3                     n_31     n_32*     n_33*    n_3k**    n_3+
k                     n_k1     n_k2**    n_k3*    n_kk*     n_k+
Column total (n_+j)   n_+1     n_+2      n_+3     n_+k      n
The UBOS [31] national census population per parish (density shown in Figure 1) was merged with an OpenStreetMap spatial layer to geographically display the population at parish level for the entire GKMA. This parish-level population number is then reclassified based on the error matrix to display what percentage of the parish population belongs to each SEC.
Depending on the case study and the number of available survey locations, it can be decided to reclassify the population at the SAU using either the entire error matrix, or only those cells considered to be (potentially) matching. In this study, all matching or potentially matching values were considered for this reclassification. Wealthier households in villas reside on larger plots of land, and therefore have a lower population density. Hence, the average plot sizes for each residential housing type in Kampala as described in Section 2.4 (Remote Sensing Classification of Residential BUA) were used for conversion. Figure 3 summarizes the workflow for this study.
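The difference between strict and fuzzy agreement in this section can be sketched as follows. The example marks each cell of a four-by-four matrix as matching (*), potentially matching (**) or neither, following the schematic layout of Table 3, and then computes the share of survey points that fall in accepted cells. The counts are invented and the function name is an assumption; this is an illustration of the principle rather than the exact procedure of the study.

```python
STRICT = "*"
FUZZY = "**"

# Cell marks following the schematic pattern of Table 3.
marks = [
    [STRICT, "",     "",     ""],
    [STRICT, STRICT, FUZZY,  ""],
    ["",     STRICT, STRICT, FUZZY],
    ["",     FUZZY,  STRICT, STRICT],
]

# Invented counts of survey points per cell.
counts = [
    [10,  2,  1,  0],
    [ 8, 12,  5,  1],
    [ 3,  9, 14,  6],
    [ 1,  4, 11, 13],
]


def agreement(counts, marks, accepted):
    """Share of all points that fall in cells whose mark is in `accepted`."""
    total = sum(sum(row) for row in counts)
    hits = sum(
        count
        for count_row, mark_row in zip(counts, marks)
        for count, mark in zip(count_row, mark_row)
        if mark in accepted
    )
    return hits / total


strict_pcc = agreement(counts, marks, {STRICT})
fuzzy_pcc = agreement(counts, marks, {STRICT, FUZZY})
print(strict_pcc, fuzzy_pcc)
```

Changing the `marks` table is all that is needed to adapt the notion of a (potentially) matching cell to another geographical context.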

Socioeconomic Clustering
Using the k-prototypes method, 525 surveyed households in Kampala are subdivided into four distinct SEC, characterized by income ("high", "middle", "low") and embeddedness in Kampala ("established" versus "newcomers"). Figure 4 displays the t-values for a selection of 10 variables (out of a total of 71) which capture the household typology. Higher t-values are linked to a higher standard of living, because the inverse of "undesirable" variables (flooding prevalence, distance to nearest water source) is shown. The first group consists of 143 households, the "established high" (EH). They are the most affluent inhabitants of the GKMA with a median monthly income per person of 207,000 UGX. EH households are the largest of the dataset, with an average of 6.4 persons. Of the EH households, 64% indicate they belong to the majority tribe in Kampala, the Muganda. These households are well-established within the urban dynamic: most spent over 15 years in Kampala and 78% engage in urban agriculture for (part of) their livelihood. They live in neighborhoods with a good reputation and low flooding risk. Most households in this first group are highly educated at tertiary level, which is reflected in relatively long commuting times to their place of employment. Most (89%) own a smartphone, and nearly everyone in this group has water tapped inside their home.
The second group can be defined as the "established low" (EL) class, containing 97 households. This group is characterized by a low median income per person of 87,500 UGX and they have resided in Kampala for a long time (median 19 years) with relatively large household sizes. Of the EL households, 77% are Muganda. They live in neighborhoods with an average reputation and flooding prevalence. This group is educated at lower secondary level and has poor access to technology (30% own a smartphone). They have access to fresh water either inside their home or at less than 20 m distance. For their livelihood, this group lives rather near their workplace and the majority (72%) engages in urban agriculture.
With a median income of 131,000 UGX, the third SEC of 111 households is referred to as "newcomers middle" (NM). This SEC entails mostly new migrants, with the majority having resided in Kampala for less than 10 years. This is reflected by their limited engagement in urban agriculture (28%). NM households are educated at higher secondary level, and nearly all (95%) own a smartphone. Nonetheless, despite their middle-range income, they live in neighborhoods with a poor reputation. This suggests that due to their recent migration to Kampala, their choice of place of residence was limited [14]. These families are comparatively small (median of 4 persons), and usually do not have a water tap inside their residence. The majority tribe is underrepresented in this group with only 32% of households.
Lastly, the "newcomers low" (NL), a cluster with 174 households, are similar to the third group with some exceptions. Most (60%) are of the majority Muganda tribe. Their monthly household income is low, with a median of 99,500 UGX, while their households are similar in size to the NM. The individuals do not commute far (median 22 minutes), suggesting they live on temporary employment daily. In contrast with the NM, the NL are usually educated at primary or lower secondary level and only 24% own a smartphone.

Residential Land Use Classification
Figure 5a shows the result of the ML classification, distinguishing four land use classes within the residential BUA: villas, large housing, small housing, and slums. The emerging pattern shows the slum housing is chiefly located around the central business district (CBD). Although some clusters of villas exist on the hilltops in the city, most are located on the outskirts or towards the touristic areas in Entebbe. In many locations, a gradient is visible where a cluster of slum housing is found adjacent to small residences, which in turn neighbor areas with large housing, which are located next to villas. The suburban areas were classified with the highest confidence. Some smaller parishes within the inner city have a low mean confidence (<0.5), which can be attributed to the presence of undefined land use classes (currently "other") such as parks and infrastructure. The mean confidence for the entire study area is 0.7. Furthermore, the ML result was validated with 120 control points per class and resulted in a PCC of 61.5% and a KIA of 0.575, i.e., when taking into account only the diagonal of the classic error matrix (Appendix A). The slum housing classification performed the best, while the classification of the large, small, and villa residential classes is more ambiguous. The relatively low overall PCC and KIA values are largely due to this fuzziness in the middle-to-high income residential BUA. This is not surprising considering that with large gardens and swimming pools, the villa housing class has the largest spectral heterogeneity.
The correlation between the SEC resulting from the survey data and the output of the ML classification is displayed as a fuzzy error matrix in Table 4. Values indicated with * (PCC 52.7%; KIA 0.377) can be considered to be matching based on income group and "newcomer" or "established" status. When the locations considered to be potentially matching (**) are also included, the correlation accuracy goes up to a PCC of 72.8% and a KIA of 0.594. Producing the same fuzzy error matrix for the income variable only, results in lower accuracies with a PCC 47.5% and KIA 0.328 for the matching values. This further validates this approach, showing that a broad spectrum of socioeconomic variables shows a stronger upscaling potential than income only.

Socioeconomic Population Maps
The matching and potentially matching values in Table 4 were used to reclassify the parish population in the GKMA according to these proportions. In other words, we upscale the household survey findings to represent a portion of the census population number at each SAU. We assume the population density is different for each housing type as described in Section 2.4 (Remote Sensing Classification of Residential BUA): small homes are located on plots four times as large as slums, large housing plots are eight times as large as slums, and villa plots are 30 times larger than slums. Figure 6 shows the resulting percentage of the parish population taken up by each SEC. The EH is the smallest group: they never make up more than 25% of the total population of a parish. They live in the direct outskirts of Kampala city, as well as on some of the hillier areas within the inner city. The touristic areas around Entebbe and the newly constructed motorway are also classified as having relatively many (>20%) EH inhabitants, though this could be due to the presence of large hotels. The EL, on the other hand, are spatially very well spread out over the GKMA, with about one fourth of the population of each parish being part of this SEC. The NM are also scattered around Kampala and its surroundings, but make up a slightly larger share of the parish population on the edges of the GKMA, along with the EL. The NL are present everywhere as well, but are most highly concentrated (>45%) in the densely inhabited inner city of Kampala.
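One way to read this conversion is sketched below in Python. The exact procedure is not spelled out in the text, so the sketch rests on two loudly stated assumptions: first, that the population living in a housing class is proportional to its classified pixel count divided by the relative plot size (1, 4, 8 and 30 for slum, small, large and villa housing); second, that the share of each SEC within a housing class is taken from the (potentially) matching cells of the fuzzy error matrix. All pixel counts, shares and the parish population are invented.

```python
# Relative plot sizes stated in the text (slum housing = 1).
RELATIVE_PLOT_SIZE = {"slum": 1, "small": 4, "large": 8, "villa": 30}


def population_by_housing(pixel_counts, parish_population):
    """Split a parish population over housing classes, assuming population is
    proportional to pixel count divided by relative plot size."""
    weights = {h: n / RELATIVE_PLOT_SIZE[h] for h, n in pixel_counts.items()}
    total = sum(weights.values())
    return {h: parish_population * w / total for h, w in weights.items()}


def population_by_sec(housing_population, sec_share):
    """Distribute each housing-class population over the SEC using the
    (assumed) SEC shares per housing class."""
    result = {}
    for housing, population in housing_population.items():
        for sec, share in sec_share[housing].items():
            result[sec] = result.get(sec, 0.0) + population * share
    return result


# Invented inputs for one parish.
pixel_counts = {"slum": 100, "small": 400, "large": 400, "villa": 300}
sec_share = {
    "slum": {"NL": 0.7, "NM": 0.3},
    "small": {"NL": 0.5, "NM": 0.3, "EL": 0.2},
    "large": {"EL": 0.5, "EH": 0.5},
    "villa": {"EH": 1.0},
}

housing_pop = population_by_housing(pixel_counts, parish_population=26000)
sec_pop = population_by_sec(housing_pop, sec_share)
print(housing_pop)
print(sec_pop)
```

Although villas cover three times as many pixels as slums in this invented parish, they receive only a tenth of the slum population, which reflects the lower density of large plots.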
Most parishes in the densely inhabited inner city of Kampala are thus inhabited predominantly by NL, with an approximately even share of EL and NM population. This implies that either the city of Kampala is not as socioeconomically segregated as hypothesized, or that this segregation occurs at the sub-parish scale. Figure 7 shows how the ML classifier performs at sub-parish scale for two adjacent parishes in central Kampala: Nsambya Central and Makindye I. By visual comparison, the classifier performs well. For instance, larger residences with gardens, visible on the Maxar imagery throughout Nsambya Central and in the southern areas of Makindye I, are correctly detected as villa housing. Combined, 100 georeferenced surveys were carried out in these parishes. Table 5 shows how these survey points correlate with the ML classifier. According to this fuzzy error matrix, the classification performs best for small housing or slum-type residential areas, as these are the dominant land use in the city center.

Which Socioeconomic Groups Are Present in the City?
The household survey responses were clustered into four socioeconomic groups (Figure 4): the "established high" (EH), "established low" (EL), "newcomers middle" (NM) and "newcomers low" (NL). The Agent-based simulation of Social Segregation and Urban Expansion (ASSURE) [2,16] also defines SEC in Kampala, though its focus is mainly on income. This section discusses the similarities and differences between the outcomes of both studies. The EH in this study are similar to the "rich" group in ASSURE in terms of income, livelihood, and household size. The EL group is equivalent to the "poor" group. Although their income is low, the EL are rooted within the urban atmosphere of Kampala. Compared to the clustering in [2,16], the NM group is the most similar to the "middle income" respondents. A large part of the income of NM might be sent to other areas as remittances, explaining their relatively high income but low standards of living [45]. Nonetheless, as information on remittances was not included in the household survey, this hypothesis cannot be confirmed. The NL can be compared to the "extreme poor" cluster in [2], due to their low income and low standard of living.
Next, we upscaled the socioeconomic clustering to the entire GKMA at the level of the SAU (parish) in Figure 6. To validate this approach, we visually compare the result in Figure 6 with the socioeconomic segregation map by [2] (p. 2385). This comparison shows a clear parallel: the poor and extreme poor SEC are the largest in most parishes, while the wealthier urban residents tend to locate on hilltops in the inner city, or in well-connected suburban areas [2,14]. Some confusion is present between the locations of the NM ("middle") and NL ("extreme poor"). The NL are concentrated mainly in the inner city. This is not surprising, as other research in SSA cities shows that new migrants who can afford it tend to migrate towards the city center first, in search of employment. However, because Kampala is a polycentric city, newly migrated households (of all income groups) are observed to be spatially spread out, thus not displaying any discernible "spaces of arrival" [14,46]. The research project by [2,16] gathered data between 2010 and 2013, which in a rapidly growing city such as Kampala means these dissimilarities could be attributed to changing social dynamics. However, as their clusters are defined somewhat differently, and they do not specify how exactly they estimate the locations of these socioeconomic groups, this comparison should be interpreted with caution. As the results for Kampala closely correspond to preceding research in the GKMA, the rapid horizontal expansion of the city is likely to continue via the same socioeconomic dynamics in the future.

How Can Household Surveys Be Upscaled Using Remote Sensing to Locate Where Socioeconomic Groups Are Residing in the Greater Metropolitan Area?
The socioeconomic typologies found by clustering the household survey responses were upscaled to the entire metropolitan population of Kampala by directly comparing their locations to a ML classification of Landsat-8 imagery. The workflow is depicted in Figure 3. This comparison was made by means of a fuzzy error matrix, where only the (potentially) matching pixels were used for upscaling. To do so, an estimate is needed for the amount of space (e.g., plot areas) taken up by each housing type. The results in the fuzzy error matrix (Table 4) confirm our hypothesis, indicating that household income is not the only predictor of housing infrastructure in the GKMA [14]. The newcomer groups show the strongest correlation with the small housing and slum housing locations. Oddly, the EH are occasionally located within areas classified as slum housing. Nonetheless, the limited number of household locations that coincide with larger homes (7.3%) or villas (5.3%) makes it challenging to judge the upscaling method for the EH. To an extent, the fuzzy error matrix in Table 4 validates the hypothesis that remotely sensed patterns of segregation are a reflection of the SEC residing there. However, we conclude that socioeconomic household characteristics are much less spatially clustered in the GKMA than the housing typologies. Our findings therefore confirm the results of previous studies in Kampala [14,16].
The added value of the method presented in this study is in its combination of two relatively well-known and straightforward methods. In both the socioeconomic clustering of household surveys and the residential land use classification, it is clear that segregation in Kampala occurs at sub-parish level (Figure 7). When urban land use classifications are compared with socioeconomic data points, this usually occurs at a spatially aggregated level, e.g., the SAU. This is because the available socioeconomic information is often limited to census data, which cannot be disclosed at household level [12]. In the context of urban SSA, detailed socioeconomic census data are often absent at the scale of the SAU. When dynamics of segregation take place at a finer spatial resolution than the SAU, it is therefore favorable to apply a method of pixel-based validation.
Clustering analyses of survey data are common in both social sciences [38] and urban systems modelling studies [2]. Likewise, remotely sensed population base maps can be used for urban planning and resource allocation [47]. Combining these methods, however, should be intuitive and straightforward, so that its application is justified in many fields. Most GIS software packages provide uncomplicated tools to compare point values with underlying raster cell values, which can be used to create a fuzzy error matrix. We propose this method so that modelling efforts need not rely only on estimations [2] or interpolations at local level [14] for their socioeconomic population base maps. An additional advantage is that upscaling household surveys to wider study areas would make socioeconomic population base maps more readily available, which removes the need to share extensive datasets containing sensitive information.
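The point-to-raster comparison described above can be illustrated with a minimal sketch. It assumes that each surveyed household has already been assigned a cluster label and the housing class of the pixel beneath its location. The example records, the cluster abbreviations, and the sets of housing classes accepted as a potential match are hypothetical and serve only to illustrate the idea; they are not the rules or the results of this study.

```python
from collections import Counter

# Hypothetical example data: (socioeconomic cluster, housing class of the
# raster pixel under the household location). Not taken from the survey.
households = [
    ("NL", "slum"),
    ("NL", "small"),
    ("NM", "small"),
    ("EH", "villa"),
    ("EH", "slum"),
    ("EL", "small"),
]

# Assumed fuzzy rule: housing classes accepted as a potential match for
# each cluster. These sets are illustrative, not the ones used in the paper.
ACCEPTED = {
    "NL": {"slum", "small"},
    "NM": {"small", "large"},
    "EL": {"slum", "small"},
    "EH": {"large", "villa"},
}


def cross_tabulate(pairs):
    """Count households per (cluster, housing class) combination."""
    return Counter(pairs)


def fuzzy_agreement(pairs, accepted):
    """Share of households per cluster whose pixel class is an accepted match."""
    totals, matches = Counter(), Counter()
    for cluster, housing in pairs:
        totals[cluster] += 1
        if housing in accepted.get(cluster, set()):
            matches[cluster] += 1
    return {cluster: matches[cluster] / totals[cluster] for cluster in totals}


matrix = cross_tabulate(households)
agreement = fuzzy_agreement(households, ACCEPTED)
```

In such a sketch, only the household locations that fall on an accepted housing class would be carried forward to the upscaling step.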

Limitations
There are limitations to the presented research. We use quantitative data to demonstrate relationships between many variables, yet this does not explain possible causalities. Further research will be required to confirm or discard propositions as to why these relationships are found. Additionally, the SEC were estimated based on a convenience sample of households, and therefore the results of the cluster analysis should not be interpreted as generalizable to other contexts. Hence, future studies should replicate the cluster analysis with a larger probability sample.
This study would benefit from more geographically spread out household survey locations, as this would increase the accuracy of the ML comparison. Preferably, the geographical reach of the survey sample would cover the entire study area. Another spatial constraint of this study is that household survey sites were included as point locations, rather than as polygons, as done by [48]. As with a larger number of household surveys, including the property areas of sampled households could have resulted in a larger number of validation pixels. However, this was indirectly dealt with, as Landsat-8 imagery of 30 m resolution (1 pixel = 900 m²) was used, which more than covers the area of 331 m² taken up by an average household in the GKMA [2].
As mentioned, different remote sensing classification methods or household survey clustering methods can be combined, which might result in different socioeconomic population base maps. A drawback of the proposed method is that there are errors associated with both the remotely sensed land use classification, and with the socioeconomic clustering of surveys. Combining both methods implies that there is an inevitable error propagation in the resulting socioeconomic layout of the city. With careful selection of training pixels and adequate spatial survey sampling these errors can be minimized. Nonetheless, the output should be interpreted as an overall pattern of the urban socioeconomic geography rather than as an accurate measure of segregation.

Conclusions
This paper proposes an intuitive methodology to directly compare and combine household-level socioeconomic survey data with a remotely sensed classification of built-up residential areas. We demonstrate that a combination of survey and remotely sensed data is more powerful than either of these approaches in isolation. Although we applied a k-prototypes clustering method to the survey responses and a maximum likelihood classifier to Landsat-8 satellite imagery, future work could evaluate a different combination of techniques to further validate this approach. Upscaling the survey findings to the GKMA suggests socioeconomic segregation in the city occurs at sub-parish level. In the inner city, the largest group are the "newcomers low", while the share of the parish population belonging to the "established low" or "newcomers middle" clusters is often roughly equal. This stresses the need for residential land use classifications that are validated with survey information at spatial resolutions finer than the SAU. The results for Kampala correspond to previous research carried out in the area, suggesting the rapid horizontal expansion of the city is continuing via the same socioeconomic dynamics. Unless policy action follows these insights, a business-as-usual scenario is therefore likely for Kampala in future analysis or modelling efforts.

where:
• n is the sample size (541 households, with 2487 individuals).
• p is the population proportion (assumed at 0.5 for complete uncertainty).
• Z is the Z-score (1.96 for a confidence interval of 95%).
• e is the error margin (1.97%).
Within the SAU, households were selected using a snowball strategy where a local council representative, after giving their informed consent, led the interviewers to households and assisted in explaining the purpose of the study. For this reason, households within a selected SAU all needed to be within walking distance from each other. The final sample size was 541 households. Information was gathered on a total of 2487 individuals within the households. This sample size calculation method is based on similar approaches used for matching land use with socioeconomic factors [50] and food security [51].
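As a worked check of the quantities listed above, the following sketch assumes the standard sample size relation for a proportion, n = Z² p (1 − p) / e², rearranged to e = Z √(p (1 − p) / n). This relation is an assumption on our part, chosen because it uses the same four symbols. Under that assumption, the stated error margin of 1.97% appears to correspond to the 2487 individuals rather than the 541 households. The function names are illustrative.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Error margin e for a proportion p, sample size n and Z-score z."""
    return z * math.sqrt(p * (1 - p) / n)


def required_sample_size(e, p=0.5, z=1.96):
    """Smallest n that achieves error margin e (assumed standard formula)."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


# Values taken from the text: 2487 individuals in 541 households.
moe_individuals = margin_of_error(2487)
moe_households = margin_of_error(541)

print(f"individuals: {moe_individuals:.2%}")
print(f"households:  {moe_households:.2%}")
```

With the same assumed relation, the household count alone would give an error margin of roughly 4.2%.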

Appendix C
Missing data were assessed using base functions and the missForest [41] and irr [52] packages in R statistical software. Overall, 3.04% of observations were missing from the dataset used for cluster analysis. Missing data were imputed through a multiple imputation method. All variables used in the clustering analysis were used to impute missing data. A nonparametric random forests multiple imputation method, suitable for mixed numeric and categorical data, was implemented in the R package missForest [41]. The dataset reached the stopping criterion after 6 iterations. The imputed values showed high coherence with a low normalized root mean squared error for continuous variables (NRMSE = 0.51) and a low proportion of falsely classified categorical variables (PFC = 0.12).
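For readers unfamiliar with the two imputation error measures, the sketch below shows how they are commonly defined: the NRMSE as the root of the mean squared imputation error divided by the variance of the true values, and the PFC as the share of categorical entries imputed with the wrong category. The data are hypothetical, the population variance is used for simplicity, and the exact computation inside missForest may differ in detail.

```python
import math


def nrmse(true_values, imputed_values):
    """Normalized root mean squared error for continuous entries."""
    n = len(true_values)
    mean_true = sum(true_values) / n
    mse = sum((t - i) ** 2 for t, i in zip(true_values, imputed_values)) / n
    # Population variance of the true values (an assumption of this sketch).
    variance = sum((t - mean_true) ** 2 for t in true_values) / n
    return math.sqrt(mse / variance)


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified categorical entries."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Hypothetical values, not taken from the survey.
continuous_error = nrmse([10, 20, 30, 40], [12, 18, 33, 39])
categorical_error = pfc(["car", "bus", "walk", "bus"], ["car", "bus", "bus", "bus"])
```

Both measures equal 0 for a perfect imputation, and larger values indicate poorer agreement.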
Missing data from participants were assessed to decide whether each participant had been adequately sampled. Missing data analysis was performed as part of a larger data cleaning procedure, in which logic tests were applied to assess the reliability of participant responses (for example, if household-level information did not match household roster data). The total proportion of missing data per participant was then calculated to include both non-responses and responses deemed ineligible due to testing logic.
Participants had an average of 4.25% missing data (SD = 10.91%, range = 0-95%). A threshold of 37% missing data per participant was selected for further investigation. This threshold was chosen to minimize the number of participants that would be excluded, while balancing adequate sampling per participant (see Figure A1).

Figure A1. Number of participants that would be excluded at each missing data threshold. Note change point at 37% missing data.
Sixteen participants (2.96% of the total sample of 541 participants) had more than 37% missing data. These cases were examined to discover whether anything unusual had happened during the recruitment and/or survey administration. On examination it appeared that these participants had abandoned the survey partway, perhaps due to its length. Since these participants had systematically answered the initial survey questions and not the later ones, we concluded that these cases were inadequately sampled. These cases were therefore removed from the dataset. We do not expect that removing these cases should have any effect on the substantive results of this study for two reasons. First, they represent a small percentage of the overall sample (< 3%). Second, these cases were not systematically different from sampled cases in demographics: missing data per participant was not significantly predicted by language, F (28, 507) = 1.31, p = 0.138, income, F (1, 434) = 0.43, p = 0.514, or household size, F (1, 539) = 0.59, p = 0.441. Therefore, removing these cases would not bias the parameters estimated from the data in the analysis, and should have no effect on the validity of the study results.
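The exclusion rule can be sketched as follows, assuming each participant's record is a list of answers in which a missing or ineligible response is coded as None. The records and the additional thresholds are hypothetical; only the 37% threshold and the counts of 16 and 541 come from the text.

```python
def missing_share(responses):
    """Proportion of items in one participant's record that are missing."""
    return sum(value is None for value in responses) / len(responses)


def excluded_at(shares, threshold):
    """Number of participants whose missing share exceeds the threshold."""
    return sum(share > threshold for share in shares)


# Hypothetical records with 0%, 25%, 50% and 75% missing data.
records = [
    [1, 2, 3, 4],
    [1, None, 3, 4],
    [1, None, None, 4],
    [1, None, None, None],
]
shares = [missing_share(record) for record in records]

# Number of exclusions at several candidate thresholds, as plotted in a
# threshold curve; 0.37 is the threshold reported in the text.
counts = {t: excluded_at(shares, t) for t in (0.10, 0.37, 0.60)}

# Share of the sample removed in the study: 16 of 541 participants.
share_removed = 16 / 541
```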
Missing data from variables were assessed to decide whether each variable had been adequately assessed across the sample. Variables had an average of 3.08% missing data (SD = 3.60%, range = 0-19.41%). The highest percentages of missing data were found for average commuting time (12.01%) and monthly income (19.41%). A series of separate variance t-tests did not reveal any systematic patterns between the missing data in these variables and the values of other variables in the dataset related to wealth or travel method, including ownership of a car, van, or motorcycle (min. p = 0.110, p = 0.695, and p = 0.747, respectively). Results therefore suggest that these data were missing at random.
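A separate variance t-test of this kind compares the values of one variable between the cases where another variable is missing and the cases where it is observed, without assuming equal variances in the two groups. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library; the p-value, which requires the t distribution, is left out. The two groups are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Separate variance (Welch) t statistic and degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n).
    se_a = variance(group_a) / len(group_a)
    se_b = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical household sizes for cases with observed and missing income.
income_observed = [4, 5, 6, 5]
income_missing = [6, 7, 5]
t_stat, dof = welch_t(income_observed, income_missing)
```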
Based on the results of the separate variance t-tests, and the low overall proportion of missing data, we concluded that after removing the cases with >37% missing data, there should be no relationship between the probability of missing data and the expected value of that missing data. Therefore, we conclude that the data are missing at random; that is, the treatment of missing values should not present a problem for the validity of the resulting analysis estimates.