The Human Influence Experiment (Part 2): Guidelines for Improved Mapping of Local Climate Zones Using a Supervised Classification

Since 2012, Local Climate Zones (LCZs) have been used in numerous studies related to the urban environment. In 2015, this use intensified because a method to map urban areas into LCZs was introduced by the World Urban Database and Access Portal Tools (WUDAPT). However, in 2017, the first HUMan INfluence EXperiment (HUMINEX 1.0) showed that these maps are often of poor to moderate quality. Since the maps are used in applications such as urban modelling and land use/land cover change studies, it is of the utmost importance to improve mapping accuracies, and a second experiment was therefore launched. HUMINEX 2.0 focuses on providing guidelines for the use of the mapping protocol, based on the results of both HUMINEX 1.0 and 2.0. The results showed that: (1) it is important to follow the mapping protocol as strictly as possible, (2) a reasonable amount of time should be spent on the mapping procedure, (3) all users should perform a driving test, and (4) training area sets should be stored in the WUDAPT database for other users.


Introduction
Mapping of urban areas in relation to their urban climate is gaining interest in the global research community [1]. Even though most urban climate models require a detailed description of the urban environment, to date no global urban classification scheme useful for urban climate studies exists. In 2012, the Local Climate Zone (LCZ) scheme was introduced by Stewart and Oke [2]. This scheme consists of 17 zones, divided into ten built and seven natural land cover types (Figure 1). These zones portray a unique air temperature regime at screen height under similar atmospheric conditions [3] and could thus serve as a global urban classification scheme. Before 2015, the LCZ scheme was mainly used as a conceptual framework to evaluate in-situ measurement sites related to urban heat island research [2]. In 2015, a method [4,5] was presented within the World Urban Database and Access Portal Tools (WUDAPT) initiative to classify major urban areas into spatially explicit LCZ maps and to gather information on the internal structure and texture of cities [4,6,7]. The data gathering process is organized into a hierarchy based on the level of detail. The default WUDAPT level-0 fundamentally relies on supervised classification of Landsat satellite scenes into LCZ types, based on training areas (TAs) created by urban experts who identify parts of the urban landscape that exemplify each type present in a city [4]. Thus, WUDAPT is an example of crowd-sourcing geographic information, also referred to as volunteered geographic information [8] and citizen science, among other terms related to user-generated content [9].

Figure 1. LCZ types (after Stewart and Oke [2]; text shortened, icons reworked). B: buildings; C: cover; M: materials; F: function; Tall: >10 stories; Mid-rise: 3-9 stories; Low: 1-3 stories.
At the 10th International Conference on Urban Climate in 2018, it became clear that LCZ maps are now regarded as a global reference for urban land cover descriptions [10]. Since LCZ maps are intended for and already used in a range of different applications, such as climate models at various scales [11][12][13][14][15][16][17][18][19][20], land use change investigations [6], or the characterization of several hundred crowd-sourced citizen weather stations in terms of their local-scale surroundings [21], there is a clear need for highly accurate classification results [22]. However, it is unclear what determines the final quality of an LCZ map derived with the WUDAPT level-0 methodology. Hence, in 2017, the Human Influence Experiment (HUMINEX) was introduced to investigate the variability in the quality of LCZ maps produced by different individuals using the WUDAPT methodology [23]. It aimed at identifying how large the discrepancies between individual LCZ maps can be for a given city or region. In HUMINEX 1.0, about 120 students from six universities classified a total of twelve cities. The experiment provided several relevant insights, namely that some specific LCZ classes can be easily identified and that the iterative scheme set up in the WUDAPT methodology is justified [23]. However, the results of HUMINEX 1.0 also clearly highlighted that LCZ maps are often of overall poor to moderate quality. The best quality of LCZ maps for different cities was observed when multiple training sets from different participants were combined, thus indicating a certain "wisdom of the crowd" [23]. Besides these findings, some deficiencies in the original setup of HUMINEX 1.0 caused problems for the metadata analysis, and a number of questions remained unanswered. HUMINEX was thus continued in a second phase, called HUMINEX 2.0.
While during HUMINEX 1.0 the participating institutions carried out their own introduction to the topic for their individual courses, HUMINEX 2.0 included a standardized introduction to the topic across participating institutions, along with improved course materials distributed to the participants (http://www.wudapt.org/huminex-2-0/). Moreover, in HUMINEX 1.0 most participants lived in the city that they classified and hence multiple cities were classified. HUMINEX 2.0 focused on only one city: Berlin, Germany. This city and its surrounding area were selected for this case study due to the variety of LCZ classes present, ranging from natural landscapes to densely built-up urban areas. Additionally, many participants in HUMINEX 1.0 had never carried out a supervised classification before and were unfamiliar with the LCZ scheme, possibly contributing to the poor or moderate quality of the LCZ maps [23]. To investigate whether such a deficiency could be overcome, a 'driving test' for LCZ classification with aerial imagery was developed and introduced to half of the participants in HUMINEX 2.0. Finally, Van Coillie et al. [24] found that for remote sensing image interpretation, operator performance is mainly determined by demographic, non-cognitive and cognitive personality factors, and less by external and technical factors. Hence, in HUMINEX 2.0 we aimed to test whether similar results hold and whether individuals' psychological traits influence their interpretation and classification of the landscape.
The present study presents the results of the second phase of HUMINEX, which mainly aimed at overcoming the limitations of HUMINEX 1.0. Moreover, based on the results of both phases, this study provides guidelines for operators of the WUDAPT level-0 methodology to obtain LCZ maps of high quality. Specifically, we focus on the following research questions:
1. Can the quality of LCZ training areas be assessed from operator self-assessment or from the training areas themselves?
2. Does previous knowledge of LCZs, gained through the driving test, help to classify LCZs correctly?
3. How much does the personality of the operator influence the classification quality?

Layout of the Experiment
The LCZ workflow [2,4,23] was provided online and consisted of a set of training materials that were used in guided student exercises. First, the students (in the remainder of the paper also referred to as participants) were introduced to the LCZ scheme and the WUDAPT framework [4]. Subsequently, they were provided with the software and the workflow of the exercise.
Each participant defined a TA set for Berlin according to the protocol developed by WUDAPT, i.e., "to be of a size of approximately 1 km²; to be as homogeneous as possible; to be compact in shape; and to have sufficient space along the borders with neighbouring LCZ areas" [4]. In addition, in the first round, the TA set of each LCZ class had to include at least five to ten TA polygons in order to cover the internal variation within the different zones (e.g., for an urban LCZ class, the internal variation due to different roof colours/materials). The experiment was set up as a joint effort of several universities that offered the online exercises to their students as part of a geographic information course. All participants were provided with the same training materials (SAGA GIS software, website, and papers), which included the LCZ mapping workflow as described in [23], and were asked to perform an LCZ classification with at least three iterations.
In addition to the TA sets and LCZ maps, elaborate metadata were collected from each participant in the second phase of the experiment using an online questionnaire. Table 1 provides an overview of the collected metadata, ranging from basic information (e.g., age and gender) to questions relating to human behaviour and personality. LCZ- and city-specific knowledge was also surveyed, as were details on TA collection and LCZ classification.
The five principal factors of human personality are often referred to as the Big Five: agreeableness (two questions), conscientiousness (12 questions), emotional stability (neuroticism) (12 questions), extraversion (12 questions) and openness (two questions). The participants were asked to indicate on a Likert scale how much they relate to the personality questions [24].

• Agreeableness is the willingness to help other people and act in accordance with other people's interests, and the degree of co-operative, warm and agreeable traits in an individual.
• Conscientiousness can be described as the preference to follow rules and schedules, keep engagements, work hard and organize.
• Participants who are emotionally stable are characterized as relaxed and independent, calm, self-confident and self-restrained.
• Extraversion defines the need for human contact, empathy, assertiveness and the wish to inspire people.
• Openness measures the degree to which a participant needs intellectual stimulation, change and variety.
In addition, in HUMINEX 2.0, participants were asked some self-assessment questions, covering their assessment of the final LCZ map, their knowledge of the city being mapped, and their image classification experience. A so-called "driving test", performed by 50% of the participants, was also introduced. This is a freely available online tool, which can be consulted at http://77.69.20.19/dev/driver/training.php, that provides a dynamic interface for an operator to get familiar with the LCZ scheme before digitizing. After the exercise, a self-reflection questionnaire was presented to each participant to evaluate their dedication to the experiment. Dedication can be divided into motivation and comparative anxiety. Motivation is defined as the internal and external factors that stimulate desire and energy in people to be continually interested in and committed to a job, role or subject, or to make an effort to attain a goal. Comparative anxiety, on the other hand, refers to the confidence participants have in their own abilities and performance, and how much they are concerned with the performance of others.

Participants and Study Sites
In total, 141 students from six universities and one independent contributor participated in HUMINEX 2.0, but only 81 managed to provide images in the correct format. Only 59 performed three or more iterations and filled out the questionnaire (Table 3). The analysis was thus performed on this selection of participants. Of these 59 participants, six did not provide their years of study, their self-rated competence, or their gender. The other 53 saw themselves as competent, advanced beginner or novice participants (2, 19 and 31, respectively). Most of the participants were thus inexperienced in image classification. Only five of the participants lived in Berlin, and 43 felt less than 26% familiar with the city. In total, 88% of the participants had never done an LCZ classification before, and 32 felt they had less than 25% knowledge of the LCZ scheme at the time of the classification. Seventeen felt familiar for 25-50%, nine for 50-75%, and only one felt 80% knowledgeable about the LCZ scheme. For all of the above, 100% equals perfect familiarity/knowledge and 0% equals no familiarity/knowledge.

Analysis and Accuracy Assessment
For all research questions, the accuracy of the resulting LCZ maps was assessed using a sample of reference areas previously identified by an LCZ expert familiar with the methodology and the city under study [21,23]. For each map, the following two standard accuracy measures were derived (see also Verdonck et al. [22], Bechtel et al. [23]): overall accuracy (OA = percentage of correctly classified pixels); and the F1-score, which represents the arithmetic mean of the class-wise F1 values, which are calculated as the weighted harmonic mean of the user's (UA) and producer's accuracy (PA). The class-wise F1-score for class i takes a value between 0 and 1 and is calculated as [25]:

$$F1_i = \frac{2 \cdot UA_i \cdot PA_i}{UA_i + PA_i} \qquad (1)$$

A statistical t-test was performed to evaluate whether a significant difference could be found between the different iterations, between the groups of each personality trait, between the participants who did and did not do the driving test, and between the groups regarding time investment. The significance threshold was set at 0.05.
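The two accuracy measures described above can be sketched from a confusion matrix as follows. This is a minimal illustration, not the code used in the experiment: the function name `accuracy_measures`, the matrix layout (rows = reference class, columns = mapped class) and the example numbers are assumptions, and UA and PA are weighted equally in the harmonic mean.

```python
def accuracy_measures(confusion):
    """Return (OA, class-wise F1 list, mean F1) for a square confusion matrix.

    Assumed layout: confusion[r][c] = number of reference pixels of class r
    that were mapped as class c.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total if total else 0.0  # overall accuracy as a fraction

    f1_per_class = []
    for i in range(n):
        row_sum = sum(confusion[i])                       # reference pixels of class i
        col_sum = sum(confusion[r][i] for r in range(n))  # pixels mapped as class i
        pa = confusion[i][i] / row_sum if row_sum else 0.0  # producer's accuracy
        ua = confusion[i][i] / col_sum if col_sum else 0.0  # user's accuracy
        # Harmonic mean of UA and PA with equal weights (assumption).
        f1 = 2 * ua * pa / (ua + pa) if (ua + pa) else 0.0
        f1_per_class.append(f1)

    mean_f1 = sum(f1_per_class) / n  # arithmetic mean of class-wise F1 values
    return oa, f1_per_class, mean_f1


# Hypothetical two-class example.
example = [[8, 2],
           [1, 9]]
oa, f1s, mean_f1 = accuracy_measures(example)
print(round(oa, 3), [round(f, 3) for f in f1s], round(mean_f1, 3))
```

Classes that are absent from both the reference data and the map get an F1 of 0 in this sketch, which lowers the mean; whether such classes should instead be excluded is a design choice the text does not specify.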

Self-Assessment
This experiment investigated the ability of the participants to correctly assess the quality of their maps based on visual interpretation. Figure 2a shows that, on average, participants tend to underestimate the accuracy of their LCZ map in the first two iterations and to overestimate their final mapping result. Figure 2b portrays the average difference between self-estimation and OA for each iteration, together with the number of participants who over- or underestimated their mapping results. Consistent with Figure 2a, the results show that for the first two iterations most participants underestimated the results, whereas for the last iteration 56% of the participants overestimated the final mapping result.
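The comparison between self-estimated and actual accuracy can be expressed in a few lines. The sketch below is illustrative only: the function name `self_assessment_bias` and the example values are hypothetical, and both inputs are assumed to be percentages for the same participants in the same iteration.

```python
def self_assessment_bias(self_estimates, overall_accuracies):
    """Return (mean difference, number of under-estimators, number of over-estimators).

    A positive difference means the participant overestimated the map accuracy.
    """
    diffs = [se - oa for se, oa in zip(self_estimates, overall_accuracies)]
    under = sum(1 for d in diffs if d < 0)
    over = sum(1 for d in diffs if d > 0)
    mean_diff = sum(diffs) / len(diffs)
    return mean_diff, under, over


# Hypothetical values for three participants in one iteration (in %).
mean_diff, under, over = self_assessment_bias([40, 60, 55], [50, 50, 50])
print(mean_diff, under, over)
```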

Information from the Training Areas
Built zones have, on average, smaller TA surface areas than natural zones (Figure 3), since the latter are characterized by a higher degree of homogeneity. The surface areas of the TAs provide information on how well a class is represented in a city. When a zone is underrepresented in a city, it may be possible to find some small training areas, but the classifier is unlikely to pick up on the zone because of the limited amount of information about it. The final LCZ map will contain such zones, but their accuracy will be low. Since LCZ maps provide information on the local climate, it is important that a local climate can be established in the zones, and they should thus be of a certain size (>1 km²). When zones are smaller and embedded in other zones, it is often better to remove the zone from the TA set. The average TA area for each zone can therefore indicate underrepresented or non-existent LCZs in a city.
In addition to the surface area of TAs, the number of TAs selected for each zone can be an indicator of zones that are hard to classify. Table 4 lists the mean, minimum and maximum number of TAs for each zone. The number of times a zone was not selected (NS) by a participant is also listed.
From this table and Figure 3, it becomes clear that when the number of TAs for a specific zone is low, the representativeness of these TAs may be low, leading to lower accuracies. This is relevant for users: inexperienced participants often spent a lot of time searching for representative TAs for all LCZs, even when some of the zones are not large enough or occur too sparsely in the city to form an LCZ. In addition, the WUDAPT method suggests digitizing compact and simple TAs. This translates into TAs that are characterized by a shape ratio close to one and a low number of vertices. The shape ratio is calculated from the ratio between surface area and perimeter (Equation (2)), with a circle (shape = 1) being the most compact shape. Table 4 shows that, on average, the built zones have more compact TAs than the natural zones.
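To make the notion of compactness concrete, the sketch below computes a shape ratio for a TA polygon. The exact form of Equation (2) is not reproduced in this text, so the Polsby-Popper ratio 4πA/P² is used here as an assumption: it is built from area and perimeter and equals 1 for a circle, which matches the description. The function names and the example polygons are hypothetical, and vertices are assumed to be in a projected (metric) coordinate system.

```python
import math


def polygon_area_perimeter(vertices):
    """Area (shoelace formula) and perimeter of a simple polygon.

    vertices: list of (x, y) tuples in order, without repeating the first point.
    """
    twice_area = 0.0
    perimeter = 0.0
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter


def shape_ratio(vertices):
    """Polsby-Popper compactness (assumed stand-in for Equation (2)): 1 for a circle."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 4.0 * math.pi * area / perimeter ** 2


# Two hypothetical TAs of 1 km² each (coordinates in km).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
strip = [(0, 0), (4, 0), (4, 0.25), (0, 0.25)]
print(round(shape_ratio(square), 3), round(shape_ratio(strip), 3))
```

Both example polygons have the same area, but the elongated strip scores much lower than the square, which is the behaviour expected of a compactness measure for TAs.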

Driving Test
HUMINEX 2.0 also focused on the influence of the driving test: do participants perform better after classifying a number of test images? Of the 59 remaining participants, 31 performed the driving test. All participants were free to choose the number of test images to classify. The number of classified images varied widely, from 20 to 147, with a median/mean of 50/58 images. The results in Figure 4 indicate that participants who did the driving test perform better than those who did not. The improvement over the iterations is smaller for participants who carried out the driving test, but their maps are of higher quality in every iteration. Overall, the overall accuracy increased over the iterations for all participants, regardless of the test. Most importantly, the t-test showed a significant difference in OA after the first iteration between participants who did and did not carry out the driving test.
Figure 5 shows scatter plots of the number of test images classified against the OA. From these plots it is clear that the number of test images does not influence the OA, and participants should thus be free to choose the number of test images.

Dedication
Based on all the answers in Table 2, a weighted score was calculated for each dedication trait, and three equally large groups were defined for each of the two traits based on the scores of all participants. Since some scores occurred in two different groups, the groups were reclassified and thresholds were set [24]. Group 1 contains the participants who show low agreement with a certain trait, and group 3 contains those who show high agreement with the respective dedication trait (Table 5).
Figure 6a suggests that motivation did not have an important influence on the mapping process: participants who had low motivation performed best after three iterations. Participants who did not suffer from comparative anxiety performed better in all three iterations than participants who felt pressure to perform well (Figure 6b). This suggests that participants who suffer from high levels of comparative anxiety achieve the lowest map accuracies. It should be noted that no significant differences were found between groups and iterations. However, due to the small remaining sample of participants (59), it is not clear whether dedication had no influence or whether more data are necessary to find statistically significant results.
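The grouping step can be sketched as below. This is one possible reading of the procedure, not the authors' implementation: the sample is split into thirds by rank, the scores at the two cut positions are taken as thresholds, and every participant is then assigned by threshold so that identical scores never end up in different groups. The function names and the example scores are hypothetical.

```python
def tertile_thresholds(scores):
    """Return the two score thresholds at the one-third and two-thirds ranks."""
    if len(scores) < 3:
        raise ValueError("need at least three scores")
    ordered = sorted(scores)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def assign_groups(scores):
    """Assign group 1 (low), 2 (medium) or 3 (high agreement) to each score."""
    lower, upper = tertile_thresholds(scores)
    groups = []
    for s in scores:
        if s < lower:
            groups.append(1)
        elif s < upper:
            groups.append(2)
        else:
            groups.append(3)
    return groups


# Hypothetical trait scores with several tied values.
scores = [1, 2, 3, 3, 3, 3, 4, 5, 6]
groups = assign_groups(scores)
print(groups)
```

With tied scores the groups are no longer exactly equal in size, and a group can even be empty when both thresholds coincide. The text describes the same trade-off, with reclassified groups of different sizes (Table 5).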

Personality
The participants were also questioned about their personality (Table 2). Again, all participants were divided into three groups according to their answers (Table 5). As with the results on dedication, the results related to personality are not significantly different, again probably due to the small sample size.
Figure 7 shows the results of the personality analysis. The trends clearly show that overall accuracy rose with each iteration. For neuroticism, participants showing medium characteristics on this personality trait had maps with the highest overall accuracies. For both conscientiousness and extraversion, the trends show that participants scoring low on these traits had the highest overall accuracies. In Figure 8, the blue values below the boxplots show the number of times participants indicated, over all iterations, that a zone was difficult to recognize. Especially for the natural zones, the results show a clear relation between the difficulty of recognition and the classification accuracy, e.g., zones A, D and G always resulted in high accuracies. For the built zones this link is not as pronounced. The best classified zones are LCZ 2, 6 and 8. It is, however, clear that zones that are not present in large enough areas were recognized as difficult, e.g., LCZ 1, 3 and 7. LCZ 9 was one of the most difficult zones to classify.
In addition, Table 4 shows that zones that had low F1 scores according to Figure 8 (LCZs 1, 3, 7 and 10) are characterized by small TAs on average.

Time Investment
In a final step, all participants were asked to report the time they spent on each iteration.The results (summed time investment for all iterations) are evaluated and presented in Figure 9. Figure 9 shows that a medium time investment is the most beneficial for the accuracy results.For our study medium time investment is defined between 240 (4 h) and 330 (5 h and 10 min) minutes in total.

Discussion
The second experiment showed that it is not straightforward to deduce improved guidelines from the metadata and the training areas. As was shown in the results, no significant relation could be found between personality or dedication traits and mapping accuracy. This could mean that any individual, independent of their background, could properly map Local Climate Zones after some training (e.g., through multiple iterations or by using a 'driving test' prior to the mapping). However, the small size of the HUMINEX 2.0 sample does not allow such conclusions to be drawn. Experiments with more participants should be performed to obtain convincing results. Still, the results indicate that pressuring participants decreases their ability to produce accurate maps. Encouragingly, the driving test results support the recommendation that all WUDAPT contributors should take the test in order to improve their capacity to locate and recognize representative TAs, and hence their mapping accuracies. Self-assessment of the intermediate and final mapping results indicated that participants became better at assessing the quality of their maps after multiple iterations, even though the overall accuracy for all maps remained rather low. The most interesting results from the second experiment are related to the input of the participants on difficult classes and the information embedded within the training areas. If an LCZ is not present or not sufficiently large, participants often indicated this correctly. A similar conclusion can be drawn for the surface area and the number of TAs: when LCZs are not present in sufficiently large surface areas, it is generally harder for participants to find TAs in large numbers or with an adequate surface area. This will become clear after the first iteration.
For future research it is important to take the limitations of this experiment into account. First, as discussed in the methods, less than 50% of the participants delivered useful data for HUMINEX 2.0. In this respect, it is of utmost importance to improve communication in this type of experiment; otherwise significant results are difficult to come by. Second, due to the fact that less than 10% of

Figure 2. (a) Boxplots of self-estimated (SE) and actual (OA) overall accuracy for each iteration. Median OA: red stripe; average OA: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range of the OA values; outliers: grey dots. (b) Difference between self-estimated and actual overall accuracy. Numbers beneath the bars indicate the number of participants who under- or overestimated the mapping accuracy, respectively.

Figure 3. Number of TA sets according to surface area (km²).

Figure 4. Boxplots of overall accuracies for each iteration depending on driving test. Median OA: red stripe; average OA: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range of the OA values; outliers: grey dots.

Figure 5. Number of test images versus overall accuracy.

Figure 6. Boxplots of overall accuracies depending on iteration and group for (a) motivation and (b) comparative anxiety. The x-axis ticks are formatted as X_Y, with X the iteration number and Y the group number. Median OA: red stripe; average OA: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range of the OA values; outliers: grey dots.

Figure 7. Boxplots of overall accuracies depending on iteration and group for extraversion, neuroticism and conscientiousness. The x-axis ticks are formatted as X_Y, with X the iteration number and Y the group number. Median OA: red stripe; average OA: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range of the OA values; outliers: grey dots.

Difficulties According to the Participants
After each iteration, participants were asked to indicate which classes were difficult to recognize on the Google Earth images. In Figure 8, F1 scores are shown in boxplots for all LCZs present in Berlin for the last iteration.

Figure 8. Boxplots of the F1 score for each LCZ after the last iteration; blue values = number of times a participant indicated this zone as difficult to classify over all iterations. Median: red stripe; average: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range; outliers: grey dots. The colours of the boxplots are the LCZ colours used when the zones are mapped.

Figure 9. Overall accuracy based on overall time investment. Median OA: red stripe; average OA: white dot; boxplot ends: first and third quartile; whiskers: +/− 1.5-fold the interquartile range of the OA values; outliers: grey dots.

Table 1. Metadata collected from the participants. The allowed answers are provided in brackets (after Bechtel et al. [23]).
Participant: Number of participants per training area set; Highest degree (B.Sc./M.Sc./Ph.D.); Total years of study (Number of years); University course; Experience with image classification (Self-estimation); Age; Gender; City of origin
LCZ knowledge: Introduction in seminar/course (Yes/No); WUDAPT website visit (Yes/No); Study of Stewart and Oke 2012 paper (Yes/No); Study of LCZ fact sheets (Yes/No); Completion of LCZ driving test (Yes/No); Number of cities classified before (Number of cities); LCZ knowledge self-estimation (0-100%)

Table 2. Dedication- and personality-related questions in the HUMINEX questionnaire.
I expect to be among the people who score really well in this exercise; My scores usually do not reflect my true abilities; I very much dislike doing exercises of this type; During the exercise, I found myself thinking of the consequence of failing; During the exercise, I got so nervous I couldn't do as well as I should have.
I see myself as dependable, self-disciplined; I see myself as disorganized, careless.
Agreeableness: I see myself as critical, quarrelsome; I see myself as sympathetic, warm.
Openness: I see myself as open to new experience, complex; I see myself as conventional, uncreative.

Table 3. Participants and cities in the HUMan INfluence EXperiment.

Table 4. TA characteristics for area, number, shape and node count of TAs. NS = not selected by a participant.

Table 5. Group sizes and thresholds for each personality and dedication trait.