Quantifying the Characteristics of the Local Urban Environment through Geotagged Flickr Photographs and Image Recognition

Urban environments play a crucial role in the design, planning, and management of cities. Recently, as the urban population has expanded, the ways in which humans interact with their surroundings have evolved, presenting a distribution that changes locally and frequently in space and time. Therefore, how to better understand the local urban environment and differentiate varying preferences for urban areas has been a major challenge for policymakers. This study leverages geotagged Flickr photographs to quantify the characteristics of varying urban areas and explore the dynamics of areas where more people assemble. An advanced image recognition model is used to extract features from large numbers of images in Inner London within the period 2013–2015. After the integration of characteristics, a series of visualisation techniques is utilised to explore the characteristic differences and their dynamics. We find that urban areas with higher population densities cover more iconic landmarks and leisure zones, while others are more related to daily life scenes. The dynamic results demonstrate that season determines human preferences for travel modes and activity modes. Our study expands the previous literature on the integration of image recognition methods and urban perception analytics and provides new insights for stakeholders, who can use these findings as vital evidence for decision making.


Introduction
Urban environments play a crucial role in decision making in terms of the design, planning, and management of cities, which are closely linked with urban functions and their ecosystems. From a social perspective, understanding how humans experience these environments is important for improving urban functions. For example, areas with a large population density and exposure require more attention and in-depth strategies. In recent years, as the urban population has expanded, the ways in which humans interact with their surroundings have evolved [1]. The distribution of the population has changed over space and time, locally and frequently.
Traditional approaches to understanding the urban environment have relied on survey data. These approaches can be used to characterise urban morphology, but they can generate gaps in data collection and data quality that are costly and problematic [2]. Although recently emerging street-level imagery data can overcome these gaps, these data are mostly from Google's own street view fleets, which rarely capture human perceptions of the urban environment. Therefore, challenges remain for policymakers to plan and manage urban environments. In the past few decades, improvements in location technology, such as the global positioning system (GPS), have produced plenty of georeferenced urban data

Previous Studies on Geotagged Images from Social Media
In earlier research, geotagged images from photo sharing social media websites like Flickr, Instagram, and Picasa have been widely utilised to address a series of urban issues. Previous research includes proving the utility of Flickr data in mapping the urban environment [6,15], analysing user behaviour [16,17], facilitating event detection [7,18,19], travel route recommendations [20,21], places/areas of interest identification [9,22,23], and cultural ecosystem analysis [24]. However, certain information in geotagged photographs is currently underused, such as the content of photographs that were taken in urban areas. The density of photographs can only reflect the popularity of a place or an area but cannot demonstrate the reasons behind those patterns. It is thus necessary to understand if the photographs are relevant to the built environment and what aspects of the city are of greatest interest to people in a specific area [25]. Many studies have used the "tags" attribute of photographs to estimate public interest or capture large-scale events [6,7,18,19]. However, these studies have ignored the key attribute of geotagged Flickr photographs (i.e., the photographs themselves). Furthermore, these tags may not be related to the photographs themselves due to their heterogeneity [26], while some users add no tags at all.

Image Recognition and Urban Analytics
Due to the great improvements to computer vision and deep learning techniques in recent years, a growing number of works have attempted to apply image recognition techniques to understand urban environments, mostly relying on Google Street View (GSV) images. Some harnessed GSV images to measure the perception of safety, class, and uniqueness, thus creating reproducible quantitative measures of urban perceptions and characterising the inequality of different cities [27]. Law and his colleagues combined GSV images with 3D-models generated from the GSV images and used a CNN to classify the street frontages of a front-facing street image in Greater London [28]. Similarly, ref. [29] exploited GSV images to predict the visual quality of the urban environment by comparing ratings based on a survey to train an image classification ConvNet model to predict a façade's quality scale. Some studies have combined GSV images with other imagery datasets to extract parcel features for urban land use classification [11,30]. Naik and his colleagues used an image segmentation approach and support vector regression to monitor neighbourhood changes and correlate socioeconomic characteristics to uncover predictors for the improvement of physical appearance [10]. More recent research developed a deep CNN model, a hierarchical urban forest index, to quantify the amount of vegetation visible based on street-level imagery [2].
However, GSV is not the only image source that can be used to explore the urban environment. Alternatives have also appeared in recent urban studies. For example, images from Flickr, the most prevalent online photograph sharing website, were proven to be usable by [31,32] for land cover classification and validation. Flickr was also exploited in the work of [33], who developed a novel framework for ecosystem service assessment using Google Cloud Vision and hierarchical clustering to analyse the contents of Flickr photographs automatically. Apart from Flickr, "Place Pulse 1.0", a crowdsourced image dataset created by [27], was used to predict the human judgement of a streetscape's safety [34]. The results showed that geotagged imagery combined with neural networks can be used to quantify urban perceptions at a global scale. Another novel image dataset, "Scenic-or-not", an online game that crowdsources ratings of the beauty of geotagged outdoor images, was used to quantify the beauty of outdoor places in the UK through Places365-CNN models [35].
All of these studies demonstrate that geotagged images, in collaboration with image recognition techniques in computer vision, can enable a deeper understanding of our built environments. Meanwhile, a variety of challenges have emerged in these applications. Most studies are based on the global urban environment, while finer urban areas are rarely involved. More importantly, few efforts have associated image recognition with urban change [10,36]. Nevertheless, urban dynamics play an important role in understanding cities, especially for the perceived urban spaces that reflect human interactions with the built environment. Therefore, this study will bridge this research gap to quantify the characteristics of local urban built environments (i.e., UAOIs in this paper) and explore their dynamic patterns.

Recent Approaches to Image Recognition
Over the past decade, image recognition techniques have improved considerably. Some of the most notable techniques include image classification, object detection, and image segmentation. Image classification refers to labelling a photograph based on its content from a fixed set of categories [37]. Image classification gained significant attention when the "AlexNet" model won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012), a breakthrough that reduced the top-five classification error rate to 15.3% [38]. ILSVRC is an annual contest that aims to automatically estimate the content of photographs from a subset of a large hand-labelled ImageNet dataset (1000 object categories for training and validation). Since then, an increasing number of pre-trained Convolutional Neural Network (CNN) architectures/models have been proposed for the contest, such as GoogLeNet, ResNet-152, Inception-v4, etc., which have constantly improved the accuracy of image classification [39,40,41]. Several studies in recent years have used image classification to resolve empirical problems, for example, to retrain a pre-trained architecture on one's own image dataset for prediction [28,29] or to extract features from images through a pre-trained model [32,33,35]. By manually labelling data or using ready-made training data, an image can be identified by a single attribute/label or by multiple features.
More sophisticated techniques include object detection and image segmentation. Compared to image classification, these two methods are able to recognise and locate multiple objects from an image. The former method identifies different sub-images, drawing a bounding box around a recognised object, while the latter partitions an image into objects or parts present with accurate boundaries [42,43]. Recent approaches that have gained wide popularity include Faster R-CNN (Region Convolutional Neural Network) [44] and YOLO (You Only Look Once) [45] for object detection and Mask R-CNN [41] for image segmentation. Unlike image classification tasks that primarily use the ImageNet dataset for training, most object detection and image segmentation tasks are trained on COCO (Common Objects in Context). COCO is a large-scale image dataset, with 80 categories used for object detection and segmentation [46]. These categories mainly include everyday objects, such as vehicles, people, and a few animals. These data have been widely applied in pose estimation [47], medical imaging [48], real-time video surveillance [49], etc. [10].
Considering the suitability and availability of these approaches, a recently introduced and scene-related image classification model, Places365 CNN [50], is used in our study. Compared to other pre-trained CNN models, Places365 CNN corresponds to our motivation to identify scene attributes from a built environment, while other object detection or segmentation models are related to office furniture, vehicles, and animals. More importantly, this model is freely available and well documented [50] but has been rarely used in previous urban analytics [35].

Methods
In the following section, we introduce the Flickr data, study area, and UAOI extraction and subsequently characterise the features of the UAOIs and the outer areas through an image classification model. In addition, a finer time dimension is included to further explore the dynamic characteristics of UAOIs.

Data and UAOI Extraction
Data were collected from Greater London, as Greater London is the capital of, and the largest city in, the United Kingdom, with a population of over 8 million, according to the latest 2011 census. Furthermore, the raw data show that Greater London has a larger volume of geotagged Flickr photographs than many other cities. In particular, Inner London [51], the interior part of Greater London, is used for characterisation, as a large volume of Flickr photographs are available from Inner London over a variety of years. Figure 1 demonstrates the spatial density of the photographs in Inner London and Greater London visualised by kernel density estimation (KDE) [52].
Flickr is an online photograph management and sharing website, where public photographs uploaded by users can be requested and downloaded from its public application programming interface (API, https://www.flickr.com/services/api/). The scale of Flickr is extensive, with 122 million users and over 10 billion photographs as of 2016, with a large degree of penetration [53]. Unlike commonly used geotagged GSV images that are not real-time [54], Flickr image data are accessible at any time and have been available since 2004, making it feasible to investigate the dynamic characteristics of UAOIs in a finer time dimension [9].
Furthermore, the locations of Flickr images result from human choices and are a representation of human interactions with the built environment. However, photographs are captured in a biased way, as the aspects of the urban environment rely on how populations interact with that environment. As such, the representation of Flickr images is skewed and not necessarily realistic. This warrants caution when drawing conclusions. Nevertheless, we argue that Flickr image data are still meaningful for our study due to their embodiment of human perceptions of the built environment and flexibility in the time dimension.
The first two stages of data pre-processing and UAOI extraction are based on the framework of [9]. All geotagged Flickr metadata uploaded within Inner London were collected through a bounding box, with a time span from the first day of 2013 to the last day of 2015. The attributes of each data record include geographic coordinates, the capture time of the photograph, the user ID, and the download URL of the photograph. This three-year time span has more Flickr photographs than any other since the site was launched in 2004. It also allows us to explore the dynamic characteristics of images within UAOIs by subdividing the time by month. To decrease the influence of a few active users who would otherwise dominate the analysis outcomes, we retained only one photograph for each user based on the tags used and the time when the photograph was taken [9]. This is because some active users may take many similar photographs in a high-density area, which would influence the extraction of UAOIs.
Specifically, if a user took several photographs in a minute but with the same tags, only one photograph was retained. The rationale for this approach was to remove photographs within a limited spatial extent based on the hypothesis that a person's average walking speed is 5 km/h [55]. On this basis, the maximum walking distance within a minute is approximately 83 m. Within this short distance, only a single user's photograph with the same text is retained.
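The arithmetic and the filtering rule above can be sketched as follows. This is a minimal illustration rather than the implementation of [9]: the record keys (`user_id`, `taken`, `tags`) and the function names are hypothetical, and "within a minute" is interpreted here as less than 60 seconds after the last retained photograph with the same user and tag set.

```python
from datetime import datetime, timedelta

WALKING_SPEED_KMH = 5.0  # average walking speed assumed in the text


def max_walk_distance_m(seconds, speed_kmh=WALKING_SPEED_KMH):
    """Distance in metres covered at speed_kmh within the given number of seconds."""
    return speed_kmh * 1000.0 / 3600.0 * seconds


def deduplicate(photos, window=timedelta(minutes=1)):
    """Keep one photograph per (user, tag set) within each time window.

    photos: iterable of dicts with the hypothetical keys
    'user_id', 'taken' (datetime), and 'tags' (iterable of str).
    """
    kept = []
    last_kept = {}  # (user_id, frozenset of tags) -> capture time of last retained photo
    for photo in sorted(photos, key=lambda p: p["taken"]):
        key = (photo["user_id"], frozenset(photo["tags"]))
        previous = last_kept.get(key)
        if previous is not None and photo["taken"] - previous < window:
            continue  # same user, same tags, less than one window apart: drop
        last_kept[key] = photo["taken"]
        kept.append(photo)
    return kept


if __name__ == "__main__":
    # 5 km/h over 60 s is roughly 83 m, matching the figure quoted in the text.
    print(round(max_walk_distance_m(60), 1))
```

Comparing tag sets rather than tag strings means that the order in which a user typed the tags does not matter; whether [9] treats tags this way is not stated in the text.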
For UAOI extraction, we rely on the methodology from [9], which combines HDBSCAN (hierarchical density-based spatial clustering of applications with noise) [56] and alpha shapes [57]. We identified UAOIs for every month using HDBSCAN and constructed the corresponding boundary for each UAOI via alpha shapes. Figure 2 shows the spatial distribution of all extracted UAOIs from 2013 to 2015 in Inner London in a light coral colour. We subsequently downloaded all photographs within Inner London through the URL links embedded in the Flickr metadata. Since the spatial extent of each UAOI is known (in other words, we know which images are grouped into UAOIs), we divided the photographs into two image subsets: UAOI and Non-UAOI images, with total numbers of 187,064 and 816,058 photographs, respectively.
ISPRS Int. J. Geo-Inf. 2020, 9, 264
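One way to split photographs into UAOI and Non-UAOI subsets, once UAOI boundaries are available as polygons, is a point-in-polygon test. The sketch below uses a standard ray-casting test and is only an illustration: the text does not say how the split was implemented, the keys `lon` and `lat` and both function names are hypothetical, coordinates are treated as planar, and the monthly variation of UAOI boundaries is ignored.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def split_photos(photos, uaoi_polygons):
    """Return (uaoi, non_uaoi) lists of photo records.

    photos: iterable of dicts with hypothetical keys 'lon' and 'lat'.
    uaoi_polygons: list of polygons, each a list of (lon, lat) vertices.
    """
    uaoi, non_uaoi = [], []
    for photo in photos:
        if any(point_in_polygon(photo["lon"], photo["lat"], poly)
               for poly in uaoi_polygons):
            uaoi.append(photo)
        else:
            non_uaoi.append(photo)
    return uaoi, non_uaoi
```

For the full dataset of roughly one million photographs, a spatial index or a GIS library would normally replace this brute-force loop.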

Extracting the Characteristics from UAOIs and Outer Areas
To uncover the potential driving factors that influence the formation of UAOIs, an image recognition technique is used to identify the content of each Flickr image. CNN models are generally designed to process data in the form of multiple arrays, such as a colour image consisting of three 2D arrays of pixel values, one for each of the three colour channels.
In this work, an image classification model, Places365 CNN, is used to extract the characteristics of UAOIs. The reason for using this classification model instead of object detection is primarily that we are interested in the characteristics of places. Places365 CNN can work as a classifier to identify scenes from the built environment. Other image recognition models could be used as well, but we deemed Places365 CNN the most suitable model in the context of this study. Places365 is the latest subset of the Places2 Database and comprises 1.8 million training images from 365 scene categories, with at most 5000 images per category [50]. We specifically use the Places365-ResNet model, fine-tuned on the ResNet152 (152-layer Residual Network) architecture. This CNN model has the best performance; its top-five classification accuracy reaches 85.08%, whereas the top-five classification accuracies for other popular CNNs, such as Places365-AlexNet, Places365-GoogLeNet, and Places365-VGG, are 82.89%, 83.88%, and 84.91%, respectively [50].
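For readers unfamiliar with the metric, top-five accuracy counts a prediction as correct when the true label appears among the five highest-probability labels. The sketch below illustrates that definition only; the function name and the input layout (one dict of label probabilities per image) are hypothetical and unrelated to the evaluation reported in [50].

```python
def top_k_accuracy(predictions, true_labels, k=5):
    """Share of images whose true label is among the k most probable labels.

    predictions: list of dicts mapping label -> probability, one per image.
    true_labels: list of ground-truth labels, aligned with predictions.
    """
    if not true_labels:
        raise ValueError("at least one labelled image is required")
    hits = 0
    for probabilities, truth in zip(predictions, true_labels):
        # Labels ranked from most to least probable, truncated to k.
        ranked = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
        if truth in ranked:
            hits += 1
    return hits / len(true_labels)
```

Because a larger k can only add correct hits, top-five accuracy is never lower than top-one accuracy for the same model.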
All photographs within and outside the UAOIs are fed to the Places365-ResNet model to explore whether UAOIs have unique characteristics compared to other areas. For efficiency, the recognition of all photographs (approximately 100 GB) was undertaken on a single Nvidia Quadro M5000 GPU with 8 GB of memory. As each photograph may contain more than one scene class, the model is set to return up to five labels, ranked by probability, for each photograph in our dataset. Furthermore, the top-five classification accuracy (85.08%) is far higher than the top-one accuracy (54.74%), as validated in the work of [50]. We then sum the probabilities of identical labels and divide by the total number of photographs, separately for UAOIs and other areas. This step yields the mean regular probability of each label in each type of area. Table 1 features a numeric illustration of how the results are interpreted and visualised in Section 4.1. It displays part of the output for the 365 categories/labels, where a higher label probability represents a more significant characteristic in that area, and vice versa.
Considering the temporal nature of UAOIs, certain UAOIs emerged and disappeared within just a few months (see examples in Figure 3). The UAOI in the north-west of Newham appears in July and August but disappears in September 2013, and a UAOI emerges in the middle of Southwark in August but vanishes in the next month. However, the regular characteristics recognised at the UAOIs over three years are unable to capture these minor seasonal changes. As a result, it remains challenging to explain why people would gather at certain UAOIs at specific times without identifying the dynamic patterns underlying these images.
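The label aggregation described above (summing the top-five probabilities per label and dividing by the number of photographs in an area) can be sketched as follows. This is an illustration under assumptions: the function names and the input layout, one dict of label probabilities per photograph, are hypothetical and do not reflect the actual output format of the Places365 code.

```python
from collections import defaultdict


def top_k(probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def mean_label_probability(predictions, k=5):
    """Mean top-k probability of each label over all photographs of one area.

    predictions: list of dicts mapping label -> probability, one per photograph.
    A label outside a photograph's top k contributes zero for that photograph.
    """
    n = len(predictions)
    if n == 0:
        return {}
    totals = defaultdict(float)
    for probabilities in predictions:
        for label, p in top_k(probabilities, k):
            totals[label] += p
    return {label: total / n for label, total in totals.items()}
```

Running the function once on the UAOI photographs and once on the Non-UAOI photographs gives two label-to-probability mappings of the kind that Table 1 samples.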
To understand the factors that contributed to the dynamic changes of UAOIs, we subdivided photographs into a finer temporal resolution (i.e., we grouped photographs by month). As before, the top five label probabilities were returned for each photograph, and the mean probability of each label for UAOIs and Non-UAOIs in each month was calculated. This produced 36 tables similar to Table 1, one for each month. We then concatenated them into a single table of label probabilities for the UAOIs, in which the rows and columns represent the 365 labels and the 36 months, respectively. We finally averaged the label probabilities for identical calendar months across the different years, as shown in Table 2, which includes a small sample from the 365 labels and a numeric illustration for Section 4.2. By doing this, the significant characteristics for the UAOIs in different months are identified, thereby allowing us to capture several interesting dynamic patterns.
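The last step, averaging identical calendar months across years, can be sketched as below. The function name and the input layout (a dict keyed by `(year, month)`) are assumptions made for illustration; a label missing from a monthly table is treated as having probability zero in that month, which is also an assumption.

```python
from collections import defaultdict


def average_by_calendar_month(monthly_tables):
    """Average label probabilities over years for each calendar month.

    monthly_tables: dict mapping (year, month) -> dict of label -> mean probability.
    Returns a dict mapping label -> list of 12 values (January to December).
    Months with no table at all are reported as NaN.
    """
    sums = defaultdict(lambda: [0.0] * 12)
    counts = [0] * 12  # number of years observed for each calendar month
    for (year, month), table in monthly_tables.items():
        counts[month - 1] += 1
        for label, p in table.items():
            sums[label][month - 1] += p
    return {
        label: [s / c if c else float("nan") for s, c in zip(values, counts)]
        for label, values in sums.items()
    }
```

With 36 monthly tables covering 2013 to 2015, each calendar month is averaged over three years, giving a 365 by 12 table of the kind sampled in Table 2.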
The probability values in Table 2 vary greatly among individual labels. For example, the values of the label "tower" are about 20 times higher than the values for the label "carousel". This disparity of scales makes it difficult to compare the variation of all characteristics simultaneously. To handle this, we standardised the label probability values by row using the z-score, which expresses each value relative to the sample mean and sample standard deviation of its row. The basic z-score is calculated by the formula below:
$$z = \frac{x - \bar{x}}{s}$$
where $x$ represents the value of the data point, and $\bar{x}$ and $s$ represent the sample mean and sample standard deviation, respectively. This process ensures that the values in each row in Table 2 are on the same scale, thus laying the foundation for the subsequent heatmap analysis. A heatmap is a graphical presentation of data where the values contained in a matrix are represented as colours; the darker the colour is, the higher the value or the density. We performed heatmap analysis on the z-score of the probability of a label because it returns an instant visual pattern of the labels in a timeline, offering better insight into the dynamic characteristics of UAOIs.
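A row-wise z-score of this kind can be sketched with the Python standard library. The function name and the table layout (label mapped to a list of monthly values) are hypothetical; `statistics.stdev` computes the sample standard deviation (n - 1 denominator), which matches the definition of s above.

```python
from statistics import mean, stdev


def zscore_rows(table):
    """Standardise each row (label) of a label -> monthly values table.

    Each row needs at least two values, otherwise stdev raises StatisticsError.
    A constant row has zero standard deviation and is mapped to all zeros.
    """
    result = {}
    for label, values in table.items():
        m = mean(values)
        s = stdev(values)  # sample standard deviation
        result[label] = [(x - m) / s if s > 0 else 0.0 for x in values]
    return result


if __name__ == "__main__":
    # Toy values: "tower" is 20 times larger than "carousel" in every month,
    # yet both rows standardise to the same pattern.
    toy = {"tower": [0.2, 0.3, 0.4], "carousel": [0.01, 0.015, 0.02]}
    for label, row in zscore_rows(toy).items():
        print(label, [round(z, 3) for z in row])
```

After standardisation, rows with very different magnitudes share one colour scale, which is what makes a single heatmap readable.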

Regular Characteristics of UAOIs and Non-UAOIs
Based on the mean regular probabilities of the 365 categories for UAOIs and outside areas, we visualised the top 50 categories for both in an inverted pyramid graph (see Figure 4). The labels for the left and right y-axes were organised hierarchically, representing the significance of the characteristics from most to least within and outside the UAOIs. The top three characteristics for UAOIs are "tower", "skyscraper", and "bridge", suggesting that the Tower of London, skyscrapers, and a variety of bridges, such as Millennium Bridge and Tower Bridge, are the most significant representations of UAOIs and the primary reasons for why people gathered in these places. The overall composition of the UAOIs includes iconic landmarks, historic and famous buildings, entertainment places, and museums and galleries, as the most high-frequency appearances of these characteristics include the tags "canal", "harbour", "church", "amusement park", "museum", "gallery", and so on. The components of areas outside the UAOIs are more strongly related to buses or train stations, as well as several indoor venues, such as "arena", "music studio", "conference centre", and "shops". These are ordinary scenes from daily life, which are less attractive to large numbers of people. There are a few repetitive characteristics in the top 50 for both categories, making it difficult to determine the differences between UAOIs and Non-UAOIs. For example, the labels "tower", "street", "bus_station", "skyscraper", and "downtown" are identified in the top 10 for both. We then distinguished the most significant characteristics for both areas by calculating the different values of the mean regular probability of all labels in the UAOIs and Non-UAOIs. Figure 5 shows the differences of features between UAOIs and Non-UAOIs. By plotting this, features that are common in both would cancel out if their probabilities were the same and thus not feature in the figure. 
The bars in light coral and grey, respectively, represent more significant features for UAOIs and Non-UAOIs. A total number of 28 labels have a higher probability in UAOIs, while more labels are identifiable in Non-UAOIs. This can be attributed to the huge and manifold areas of Non-UAOIs, where larger numbers of photographs were taken. Although the significant levels of characteristics in UAOIs and Non-UAOIs are slightly different from those in Figure 4, the overall pattern conforms to the features shown above. UAOIs involve more scenic spots and places of entertainment, such as "tower", "church", "canal", "fountain", "amusement park", and "shopping mall", while the areas of less interest are more strongly related to daily life, including labels like "bus station", "street", "bar", "conference centre", and "railroad track". There are a few repetitive characteristics in the top 50 for both categories, making it difficult to determine the differences between UAOIs and Non-UAOIs. For example, the labels "tower", "street", "bus_station", "skyscraper", and "downtown" are identified in the top 10 for both. We then distinguished the most significant characteristics for both areas by calculating the different values of the mean regular probability of all labels in the UAOIs and Non-UAOIs. Figure 5 shows the differences of features between UAOIs and Non-UAOIs. By plotting this, features that are common in both would cancel out if their probabilities were the same and thus not feature in the figure. The bars in light coral and grey, respectively, represent more significant features for UAOIs and Non-UAOIs. A total number of 28 labels have a higher probability in UAOIs, while more labels are identifiable in Non-UAOIs. This can be attributed to the huge and manifold areas of Non-UAOIs, where larger numbers of photographs were taken. 
These regular characteristics quantitatively suggest why people would gather at UAOIs regularly over several years, as well as the characteristic differences between UAOIs and other areas. A large number of world-famous landmarks, modern skyscrapers, large-scale shopping malls, plazas, and places of entertainment are located at UAOIs. The uniqueness of these elements has attracted thousands of people (both travellers and residents in Inner London) to take photographs of them. Conversely, the characteristics of photographs taken outside UAOIs are relatively common and anonymous and are primarily associated with daily-life scenes. We would like to highlight that features such as "music studio" and "pub" lean slightly towards Non-UAOIs but are not clear signifiers of the class (in other words, they can be found in the middle of the figure). Subjectively, this could correspond to people taking photographs with no specific purpose in these areas, in contrast to the more purposeful photographs taken within UAOIs, such as those recording tourist attractions like Tower Bridge.
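As a minimal illustration of the comparison behind Figure 5, the following Python sketch averages per-photograph label probabilities inside and outside UAOIs and ranks labels by the difference. The function names, the input layout (one dictionary of label probabilities per photograph), and the toy probabilities are assumptions made for illustration; they are not taken from the study's own code or data.

```python
from statistics import mean


def mean_label_probabilities(photos):
    """Average the probability of each label over a list of photos.

    `photos` is a list of dicts mapping label -> predicted probability.
    A label missing from a photo is treated as probability 0.0 (assumption).
    """
    labels = sorted({label for photo in photos for label in photo})
    return {
        label: mean([photo.get(label, 0.0) for photo in photos])
        for label in labels
    }


def probability_differences(uaoi_photos, non_uaoi_photos):
    """Rank labels by mean probability inside UAOIs minus outside UAOIs.

    Positive values lean towards UAOIs, negative values towards Non-UAOIs,
    and labels that are equally probable in both end up at (or near) zero.
    """
    inside = mean_label_probabilities(uaoi_photos)
    outside = mean_label_probabilities(non_uaoi_photos)
    labels = set(inside) | set(outside)
    diffs = {
        label: inside.get(label, 0.0) - outside.get(label, 0.0)
        for label in labels
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up probabilities for two photos in each group.
uaoi = [
    {"tower": 0.6, "street": 0.2},
    {"tower": 0.4, "street": 0.2},
]
non_uaoi = [
    {"tower": 0.1, "street": 0.2, "bus_station": 0.5},
    {"tower": 0.1, "street": 0.2, "bus_station": 0.3},
]

ranked = probability_differences(uaoi, non_uaoi)
for label, diff in ranked:
    print(f"{label:12s} {diff:+.2f}")
```

In this toy input, "street" has the same mean probability in both groups, so its difference is zero and it would not stand out in a plot of differences, which mirrors the cancelling-out behaviour described above.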
More importantly, the results demonstrate that the content of geotagged Flickr images, rather than their tags, can be used to quantify the characteristics of the urban environment. This has rarely been explored in past research, where quite a few studies have instead used Flickr tags to understand the urban environment and people's perceptions of it [7,8,22]. Moreover, these results help to familiarise us with the features perceived by large communities at a local scale, whereas previous attempts primarily focused on global urban appearance features.

Dynamic Characteristics of UAOIs
Based on the z-score conversion, Figure 6 displays a heatmap of the top 50 labels in terms of probability of occurrence. This representation uncovers the underlying characteristics of UAOIs at certain time periods, where darker red and darker blue indicate how many standard deviations a label lies above or below its mean over the period, respectively. The top three characteristics of the UAOIs, "tower", "skyscraper", and "bridge", primarily present an intermediate colour between red and blue, with z-scores ranging from −1 to 1, implying that these three characteristics remain attractive to people all year round. The colour for several transport-related labels, such as "subway_station", "train_station", and "airport_terminal", was slightly red from January to March but blue for the rest of the year, suggesting that more photographs of these travel modes were taken during these months. Conversely, people's travel mode priorities might differ when the weather becomes warmer, possibly including more walking and fewer vehicles. This manifests in the "street", "promenade", and "crosswalk" labels, whose z-scores peak in June or July but remain at an average level during the other months. We also uncovered various seasonal patterns of indoor and outdoor activities for UAOIs. For example, a series of indoor museums and galleries labelled as "museum/indoor", "natural_history_museum", "science_museum", and "art_gallery" were more prevalent during relatively cold months (February and March) than in the others, while a number of magnificent buildings, as well as outdoor leisure places, with labels like "church", "palace", "mosque", "castle", "plaza", "bazaar", and "sky" were more likely to be identified in relatively warm seasons.
These dynamic patterns demonstrate that season has an important impact on human activity and considerably changes people's travel and activity modes, leading to different scene characteristics of UAOIs over the year. UAOI features tend to contain more vehicles and indoor buildings in winter, as people prefer to take photographs of vehicles and indoor activities during the cold season. Correspondingly, the UAOI features consist of more crosswalks, magnificent buildings, and recreational areas in warmer months, as more photographs related to these features were taken during this period.
These results also illustrate how urban perception changes over time, showing that dynamic analytics are important for understanding the urban environment. They help to bridge the identified research gap concerning the dynamic features of cities [10,36]. Meanwhile, the practical implications of the dynamic characteristics of UAOIs can be reflected in the actions of retailers and local authorities. For example, retailers within UAOIs could extend their opening hours or deliver targeted advertising to potential customers in the summer, as people were more active during this period.
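The z-score conversion behind a heatmap such as Figure 6 can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function name, the input layout (twelve monthly mean probabilities per label), the use of the population standard deviation, and the toy values are all hypothetical.

```python
from statistics import mean, pstdev


def monthly_z_scores(monthly_probs):
    """Standardise each label's monthly probabilities across the year.

    `monthly_probs` maps label -> list of 12 monthly mean probabilities.
    Each value becomes (value - label mean) / label standard deviation.
    The population standard deviation is used here (an assumption), and a
    label with no variation over the year gets a z-score of 0.0 everywhere.
    """
    z_scores = {}
    for label, values in monthly_probs.items():
        mu = mean(values)
        sigma = pstdev(values)
        z_scores[label] = [
            0.0 if sigma == 0 else (value - mu) / sigma for value in values
        ]
    return z_scores


# Toy, made-up monthly probabilities (index 0 = January, 11 = December).
toy = {
    "promenade": [0.02] * 5 + [0.08] + [0.02] * 6,  # single peak in June
    "tower": [0.05] * 12,                           # stable all year round
}

z = monthly_z_scores(toy)
peak_month = max(range(12), key=lambda m: z["promenade"][m]) + 1
print("promenade peaks in month", peak_month)
```

Standardising each label separately makes labels with very different base probabilities comparable on one colour scale, which is why a stable label such as "tower" sits near zero while a seasonal label shows a clear peak.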

Capacity and Bias of Using Places365-CNN within This Context
In addition, the above heatmap suggests that certain patterns deserve special attention. Some characteristics, such as coffee shops, streets, crosswalks, and amusement parks, are highly popular (i.e., reddest) in just a single month. To investigate what happened during these months, the "amusement_park" label was selected as an example for inspection. Specifically, we extracted the photographs classified as "amusement_park" in December across the three years and discarded those whose classification probability fell below a threshold of 0.5. A total of 175 photographs were kept after filtering, the majority of which (54.7%) were distributed at UAOIs, where Hyde Park, Trafalgar Square, London Bridge, and North Greenwich are located. Figure 7 displays a handful of samples from the 175 photographs we extracted, which were taken by various photographers in various years. Here we can see a Ferris wheel, street food markets, roller coaster rides, ice skating, and carousels; these scenes appear in the upper half of the figure, in images taken at Hyde Park. This seems to be related to Hyde Park's Winter Wonderland, a Christmas extravaganza that is open to the public for 6 weeks every year from mid-November to the end of December [58]. This is one of the reasons that "amusement_park" peaked in December, in agreement with common knowledge. (Note on Figure 7: due to the different shapes of the photographs, some images have been rescaled and cropped to aid visualisation. Photographers (Flickr user IDs) of the images in Figure 7: ©17576427@N00, ©89333651@N00, ©91832335@N04, ©42230049@N03, ©16483105@N02, ©87076514@N02, ©64882892@N08, ©24605992@N06, ©75209620@N00, ©42112515@N06, ©42230049@N03, ©29558445@N00, ©36054481@N00, ©74264857@N00. Copyright of the images is retained by the photographers.)
However, the label does not always correspond to an actual amusement park, as the photographs in the rest of Figure 7 show. These photographs were taken at Trafalgar Square rather than Hyde Park and capture a sculpture of a giant blue chicken, a Christmas tree, and a fountain with a red light, each photographed by multiple people. These scenes are not parts of an amusement park in the strictest sense, but their integration at a specific place and time can be considered a provisional amusement park, as the blue sculptures, green trees, and red fountains resemble the colourful characteristics of an amusement park. The probable reason for this phenomenon is that groups of people gathered around Trafalgar Square in December because the Christmas tree appeared there in early December, and various events, such as a lighting ceremony and carol singing, took place during this period [59]. Therefore, "amusement_park" became extremely prevalent in December because many seasonal landmarks appeared and spectacular events happened in a few UAOIs due to Christmas.
This pattern demonstrates that the pre-trained Places365-CNN model may not fit Flickr images very well, as several images can be classified on the basis of biased characteristics. Nevertheless, the capacity of this CNN model to unpack the characteristics of the local built environment, which other models rarely offer, should not be underestimated. The model successfully identified several pieces of useful information from urban areas, which can be used as a reference for policymakers and stakeholders.
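The inspection step described in this section (keeping only confidently classified photographs for one label and month, then checking how many fall inside UAOIs) can be sketched as below. The `Photo` record, its field names, the helper functions, and the toy records are hypothetical and serve only as an illustration; only the 0.5 threshold, the "amusement_park" label, and the month of December come from the text.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    month: int          # month the photo was taken (1-12)
    label: str          # scene label assigned by the classifier
    probability: float  # classification probability for that label
    in_uaoi: bool       # whether the geotag falls inside a UAOI


def select_confident(photos, label, month, threshold=0.5):
    """Keep photos with the given label and month at or above the threshold."""
    return [
        p for p in photos
        if p.label == label and p.month == month and p.probability >= threshold
    ]


def uaoi_share(photos):
    """Fraction of the given photos located inside UAOIs (0.0 if empty)."""
    if not photos:
        return 0.0
    return sum(p.in_uaoi for p in photos) / len(photos)


# Toy, made-up records.
photos = [
    Photo(12, "amusement_park", 0.90, True),
    Photo(12, "amusement_park", 0.70, True),
    Photo(12, "amusement_park", 0.55, False),
    Photo(12, "amusement_park", 0.30, True),   # below the threshold
    Photo(6, "amusement_park", 0.80, True),    # not December
    Photo(12, "plaza", 0.90, True),            # different label
]

selected = select_confident(photos, "amusement_park", month=12)
print(len(selected), "photos kept;", f"{uaoi_share(selected):.1%} inside UAOIs")
```

A manual look at the photographs that survive such a filter, as done with Figure 7, is what reveals whether a high-probability label reflects the real scene or a bias of the pre-trained model.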

Conclusions
In this study, a recent and rarely used image recognition method, Places365-CNN, was used to extract and quantify features of the local urban environment from Flickr photographs. We first compared the regular characteristics within and outside UAOIs over three years. Then, we explored the dynamic characteristics of UAOIs over that period. The results help explain why people become more interested in certain urban areas than in others, what characteristics these areas possess, and whether these characteristics change over time. We found that the UAOIs were mainly identified in areas where iconic landmarks, tourist attractions, magnificent buildings, and leisure zones are located, such as towers, bridges, skyscrapers, churches, plazas, and shopping malls, which differ from the characteristics of Non-UAOIs, where more daily life-related areas are captured, such as stations, shops, and indoor venues. In terms of the dynamic characteristics, UAOIs extracted in the winter contained more vehicles and indoor buildings, while UAOIs extracted in other seasons consisted of more crosswalks, magnificent buildings, and recreational areas. These patterns demonstrate that season has an important impact on human preferences for travel and activity modes. People tend to travel by various vehicles and conduct indoor activities on cold winter days but walk and engage in outdoor activities when the weather gets warmer.
This study contributes to both the theoretical and practical domains. We demonstrated that Flickr photographs themselves can be used to understand the perceived features of cities, rather than relying on traditional approaches that use Flickr tags or other image sources such as GSV images. More importantly, this work provides a potential way to bridge the research gap between image recognition techniques and urban perception analytics. Local scales and dynamic characteristics play important roles in recognising the features of the urban environment. In terms of practical significance, the regular and dynamic characteristics of the urban environment provide new insights for policymakers, who can use these findings as vital evidence for decision making. The regular characteristics of UAOIs would give urban planners a macroscopic understanding of urban areas and aid them in formulating relevant policies, such as investing more funds in certain UAOIs to stimulate consumption for economic growth. The dynamic characteristics of UAOIs can help transport planners regulate trip frequency in various seasons, with a greater trip frequency in the winter than in the summer. Furthermore, retailers may also be inspired by the dynamic characteristics of UAOIs, helping them to better design personalised advertisements at specific places and times or extend their opening hours in the summer.
However, the limitations of this study warrant further attention in future work. Flickr offers only one type of geotagged image data. Future work should combine multiple image sources, which would make the results more persuasive and improve the coverage of the analysis. In addition, although the Places365-CNN model that we used to extract the urban features has a relatively high classification accuracy compared to others, the model is trained on the Places2 dataset, which may differ from the Flickr dataset in this study. This could lead to several features identified by Places365-CNN being inconsistent with the real content of the images. This issue can be addressed by manually labelling the features for a certain number of images and then retraining the model by fine-tuning the parameters in the max-pooling layer of the Places365-CNN. Finally, the study area we selected was located at the local level of Inner London; more interesting patterns could be uncovered at a smaller scale by including more cities in future work.
Author Contributions: Conceptualization, Meixu Chen and Dani Arribas-Bel; methodology, software, investigation, and writing-original draft preparation, Meixu Chen; writing-review, editing and supervision, Dani Arribas-Bel and Alex Singleton. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.