Deep Neural Networks and Kernel Density Estimation for Detecting Human Activity Patterns from Geo-Tagged Images: A Case Study of Birdwatching on Flickr

Abstract: Thanks to recent advances in high-performance computing and deep learning, computer vision algorithms coupled with spatial analysis methods provide a unique opportunity for extracting human activity patterns from geo-tagged social media images. However, there are only a handful of studies that evaluate the utility of computer vision algorithms for studying large-scale human activity patterns. In this article, we introduce an analytical framework that integrates a computer vision algorithm based on convolutional neural networks (CNN) with kernel density estimation to identify objects, and infer human activity patterns from geo-tagged photographs. To demonstrate our framework, we identify bird images to infer birdwatching activity from approximately 20 million publicly shared images on Flickr, across a three-year period from December 2013 to December 2016. In order to assess the accuracy of object detection, we compared results from the computer vision algorithm to concept-based image retrieval, which is based on keyword search on image metadata such as textual description, tags, and titles of images. We then compared patterns in birding activity generated using Flickr bird photographs with patterns identified using eBird data—an online citizen science bird observation application. The results of our eBird comparison highlight the potential differences and biases in casual and serious birdwatching, and similarities and differences among behaviors of social media and citizen science users. Our analysis results provide valuable insights into assessing the credibility and utility of geo-tagged photographs in studying human activity patterns through object detection and spatial analysis.


Introduction
The availability and widespread use of large-scale, user-generated, geo-tagged, and time-stamped images offer a unique opportunity to capture human activity patterns. For example, Flickr (www.flickr.com) provides a proxy for capturing human recreational activities from local to global geographic scales [1][2][3]. Previous studies that utilize online photographs for inferring nature-based human activities have a diverse research focus. Some examples include, but are not limited to, identifying the character of landscapes [4], land cover and land use [5], recreational demand in water resources [1], events and tourist hotspots [6][7][8], the impact of rare species on tourism [9], common tourist trajectories [10,11], recreational visitation at national parks [3], the perceived aesthetic value of ecosystems [12,13], as well as the relationship between cultural ecosystem services and landscape features [14,15].

Such data also have limitations: on platforms such as Foursquare, most of the content is produced in urban areas [25,26]. The urban-rural divide in social media usage generates a further limitation, the small area or number problem, which leads to spurious variation in patterns extracted from areas with a low density of observations [27]. In order to address the varying density problem, researchers have used an expectation surface based on the Chi-statistic and density estimation [28]. In an expectation surface, the observed photograph counts are compared to expected values derived from population density. To keep the most active users from dominating elicited spatial patterns, previous studies normalized the number of photographs by each user based on a threshold determined by distance [4], or distinct user count and density criteria [17]. Other social media-based indicators used for mapping recreational activities include Flickr photograph counts [9], the number of individual Panoramio users participating in specific activities [12], and Flickr-generated user-days based on the number of photographs taken by individual users on unique days in a location [1,3,29].

Computer Vision Algorithms
Computer vision algorithms are used to extract semantics from image properties such as color, shape, texture, or any other information that can be derived from the image itself. Computer vision algorithms are different from concept- or description-based image indexing [30,31], which searches keywords in image metadata such as title, tags, and descriptions to infer concepts and semantics from images. Deep learning, which allows extraction of high-level abstractions in data by utilizing a hierarchical architecture [32], has been widely used in a variety of applications such as natural language processing, semantic parsing, transfer learning, and computer vision. Extracting semantics from images remains a significant challenge due to the semantic gap: the difficulty of deriving high-level semantic concepts from the low-level image pixels that the algorithms operate on. However, there have been a variety of successful applications of CNNs for addressing the semantic gap and extracting context from geo-tagged images. For example, Porzi et al. [33] employed a CNN architecture to capture people's perception of safety, attractiveness, and uniqueness using Google Street View images. CNNs have also been widely used in image-based geo-localization and scene recognition. Geo-localization and scene recognition algorithms use image content to identify the location where the image was taken, as well as the characteristics of that location [16,34,35]. Similarly, Tracewski et al. [5] applied neural networks to identify land cover and land use classification of images obtained from various social media sources such as Flickr, Panoramio, Geograph, and Instagram. For a more in-depth discussion of deep learning methods, readers may refer to Guo et al. [32], Wan et al. [36], and Yang et al. [37].

Birdwatching
Birdwatching is a non-consumptive outdoor recreational activity that arose in the early 1900s [38]. It is a popular activity, with an estimated 46.7 million individuals participating in birding annually in the US [39], and over six million individuals observing birds at least every three weeks in the United Kingdom (CBI 2011). Most birders (88%) view birds around their homes, but many (38%) travel, often great distances, to birdwatch [39]. Birdwatching falls into several overlapping categories based on expertise and motivation [40,41]. "Birding" is undertaken by hobbyists, professionals, or semi-professionals focused on studying and identifying birds. Birdwatching may also function as a sport through "listing", an often-competitive process whereby individuals maintain checklists of species they have observed. Birders may seek out rare species or species located outside of their typical range, watching blogs for reports of such species and traveling great distances to add such birds to their lists. Such activities typically involve documenting observations, often via photographs shared through social media, and may involve posting sightings to citizen science applications such as eBird. These activities produce substantial economic impacts related to the purchase of equipment (e.g., telescopes, binoculars, cameras, and bird feeding supplies), and to travel and associated expenses (e.g., air fare, lodging, and dining).
Bird species may be migratory or resident and, if migratory, may be short- (i.e., moving within the same local area), medium- (e.g., moving among US states), or long-distance migrants (e.g., moving across or among continents). Migration follows seasonal variation in resource availability and environmental conditions, and includes movements between breeding and wintering locations in the fall and spring seasons [42]. During migration, birds utilize stopover sites of varied habitat quality for resting and refueling. Stopover sites of high habitat quality may serve numerous species in relatively high densities. These sites may be particularly interesting to birdwatchers, especially if they are publicly accessible (e.g., parks and wildlife refuges), for observing many transitory species in one location. As a result of the competitive nature of listing, social media and citizen science postings may highlight more rare and transient birds during migration at stopover hotspots.

Materials and Methods
Figure 1 illustrates the analytical framework proposed in this article. We first collected all geo-located Flickr image metadata and images, and eBird observations, given the photograph-taken and observation dates, respectively, and the border of the conterminous US, across a three-year period between December 2013 and December 2016. Flickr metadata consisted of attributes that identify the photograph by id, the name and identification number of the user, the location where the photograph was taken (i.e., longitude and latitude coordinates that are either manually added by users or generated by cameras/smartphones), the time and date on which the photograph was taken and uploaded, and textual annotations provided by users and the application, including tags, description, and title of photo contents. The eBird basic dataset (EBD), in turn, is a freely available, global citizen science online bird observation dataset, collected and maintained by the Cornell Lab of Ornithology and the National Audubon Society (http://eBird.org/content/eBird) [19]. eBird data contained the name, counts, and types of the species observed during a single search event, the location where the search took place, the time, date, and duration of the search, as well as the name of the observer.
After the initial download, we filtered both datasets to include only the geo-tagged Flickr images and eBird observations that were within the conterminous US. In the next steps, we extracted bird images using metadata keyword search and the YOLO deep learning library. We first compared the accuracy of object detection by YOLO with metadata search, as well as their spatial patterns. Second, we extracted spatial patterns of birdwatching activity from eBird, and compared them with YOLO-detected Flickr bird photographs in order to identify the similarities and differences between eBird and Flickr in capturing birdwatching behaviors.
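As a concrete illustration, the concept-based retrieval baseline can be sketched as a simple keyword match over each photograph's textual metadata. The field names and keyword list below are illustrative assumptions, not the exact terms used in the study:

```python
# Illustrative sketch of the metadata (concept-based) retrieval baseline.
# The keyword list and metadata field names are assumptions for this example.
BIRD_KEYWORDS = {"bird", "birds", "birding", "birdwatching"}

def metadata_mentions_bird(photo: dict) -> bool:
    """Return True if any bird keyword appears in the photo's title, tags, or description."""
    text_fields = [
        photo.get("title", ""),
        photo.get("description", ""),
        " ".join(photo.get("tags", [])),
    ]
    words = set()
    for field in text_fields:
        words.update(field.lower().split())
    return bool(words & BIRD_KEYWORDS)

photo = {"title": "Acorn Woodpecker", "tags": ["bird", "california"], "description": ""}
print(metadata_mentions_bird(photo))  # True
```

Note that such a search misses bird photographs whose metadata never mentions birds, which is exactly the gap the object detection step is meant to close.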

You Only Look Once (YOLO)
YOLO is a state-of-the-art, unified, real-time object detection system [18,43,44]. Unlike other object detection approaches such as the deformable parts model (DPM) [45] and R-CNN [46], YOLO frames object detection as a regression problem. During training, the following sum-of-squared-error loss function was minimized over the $S^2$ grid cells of the image and the $B$ bounding boxes predicted per cell [44]:

$$
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&\quad + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&\quad + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&\quad + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

where $(x_i, y_i)$, $w_i$, $h_i$, and $C_i$ represent the center, width, height, and confidence value, respectively, of the bounding box relative to grid cell $i$; $p_i(c)$ represents the probability that the object in grid cell $i$ belongs to class $c$; $(\hat{x}_i, \hat{y}_i)$, $\hat{w}_i$, $\hat{h}_i$, and $\hat{C}_i$ represent the center, width, height, and confidence value, respectively, of the training object that falls into grid cell $i$; $\hat{p}_i(c)$ represents the probability that the training object that falls into grid cell $i$ belongs to class $c$; $\mathbb{1}_{i}^{obj}$ denotes whether an object appears in cell $i$; $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding box predictor in cell $i$ is "responsible" for that prediction, with $\mathbb{1}_{ij}^{noobj}$ as its complement; and $\lambda_{coord}$ and $\lambda_{noobj}$ are weights for the localization error and for the confidence loss of boxes that contain no objects, respectively.
In YOLO, a single neural network is used to simultaneously predict multiple object bounding boxes and the corresponding class probabilities directly from image pixels (Figure 2). Redmon et al. [44] compared the mean average precision (mAP) and frames per second (FPS) of YOLO to other detection algorithms using validation data sets from PASCAL VOC 2007. After evaluating the reported average precision and efficiency of other object detection algorithms, including 30Hz DPM [47], Fastest DPM [48], R-CNN Minus R [49], Fast R-CNN [48], Faster R-CNN VGG-16 [50], and Faster R-CNN ZF [50], we decided to use YOLO to detect birds in images.
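Because YOLO returns a list of detections per image, flagging bird images reduces to checking the detected class labels. The sketch below assumes a hypothetical list of (class, confidence, bounding box) tuples standing in for the output of a YOLO forward pass; consistent with the evaluation later in the article, no confidence threshold is applied by default:

```python
# Minimal sketch of post-processing YOLO detections to flag bird images.
# `detections` is a hypothetical stand-in for a YOLO forward pass: a list of
# (class_name, confidence, bounding_box) tuples.
def contains_bird(detections, threshold=0.0):
    """True if any detection is labeled 'bird' at or above the threshold.
    The default threshold of 0.0 keeps every detection regardless of score."""
    return any(cls == "bird" and conf >= threshold
               for cls, conf, _bbox in detections)

detections = [("bird", 0.60, (10, 20, 50, 40)),
              ("bench", 0.60, (0, 100, 200, 60))]
print(contains_bird(detections))       # True
print(contains_bird(detections, 0.9))  # False
```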
We experimented with the latest version of YOLO, YOLOv3, for which detection took approximately 15 seconds per image on a Central Processing Unit (CPU). The total number of images to be processed was ~20 million, so we performed high-throughput computing (HTC) on the Argon HPC system at The University of Iowa (UI). The Argon HPC system consists of 366 compute nodes with 40 to 56 cores per node. Due to the substantially larger number and availability of CPU units, we performed HTC on CPU nodes. Argon uses the Son of Grid Engine (SGE) queuing system for job submissions, and has a limit of 10,000 active jobs per user, which includes currently running jobs and pending jobs waiting to be submitted. We used array jobs to submit the YOLO object detection tasks. An array job consists of identical tasks ordered by a range of index numbers, and can contain at most 75,000 tasks. Given the user and array job limits, we submitted 267 array jobs to process the ~20 million images, which took approximately a week.
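The array-job partitioning follows directly from the limits stated above; a quick back-of-the-envelope check:

```python
import math

# Partitioning ~20 million images into SGE array jobs, given the stated
# limit of 75,000 tasks per array job. Numbers follow the text above.
TOTAL_IMAGES = 20_000_000
MAX_TASKS_PER_ARRAY_JOB = 75_000

num_array_jobs = math.ceil(TOTAL_IMAGES / MAX_TASKS_PER_ARRAY_JOB)
print(num_array_jobs)  # 267, matching the number of array jobs reported
```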

Kernel Density Estimation
In order to compare the spatial distributions of images identified by YOLO and metadata search, and of eBird and YOLO-detected observations, we performed two types of kernel density estimation (i.e., fixed-distance and adaptive bandwidth), both based on the formula below [51]:

$$
\hat{f}(x) = \frac{1}{n h^2} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)
$$

where $x_1, x_2, \ldots, x_n \in O_i$ is the set of observation locations within the bandwidth (neighborhood) of $G_i$; $n$ is the total number of observations; $x$ is the location of estimation; $K$ is the kernel function; and $h$ is the bandwidth (the distance). Commonly used kernel functions are the uniform, Epanechnikov, triangular, and Gaussian kernels. The choice of kernel function often does not change the density estimation result; however, the bandwidth $h$ is a key parameter that determines the outcome of fixed-distance kernel estimation. The same formula applies to both fixed-distance and adaptive kernel density estimation (smoothing). Kernel smoothing can be performed on regular grids or on spatial units of different aggregation levels (e.g., counties, census tracts, and block groups). In this study, we divided the study area into a grid of 5-mile (8 km) resolution covering the conterminous US, and used this resolution for the fixed-kernel density estimation comparing spatial patterns of YOLO detection to metadata search, and eBird to YOLO-detected observations. Fixed-distance kernel smoothing requires a distance threshold to determine the bandwidth; we set the bandwidth to 20 miles, which is approximately equivalent to the second-order immediate neighborhood of a grid cell. Fixed-distance spatial filters in kernel density estimation often result in the loss of geographic detail where the density of observations is much higher; moreover, smaller filters produce unreliable estimates in areas with sparse observations. Different from fixed-distance kernel density estimation, adaptive kernel density estimation (smoothing) [52] is a non-parametric method that uses local information in neighborhoods defined by varying kernel sizes to estimate values of specified features at given locations. Adaptive kernel smoothing requires a minimum number of observations (k-nearest observations) to determine the bandwidth, the observations within the neighborhood of estimation, and their spatial weights. While the fixed-distance kernel allowed us to compare absolute differences between the two datasets, it did not address the user contribution bias and the small area problem, which produced unreliable estimates for areas with a lower density of observations.
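For a single estimation location, the fixed-distance estimator with a uniform kernel can be sketched as follows; planar (projected) coordinates and the 2-D normalization constant are assumptions of this illustration:

```python
import math

# Sketch of fixed-bandwidth kernel density estimation with a uniform kernel,
# mirroring the formula above at one estimation location. Coordinates are
# assumed to be in a projected (planar) system so Euclidean distance applies.
def uniform_kde(x, observations, h):
    """Density at location x from observation points within bandwidth h."""
    n = len(observations)
    if n == 0:
        return 0.0
    inside = 0
    for xi, yi in observations:
        dist = math.hypot(x[0] - xi, x[1] - yi)
        if dist <= h:          # uniform kernel: constant weight inside h
            inside += 1
    # normalize by n and the kernel support area (pi * h^2 in 2-D)
    return inside / (n * math.pi * h * h)

obs = [(0, 0), (1, 1), (10, 10)]
density = uniform_kde((0, 0), obs, h=2.0)  # only the two nearby points count
```

Sweeping `x` over the centers of the 5-mile grid cells with `h` set to 20 miles would reproduce the fixed-distance surface described above.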
In addition to fixed-distance kernel density estimation, we employed adaptive kernel density estimation in order to derive a smoothed rate of Flickr and eBird observations based on the number of users. Because the two sources together form a large dataset (125 million eBird observations and 750,000 YOLO-detected bird images), and substantial numbers of observations shared exact coordinates, we introduced an algorithm for efficient computation of the adaptive kernel estimation based on the number of distinct user locations. Definitions and steps of the adaptive kernel smoothing are given below. In Step 1, the area was divided into a grid of 5-mile (8 km) resolution, the same resolution used in fixed-distance kernel smoothing. In Step 2, because a substantial number of observations shared the same coordinates, we aggregated observations with identical coordinates into distinct observation locations prior to determining the k-nearest users. As a result, we obtained a list of distinct observation locations that included the total number of observations and the list of users for both eBird and Flickr. We define k as the minimum number of users required to form a neighborhood: given a positive neighborhood size threshold k based on the number of users, a k-size neighborhood is derived for each grid cell G_i ∈ G as the smallest set of k-nearest neighbors of G_i that meets the size constraint. In Step 3, we employed a Sort-Tile-Recursive (STR) tree algorithm to compute a spatial index of distinct user locations, improving the computational efficiency of determining the k-nearest distinct users and their locations for each grid cell. We set the threshold k as a combined number of 100 Flickr and eBird users. Once the neighborhood reached the defined threshold k, we determined the list of observations O_i, the bandwidth h(G_i, k), and the weights of distinct observation locations for each grid cell. K is the kernel function, and h is the bandwidth for smoothing. Kernel functions determine the weight of each observation within a kernel, and the choice of function often does not have a substantial impact on the result; the most commonly used kernel functions are the uniform, Epanechnikov, triangular, and Gaussian kernels. In this study, we employed the uniform kernel to simplify interpretation of the estimation. Given the list of observations, the spatial weights, and the counts of observations for eBird and Flickr, we computed a continuous surface that took into account the number of distinct users in each kernel, as defined in Step 5.

G: Grid, the total set of grid cells that covers the study area.
k: Adaptive filter (neighborhood) threshold based on the total number of distinct users.
U_i: The list of users within the neighborhood of G_i.
O_i: The list of observations within the neighborhood of G_i.
h(G_i, k): The bandwidth of the k-size neighborhood of grid cell G_i, defined by the smallest KNN(G_i, k) = {G_j ∈ G} that has a total count of at least k distinct users.
K: Kernel function. The uniform function is used for simple interpretation of the results.

Steps:
(1) Compute G, the grid of the study area, given a resolution r. In this study, r = 8 km was used.
(2) Aggregate observation statistics, such as the number of observations, and keep a list (hash) of users for each distinct observation location for both Flickr and eBird.
(3) Given k = 100, compute a spatial index based on a Sort-Tile-Recursive (STR) tree for finding the k-nearest Flickr and eBird users for each grid cell.
(4) Determine O_i, h(G_i, k), and the weights of observations for each grid cell using the adaptive kernel estimation.
(5) Compute the percentage of YOLO-detected Flickr images to eBird observations for each grid cell.
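The bandwidth-selection step above (Steps 3-4) can be sketched as follows, with a naive linear scan standing in for the STR-tree spatial index, and one location per distinct user assumed for simplicity:

```python
import math

# Sketch of the adaptive-bandwidth step: expand each grid cell's neighborhood
# until it covers at least k distinct users, then use that radius as the
# bandwidth h(G_i, k). A naive linear scan replaces the STR-tree index used in
# the article; `user_locations` (one point per distinct user) is a
# hypothetical simplification of the aggregated observation data.
def adaptive_bandwidth(cell_center, user_locations, k):
    """Return (h, neighbors): the distance to the k-th nearest distinct user
    and the user locations that fall within that bandwidth."""
    by_dist = sorted(
        user_locations,
        key=lambda p: math.hypot(p[0] - cell_center[0], p[1] - cell_center[1]),
    )
    neighbors = by_dist[:k]
    h = math.hypot(neighbors[-1][0] - cell_center[0],
                   neighbors[-1][1] - cell_center[1])
    return h, neighbors

users = [(0, 1), (0, 2), (0, 5), (0, 9)]
h, nearby = adaptive_bandwidth((0, 0), users, k=3)
print(h)  # 5.0: the bandwidth expands until 3 distinct users are covered
```

In sparse areas the returned bandwidth grows large, which is precisely how the adaptive estimator avoids the unreliable small-area estimates of the fixed-distance filter.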

Results and Evaluation
Between December 2013 and December 2016, there were 19,711,242 geo-tagged Flickr images within the conterminous US. Table 1 illustrates the top 48 objects detected by YOLO, and the number of images that contain at least one of these objects. These objects were used to infer human activities as well as environmental characteristics of the locations where the photographs were taken. For example, the presence of bicycles may be useful to quantify biking behavior, sports balls may indicate sports activities, and objects such as sofas, beds, vases, and chairs may indicate indoor activities. In this article, we focused only on bird images, and used birdwatching activity as a case study to demonstrate the utility of our analytical framework. The object "bird" was the 5th most frequent object, detected in ~747 thousand images. We organized the results and evaluation under two subsections: verification and validation. We first present our comparative evaluation of metadata search and YOLO to verify the accuracy of both approaches. Second, we compare YOLO-detected birding activity to eBird observations to evaluate the validity and biases of Flickr and eBird data in inferring birdwatching activity.

Verification
Our objective for verification was to answer the following questions:

• Is object detection more accurate than metadata search for capturing bird images on Flickr?
• Are there any spatial and temporal biases between the results of metadata search and YOLO object detection?
While we detected 747,015 (3.8%) images with birds using YOLO, we detected 534,121 (2.7%) images containing bird keywords with the metadata-based search. Table 2 presents the temporal variability in the detection of birds by metadata search and YOLO. Overall, YOLO increased the detection of birds by over 50% of what metadata search could detect, and this increase was consistent across different seasons. There was a substantial increase of over one percentage point in the share of detected bird photographs when using YOLO compared to the metadata search. Among the 19.7 million images, both YOLO and metadata search detected birds in the same 409,779 (2%) images. Since both methods detected birds in these images, we considered the classification accurate. In order to identify the mismatch between the two methods, we further compared images detected only by YOLO and those detected only by metadata search. YOLO detected an additional 1.8% of bird images that were not detected by metadata search. On the other hand, metadata search detected only 0.7% additional images with bird keywords that were not detected by YOLO. We assessed the accuracy of images detected as containing birds by only YOLO or only metadata search, using human classification by the first author. We defined the human classification task with the question: "Is there a real bird in this photograph?" We used a random sample of 1000 bird photographs detected by only YOLO, or by only metadata search. According to the accuracy testing, bird images classified only by the metadata search but not by YOLO resulted in a substantially lower accuracy of 26%, while bird images detected only by YOLO resulted in an accuracy of 89%. Although our sample size for human classification was small, this finding confirmed the increased accuracy of YOLO detection.
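The headline detection rates in this paragraph follow directly from the reported counts; a quick check:

```python
# Sanity-checking the reported detection rates from the raw counts above.
TOTAL = 19_711_242     # geo-tagged Flickr images in the conterminous US
YOLO = 747_015         # images with at least one YOLO-detected bird
METADATA = 534_121     # images with bird keywords in their metadata
BOTH = 409_779         # images flagged by both methods

print(round(100 * YOLO / TOTAL, 1))      # 3.8
print(round(100 * METADATA / TOTAL, 1))  # 2.7
print(round(100 * BOTH / TOTAL, 1))      # 2.1
```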
Although YOLO detection had an accuracy of 89% in classifying birds, Figure 4 presents a variety of accurate and inaccurate YOLO classifications. Figure 4a,b,d represent accurate classifications of birds. The algorithm detected two birds in Figure 4a with confidence scores of 60% and 59%, although the image clearly contains more birds (five). Because this image was not tagged with bird keywords, it was not captured by the metadata search. The algorithm also detected a bench with 60% confidence, although there were multiple benches in this photograph. Both birds in Figure 4b were accurately detected with 85% and 80% confidence, and the bird in Figure 4d was accurately detected with 98% confidence. The remaining images in Figure 4c,e,f do not contain birds, but YOLO misclassified them as containing birds. The shapes of the butterfly in Figure 4c and of the flowers resemble bird features such as the wings, neck, and beak, which likely led to the misclassification; however, the confidence scores for these two images were low, at 54% and 51%, respectively. As we did not apply a threshold, we included every classification regardless of the probability value provided by YOLO. Finally, Figure 4f contains a realistic drawing of a hummingbird, which YOLO classified as a bird. This case illustrates a classification that is algorithmically accurate but semantically inaccurate, since our purpose was to identify real birds.
Figure 5 presents bird images detected only by the metadata search and not by YOLO. Figure 5a is an accurate classification of a woodpecker, a common bird species, thanks to the image title "Acorn woodpecker". YOLO was unable to detect the bird in this photograph because the bird blended well into the tree branch, which concealed its major features from the object detector. On the other hand, Figure 5b-d do not contain real birds, although they commonly contain "bird" keywords.
A comparison of the density of bird images obtained from the metadata search and YOLO is shown in Figure 6. We combined the observation counts from YOLO and the keyword search, and employed natural breaks classification to determine the class breaks in Figure 6. While the areas of bird images detected by YOLO and the metadata search substantially overlapped, YOLO identified more bird images
than the metadata search for most of the study area. YOLO identified a much larger number of bird pictures in urban areas such as New Orleans, San Francisco, New York, Washington D.C., and Seattle. Moreover, the YOLO results capture the continuity of bird habitat regions in the coastal areas of Florida, the Northeast, Lake Michigan, and California. In contrast, the density of bird images detected by the metadata search produced more fragmented spatial patterns across the nation.
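Natural breaks (Jenks) classification, used above to set the map class breaks, chooses class boundaries that minimize within-class variance. A minimal brute-force sketch for small 1-D datasets follows; the function name is ours, and production code would use an optimized dynamic-programming implementation instead of exhaustive search.

```python
from itertools import combinations

def natural_breaks(values, k):
    """Exact (brute-force) Jenks natural breaks for a small 1-D dataset:
    split the sorted values into k contiguous classes so that the total
    within-class sum of squared deviations is minimized."""
    vals = sorted(values)
    n = len(vals)

    def ssd(seg):  # sum of squared deviations from the segment mean
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg)

    best_cost, best_cuts = float("inf"), None
    # cut positions split the sorted values into k contiguous classes
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        cost = sum(ssd(vals[a:b]) for a, b in zip(bounds, bounds[1:]))
        if cost < best_cost:
            best_cost, best_cuts = cost, cuts
    # report the upper value of each class as its break
    return [vals[c - 1] for c in best_cuts] + [vals[-1]]
```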

Validation
Our objective for validation was to answer the following questions:
• To what extent can Flickr be used to infer birdwatching as a human activity pattern?


• Are there any spatial and temporal biases between YOLO-detected birding activities and eBird observations?
To answer these questions, we compared YOLO-detected Flickr bird image statistics with eBird observations. We first calculated the Spearman rank correlation based on the fixed-distance counts of observations and distinct users. We found a strong correlation between the count of eBird observations and YOLO-detected Flickr bird images, with a correlation coefficient of 0.79. Moreover, the counts of distinct eBird users and Flickr users produced an even larger coefficient of 0.85. These values indicate a strong overlap between eBird observations and Flickr bird images. We then compared the temporal patterns of Flickr images, users, and image-to-user ratios with eBird observations, users, and observation-to-user ratios (Figure 7). Overall, Flickr showed a declining trend from 2013 to 2016 in both the number of bird photographs and the number of users, consistent with the overall decline in Flickr usage. In contrast, eBird observations and users exhibited an increasing trend over the three-year period. Both Flickr and eBird photograph and user statistics peaked in spring months. The photograph-to-user ratio had an increasing trend for Flickr, whereas the eBird observation-to-user ratio was very consistent across the three-year period and peaked around the spring and summer months.
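The Spearman rank correlation used above is the Pearson correlation computed on the ranks of the two series. A minimal sketch for tie-free data follows (library routines such as `scipy.stats.spearmanr` additionally handle ties):

```python
def spearman(x, y):
    """Spearman rank correlation for tie-free data:
    rank both series, then compute the Pearson correlation of the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because it works on ranks, the coefficient is 1.0 for any strictly increasing relationship, which suits count data such as per-cell observation and user totals.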
Between December 2013 and December 2016, there were 125,179,161 eBird observations within the bounding box of the conterminous US, of which 115,682,223 fell exactly within the conterminous US. There were only 1,422,554 distinct coordinates, corresponding to about 1% of eBird observations in the conterminous US, mostly because multiple observations were made from the same site throughout the day. Among the 746,998 Flickr bird images, 346,549 (46%) had distinct coordinates, while the remaining 54% had coordinates that repeated more than once. This was likewise a result of the same user, or in rare cases multiple users, sharing multiple images from the same coordinates (e.g., habitat observation towers). We attribute this pattern to Flickr users' casual birdwatching behavior, in contrast to eBird users' serious birdwatching activity.
To identify the spatial variation between Flickr bird images and eBird observations, we compared kernel density estimates of YOLO detections and eBird observations with a fixed-distance threshold of 20 miles (Figure 8). We observed a greater dispersion in the spatial distribution of eBird observations, which can be attributed to the fact that eBird had approximately 167 times more observations than Flickr bird photographs, and approximately 3.7 times more users than Flickr users who took bird photographs. We computed z-scores for both YOLO and eBird observations in order to compare the two distributions, since eBird observations had a much higher density than YOLO-detected Flickr photographs. We combined the z-scores of the two datasets and employed natural breaks classification to determine the class breaks for the two maps in Figure 8. From Figure 8, we confirmed that the spatial distributions of eBird observations and Flickr photographs were similar, except for a few areas in which the magnitude and spatial extent of eBird and Flickr observations showed substantial differences. Both datasets indicate that high birdwatching activity takes place around coastal areas and populous regions adjacent to metropolitan areas. While the spatial patterns of birdwatching were similar between the two datasets, eBird was relatively more prominent in the coastal areas of the Northeast, Southeast, West, Gulf Coast, and Great Lakes; in national forests, prairie grasslands, and wetlands; and in areas with infrastructure for human access and birdwatching. While the magnitude of eBird density was much higher than that of Flickr across the nation, Flickr was relatively more prominent around urban areas such as New Orleans, Miami, and Detroit.
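The fixed-distance density and z-score standardization described above can be sketched as follows. This is an illustrative sketch under two simplifying assumptions: coordinates are already projected to a planar system so Euclidean distance is meaningful, and the density is a simple count within the radius (a uniform kernel). The function names are ours.

```python
import math

def fixed_distance_density(points, grid, radius):
    """Fixed-distance (uniform-kernel) density: for each grid cell center,
    count the observations that fall within `radius` of it."""
    return [
        sum(1 for px, py in points if math.hypot(px - gx, py - gy) <= radius)
        for gx, gy in grid
    ]

def zscores(values):
    """Standardize a density surface so datasets of very different
    magnitudes (e.g., eBird vs. Flickr counts) can share one class scheme."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```

After standardization, both surfaces are in standard-deviation units, which is what allows a single natural-breaks scheme to classify the combined z-scores of the two datasets.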
Figure 9 illustrates the percentage of YOLO-detected Flickr bird images among the combined Flickr and eBird observations. The figure represents the ratio of Flickr to eBird and highlights areas where YOLO-detected Flickr photographs exceed 1%, using adaptive kernel smoothing that employs the 100 nearest users (both Flickr and eBird) to define the neighborhood in the smoothing parameter. Figure 9 highlights prominent areas of Flickr bird photographs in natural lands that provide nesting, stopover, and overwintering habitat for birds. Interestingly, these spatial patterns were very distinct from the fixed-distance density distribution, and provided valuable input on where Flickr usage was relatively higher in comparison to eBird. Despite the difference in the number of observations between Flickr and eBird, Flickr bird photographs were prominent (over 10%) in areas with access and infrastructure for birdwatching across the nation. Example areas where Flickr bird photographs were relatively higher include the Grand Canyon and Colorado Plateau, Yellowstone National Park, southern Colorado, the national preserves and wildlife areas of southern Florida, and the wetlands and prairie lands of the Midwest. These Flickr users likely represent tourists who are not serious birdwatchers.
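The adaptive-kernel ratio behind this map can be sketched as follows: at each location, pool the Flickr and eBird users, take the k nearest, and report the Flickr share. The function name and the planar-distance assumption are ours; the published analysis uses a proper adaptive kernel smoother, of which this nearest-neighbor ratio is only a simplified stand-in.

```python
import math

def flickr_share(center, users, k=100):
    """Among the k nearest users (Flickr and eBird pooled), return the
    fraction who are Flickr users.  Because k is fixed, the neighborhood
    radius adapts to the local user density, as with an adaptive kernel.
    `users` is a list of (x, y, source) tuples with projected coordinates."""
    cx, cy = center
    nearest = sorted(users, key=lambda u: math.hypot(u[0] - cx, u[1] - cy))[:k]
    return sum(1 for u in nearest if u[2] == "flickr") / len(nearest)
```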

Discussion and Conclusions
In this article, we introduced an analytical framework that integrates a computer vision algorithm based on convolutional neural networks (CNN) with kernel density estimation to identify objects and infer human activity patterns from geo-tagged photographs. To demonstrate the framework, we inferred birdwatching activity by detecting birds in approximately 20 million publicly shared images on Flickr across a three-year period from December 2013 to December 2016. Our comparisons of Flickr and eBird observations highlight behavioral differences between social media and citizen science users, which we attribute to casual (Flickr) versus serious (eBird) birdwatching.
We have shown how the computer vision algorithm YOLO can be used for detecting objects and extracting semantics from geo-tagged and time-stamped social media images. Bird images classified only by the metadata search but not by YOLO had a substantially lower accuracy of 26%, while bird images detected only by YOLO had an accuracy of 89%. Our birdwatching case study, and our comparisons of patterns captured from Flickr with patterns from eBird observations, highlight the biases in social media and citizen science datasets. While eBird helps identify serious birdwatching behavior concentrated in particular areas across the US, the Flickr patterns suggest more casual and spatially diverse birdwatching activities. The results of our analysis provide valuable insights into the credibility and utility of geo-tagged photographs for studying birdwatching activities, and show the potential for studying other human activity patterns through object detection on large collections of geo-tagged, user-generated images.
Although eBird data have been used in a wide variety of ornithological studies across broad spatial and temporal scales [19,20], the data source has a number of significant biases related to users, locations, and time periods. For example, while in earlier periods citizen scientists collected information on a diverse set of species, in recent years they have been biased towards collecting information on threatened species and protected areas [53]. Flickr users, on the other hand, are usually photographers who are also birdwatchers, and who not only upload their images but also decide to geotag and share them. Our comparison of the spatial distributions of the two datasets highlights similar results as well as some geographic variations, which can be attributed to potential biases among citizen science and social media applications and their users. Whereas eBird users are more likely to travel long distances for bird observations, Flickr users are casual birdwatchers who are likely to take bird photographs within their usual activity spaces. Future studies extracting the mobility patterns of eBird and Flickr users could help better understand the dynamics of birdwatching activities.
In future work, we plan to complete the accuracy evaluation of all images classified by both the metadata search and the YOLO deep learning library. In addition, we plan to evaluate other object detection libraries and compare their accuracy with YOLO's. Beyond our particular focus on birdwatching, we plan to identify characteristics of locations based on the objects detected in an area over a period of time. In this way, we will examine whether object detection can be used to advance our understanding of places and the semantics embedded in them, and to identify similarities between places across the world.

Figure 1.
Figure 1. Overview of the analytical workflow.



Figure 2.
Figure 2. An example of a YOLOv3 (You Only Look Once) object detection result. The initial 24 convolutional layers of the network extract features from the image, and two fully connected layers predict the output bounding boxes and class probabilities. The output of the system is stored as an S × S × (B × 5 + C) tensor. A typical network architecture with S = 7, B = 2, C = 20 is shown in Figure 3 [44].
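The output dimensionality described in the caption can be checked directly; with S = 7, B = 2, and C = 20, the tensor holds 7 × 7 × 30 = 1470 values. The helper name below is ours, introduced only to make the arithmetic explicit.

```python
def yolo_output_size(S, B, C):
    """Number of values in YOLO's S x S x (B*5 + C) output tensor: each of
    the S*S grid cells predicts B boxes (x, y, w, h, confidence) plus C
    conditional class probabilities."""
    return S * S * (B * 5 + C)

print(yolo_output_size(7, 2, 20))  # 7 * 7 * 30 = 1470
```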

Figure 7.
Figure 7. Temporal patterns of YOLO-detected Flickr bird images and eBird observations.


Figure 8.
Figure 8. Z-scores of the fixed-distance (20-mile) density of (a) YOLO-detected Flickr bird images and (b) eBird observations.


Figure 9.
Figure 9. Percent of YOLO-detected Flickr bird images, computed by an adaptive kernel based on a minimum threshold of 100 users drawn from both Flickr and eBird.


Table 1.
Table 1. The number of images that contain at least one of the top 48 detected objects.