The Spillover Effect of Geotagged Tweets as a Measure of Ambient Population for Theft Crime

Abstract: As a measurement of the residential population, the Census population ignores the mobility of people. This weakness may be alleviated by the use of the ambient population, derived from social media data such as tweets. This research examines the degree to which geotagged tweets, in contrast to the Census population, can explain crime. In addition, the mobility of Twitter users suggests that tweets, as a measure of the ambient population, may have a spillover effect on neighboring areas. Based on a yearlong dataset of geotagged tweets, negative binomial regression models are used to test the impact of the tweet-derived ambient population, as well as its possible spillover effect, on theft crimes. Results show that: (1) tweets count is a viable replacement for the Census population in spatial theft pattern analysis; (2) tweets count as a measure of the ambient population shows a significant spillover effect on thefts, while no such spillover effect exists for the Census population; (3) the combination of tweets and its spatial lag outperforms the Census population in theft crime analyses. Therefore, the spillover effect of the tweet-derived ambient population should be considered in future crime analyses. This finding may be applicable to other social media data as well.


Introduction
The Census population has long been utilized as one of the most important factors in analyzing crime-related issues in criminology, sociology, economics, and many other fields [1][2][3][4][5][6]. However, it is generally recognized that the vast majority of the available Census population data represent the residential population and cannot capture the distribution of the ambient population [7][8][9]. Consequently, using the Census population alone to analyze crime can be problematic and sometimes biased [6,9,10,11,12,13,14,15]. Therefore, it is pressing to find complementary indicators of the ambient population with high spatio-temporal resolution to advance the understanding of crime patterns.
Due to their timely availability and free accessibility, tweets have been widely used by researchers to model patterns of the ambient population [16][17][18]. Sims and colleagues (2017) use Twitter posts and Facebook check-ins across a 24-hour period on football game days at the University of Tennessee, Knoxville, during the 2013 season to model the dynamics of the population distribution during a special event. After comparing the population distributions for game hours and non-game hours with the social media data, they confirm the reliability of using social media data to improve the dynamic population distribution model [19]. Patel and colleagues (2017) collect two months of geotagged tweets in Indonesia in 2013 and use their density as a covariate layer to map the distribution of the population. According to the authors, this approach significantly increases population mapping accuracy [20]. These studies suggest that geotagged tweets can be a significant input for location-based services because of their promise for detecting more dynamic population distributions, especially when people are not at their residential locations [17][18][19][20][21].
In environmental criminology, it is widely accepted that most criminal acts require the convergence in space and time of a motivated offender, a suitable target, and the absence of a capable guardian. This routine activity approach highlights the importance of the dynamic population distribution for crime. The development of and rapid change in human society encourage people to have more routine activities away from their homes [22]. The increased mobility and wider range of movement change the dynamic distribution of the population, which in turn influences crime patterns [23][24][25][26][27]. Moreover, Crime Pattern Theory suggests that both offenders and victims have their own behavior patterns, consisting of the same or different anchor locations. Crimes are more likely to happen where the anchor locations of their activity spaces intersect [28][29][30][31][32]. These anchor locations can be crime attractors, which create well-known criminal opportunities to which motivated criminals are attracted, or crime generators, which attract large numbers of people who have no intention of committing any crime [29][30][31]. Crime attractors and generators clearly have great potential to attract people; however, the number of people attracted to each attractor or generator is hard to discern from Census data. The spatio-temporal information embedded in Twitter posts (also known as tweets) can be an asset to crime pattern analysis by acting as an ambient population index, because it has the potential to capture the dynamic population distribution. Moreover, crime data have a high spatio-temporal resolution, and so do tweets. In contrast, Census data in the US tend to have a low spatio-temporal resolution. Incorporating tweets should therefore better explain crime because the resolutions match.
A number of studies have touched on the relationship between tweets and crime. Gerber (2014) uses semantic analysis to extract the topics of tweets and calculates the relative strength (weight) of individual topics as independent variables, which are combined with historical crime data to predict future crime in Chicago. The results suggest that adding Twitter-derived features improves prediction performance for 19 of 25 crime types compared with a model based solely on historical crimes [33]. His approach adds 300-900 topics (each topic is an independent variable, and these topics are not necessarily related to crime) to the prediction model, which would almost surely increase its performance. The theoretical foundation of this research is relatively weak, since it does not introduce sufficient criminological theory to justify the crime prediction model. In spite of this disadvantage, the study demonstrates the benefits of tweets for crime prediction. Other scholars also try to explain the crime distribution with geotagged tweets. Bendler and colleagues (2014) use the number of points of interest (POI) as the independent variable to simulate crime incidents that happened in San Francisco from August through October of 2013. They then add the count of tweets to the model and argue that the performance of the simulation model increases [34]. Ristea and colleagues (2017) use tweets count as an explanatory variable to assess the crime-tweets relationship, and they find a significant correlation between the two for aggregated crime types, anti-social behaviors, and other thefts [35]. However, none of these studies includes any socio-economic variables as controls when modeling crime. Therefore, they cannot answer whether the relationship between tweets and crime is spurious. Three UK scholars derive "broken window" indicators (e.g., neighborhood degeneration) from the content of tweets [36]. They add the frequency of Twitter posts and the count of tweets containing "broken window" indicators to a model, controlling for the necessary socio-economic variables, to simulate crimes that happened in 28 London boroughs from August 2013 to August 2014. They argue that naturally occurring social media data may provide an alternative information source on the crime problem [36]. Using a sample of tweets in the Southern California region from May 2015 to December 2015, Hipp and colleagues (2018) also suggest that this data source can help explain the existence of crime [15].
The aforementioned studies use the count of geotagged tweets or a transformation of it (e.g., the natural logarithm) as a replacement for the Census population in their crime models, on the presumption that these tweets can provide a better estimate of the ambient population than the Census population [14,15]. However, how much better this new indicator of the ambient population is has not been thoroughly measured with solid statistical methods. Moreover, although a body of literature has demonstrated that the count of tweets is an appropriate indicator of the ambient population, whether it should serve as a replacement for or a complement to the residential population in the crime model has not been tested. In addition, the mobility of Twitter users, together with the fact that the contents of tweets are not restricted to the tweets' locations, suggests that tweets, as a measure of the ambient population, may have a spillover effect on neighboring areas. However, such an effect has never been tested for the tweet-crime relationship.
The spillover effect of tweets is important because tweets serve as an indicator of the ambient population and contain mobility information that is not available in the Census population [8,14,37]. Moreover, the reason for using the ambient population in the model is to capture the non-residential population in the neighborhood. A Twitter user may tweet in the residential neighborhood, in adjacent neighborhoods, at a work location, or at other anchor locations of routine activities. Consequently, tweets can reveal embedded mobility information. As crime is closely linked to the actual pattern of the ambient population [8,14,37], a measure of the ambient population's spillover effect should help explain crime patterns.
One approach to representing this spillover effect is the spatial lag of the tweet-based ambient population. By definition, the spatial lag is a weighted average of a variable at "neighboring" locations, which can account for its spatial autocorrelation (i.e., the value of a variable in one unit is correlated with the values of the same variable in this unit's neighbors) [38][39][40]. The spatial lag regression model has been used in a few environmental criminology studies by introducing the spatial lag of crime as an explanatory variable [38,41,42]. However, the spatial lag of the ambient population has not been included in any crime model to account for its spatial spillover effect. Given the potential importance of this spillover effect of the ambient population on crime, this gap should be filled.
These research gaps motivate the following research questions: (1) Should the tweet-based ambient population be considered a replacement for or a complement to the residential population in theft crime models? (2) How does the spillover effect of the tweet-based ambient population contribute to the explanation of theft patterns? (3) Do tweets explain theft patterns better than the residential population?

Study Area and Data
The study area of this research is the City of Cincinnati, the core of the Greater Cincinnati Metropolitan Area. Cincinnati is within Hamilton County, OH. The University of Cincinnati is located in this city. Based on the data acquired from the Cincinnati Area Geographic Information System (CAGIS), the total area of Cincinnati is 206.01 km², and the city is composed of 50 neighborhoods.
In 2013, the total population of the city was 297,444. Figure 1 shows the spatial distribution of the Census population by neighborhood. Because choropleth maps are inappropriate for presenting counts and violate cartographic rules [43], we use a dot density map to display the population. One dot represents 200 individuals, so a neighborhood with more dots has a larger population. This map, like the other maps presented in this paper, is created with ArcGIS 10.4.1.
The Cincinnati Police Department provides us with the theft crime incident data for the entire year of 2013. More than 99% of these incidents (11,742/11,819) are successfully geocoded based on their addresses. Figure 2 shows the spatial distribution of the thefts by neighborhood. A dot density map is also used here, with one dot representing 15 theft incidents; a neighborhood with more dots has more thefts.
To be consistent with the theft crime data, geotagged tweets (N = 778,901) in Cincinnati for the year 2013 are retrieved by adopting a Python script initially written by Henrique [44]. Unlike the Twitter Streaming API, which is limited in time (no historical tweets) and in volume (at most one percent of all the tweets produced on Twitter) [45], this Python script mimics the tweet search in the browser [44]. Therefore, all the public geotagged tweets in Cincinnati are retrieved. The high representativeness of this dataset is a significant advantage of our study compared with earlier ones. The anonymous user ID, date and time, and latitude and longitude of the collected tweets facilitate the analysis in this study. Figure 3 shows the spatial distribution of the geotagged tweets by neighborhood as a dot density map. The more dots a neighborhood has, the more tweets it contains.
To get a sense of the plausible spillover effect of tweets, we randomly select a neighborhood, Mount Auburn. There are a total of 12,149 tweets, posted by 997 unique users, in this neighborhood during 2013. We select the top 35 users, who posted most of the tweets (7874 in total), based on the reasonable assumption that the most important anchor locations of these users, such as home or work, are in the neighborhood. Then all the tweets (18,938) posted by these 35 users in the remaining 49 neighborhoods are retrieved (Figure 4). This dot density map shows the spatial distribution of the tweets posted by these 35 users. The denser the dots, the more tweets there are in the neighborhood. The neighborhoods near Mount Auburn tend to have higher tweet counts, while distant neighborhoods have lower counts, indicating a distance decay pattern. A few exceptions exist because of the distribution of these users' other anchor locations of daily activities.
To further highlight the distance decay pattern, we also calculate these tweets' Euclidean distance to the centroid of Mount Auburn using ArcGIS 10.4.1. Since the dimension of Mount Auburn is about 1500 m, the bin interval for the histogram is set to 1500 m. Figure 5 shows the histogram of the tweets in each distance bin, demonstrating an obvious distance decay of tweets. This distance decay serves as an empirical foundation for the hypothesis of the spillover effect of tweets on theft crime. We also conduct a similar analysis on all 50 neighborhoods.
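The distance-decay histograms described here rest on a simple computation: measure each tweet's Euclidean distance to a neighborhood centroid and count tweets per 1500 m bin. The study performed this step in ArcGIS 10.4.1; the Python sketch below only illustrates the idea. It assumes coordinates are already projected to meters, and the function name `bin_by_distance` and the sample points are hypothetical.

```python
import math
from collections import Counter

def bin_by_distance(points, centroid, bin_width=1500.0):
    """Count points per distance bin measured from a centroid.

    points and centroid are (x, y) pairs in projected meters (an assumption;
    raw latitude/longitude would need to be projected first).
    Returns {bin_index: count}; bin k covers [k * bin_width, (k + 1) * bin_width).
    """
    counts = Counter()
    cx, cy = centroid
    for x, y in points:
        distance = math.hypot(x - cx, y - cy)  # Euclidean distance in meters
        counts[int(distance // bin_width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical tweet locations (meters) around a centroid placed at the origin.
width = 1500.0
sample_tweets = [(100, 200), (900, 900), (1600, 0), (0, 2900), (4000, 3000)]
histogram = bin_by_distance(sample_tweets, centroid=(0.0, 0.0), bin_width=width)

for k, n in histogram.items():
    print(f"{k * width:.0f}-{(k + 1) * width:.0f} m: {n} tweets")
```

A distance decay pattern would show up as counts that shrink as the bin index grows. Bins with no tweets are simply absent from the returned dictionary.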
Figure 6 shows the histogram of the tweets by routine users of each of the 50 neighborhoods, demonstrating an obvious distance decay from the centroid of the corresponding neighborhood. This suggests that the distance decay presented in Figure 5 is not incidental. It is noticeable that the tweets count in Figure 6 drops faster than that in Figure 5. It is possible that peripheral neighborhoods experience more drastic distance decay than those near downtown, as the anchor points of many users, such as work locations, are centered in and near the downtown area.
Figure 6. Counts of tweets of routine users in a neighborhood by distance from the centroid of the neighborhood for the City of Cincinnati.
Moreover, tract-level socio-economic data for 2013 are retrieved from the US Census Bureau, including total population, population under the poverty line, unemployed population, population younger than age 18, median household income, total houses, total houses currently occupied, total vacant houses, houses occupied by renters, and population by race. These socio-economic variables are aggregated to the neighborhood level to capture the collective characteristics of the neighborhood, as proposed by existing social science studies [46][47][48][49][50]. Neighborhood-level characteristics have been shown to be related to crime [3,51,52]. The neighborhood is also more identifiable and familiar to local residents, the media, and community councils. The results of neighborhood-level models make more sense to the public and thus may draw more public attention, as well as benefit community-oriented problem solving [53][54][55].
Drawing on social disorganization theory [56], the poverty rate, the unemployment rate, the young population (<18) rate, and the median household income are used to indicate concentrated disadvantage and inequality [48,57,58,59,60]. The housing rental rate and the housing vacancy rate are the indicators of residential instability [48,61,62]. The widely used ethnic heterogeneity index [32,63] is also added. This index is calculated as:
$$\text{Heterogeneity} = 1 - \frac{B^2 + H^2 + W^2 + O^2}{(B + H + W + O)^2}$$
where B, H, W, and O are the numbers of Black, Hispanic, White, and Other residents, respectively, living in the neighborhood. A score of 0 means a completely homogeneous neighborhood, while a score of 1 implies total heterogeneity. Additionally, points of interest (POI) data, including transit stations (bus, streetcar, train, etc.), ATMs, bank branches, bars, convenience stores, grocery stores, liquor stores, movie theaters, night clubs, recreational places, restaurants and shopping malls, are collected from Google Maps. The POI variable serves as an additional control in the statistical models, since these crime generators/attractors often influence crime opportunities [6,30,31,32,64,65].
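The aggregation and index described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the heterogeneity index takes the common form of one minus the sum of squared group shares, and that rates are computed from counts summed over tracts. The tract records, field names, and function names are all hypothetical.

```python
def heterogeneity_index(black, hispanic, white, other):
    """Return 1 minus the sum of squared group shares (0 = a single group)."""
    total = black + hispanic + white + other
    if total == 0:
        raise ValueError("neighborhood has no residents")
    shares = (black / total, hispanic / total, white / total, other / total)
    return 1.0 - sum(share * share for share in shares)

def aggregate_tracts(tracts):
    """Sum tract-level counts into neighborhood-level totals."""
    totals = {}
    for tract in tracts:
        hood = totals.setdefault(tract["neighborhood"], {})
        for field, value in tract.items():
            if field != "neighborhood":
                hood[field] = hood.get(field, 0) + value
    return totals

# Hypothetical tract records; all values are invented for illustration.
tracts = [
    {"neighborhood": "A", "population": 1000, "poor": 200,
     "black": 300, "hispanic": 100, "white": 500, "other": 100},
    {"neighborhood": "A", "population": 3000, "poor": 400,
     "black": 700, "hispanic": 100, "white": 2100, "other": 100},
]

hood = aggregate_tracts(tracts)["A"]
poverty_rate = hood["poor"] / hood["population"]  # rate from summed counts
index = heterogeneity_index(hood["black"], hood["hispanic"],
                            hood["white"], hood["other"])
print(f"poverty rate = {poverty_rate:.2f}, heterogeneity = {index:.2f}")
```

Only counts can be summed this way; a median such as household income needs a different aggregation rule, which the text does not specify. Under this form of the index, four equally sized groups give 0.75, so values close to 1 would require many more groups.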

Method
As the "Iron Law of Troublesome Places" indicates, a few places are responsible for most crimes, and most places do not experience any crime, so the distribution of crime is always skewed [66][67][68]. Therefore, the negative binomial regression model is selected to analyze theft crimes in Cincinnati, as it does not assume homogeneity of variance [69]. It is a Poisson-based regression model suitable for an over-dispersed dependent variable and has been widely used in environmental criminology studies [32,60,69,70]. The unit of analysis is the neighborhood (N = 50).
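For reference, a negative binomial count model is often written as below. This is a sketch assuming the common mean-dispersion (NB2) parameterization; the text does not state which parameterization was estimated. The symbols are introduced here for illustration: $y_i$ is the theft count in neighborhood $i$, $x_{ik}$ are the explanatory and control variables, and $\alpha$ is the dispersion parameter.

```latex
\begin{aligned}
y_i &\sim \mathrm{NegBin}(\mu_i, \alpha), \\
\ln \mu_i &= \beta_0 + \sum_{k} \beta_k x_{ik}, \\
\operatorname{Var}(y_i) &= \mu_i + \alpha \mu_i^{2}.
\end{aligned}
```

As $\alpha \to 0$ the variance collapses to the mean and the model reduces to Poisson regression, which is why $\alpha > 0$ accommodates an over-dispersed dependent variable.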
The dependent variable is the theft count in each neighborhood. The independent variables are the Census population, the count of geotagged tweets, and their spatial lags. The spatial lags of the independent variables are included to address the potential spillover effect. The spatial lag of tweets in a neighborhood is calculated as the average tweets count of all its immediately adjacent neighbors. The spatial lag of the Census population is calculated in the same manner. These spatial lag calculations are performed in GeoDa 1.14 [71]. The control variables are indicators of concentrated disadvantage (poverty rate, unemployment rate, young population (<18) rate, and median household income), residential instability (housing rental rate and housing vacancy rate), and ethnic heterogeneity. The count of crime generators/attractors is added as an additional control. To answer all the proposed research questions, six models with the same dependent variable and the same set of control variables, but different independent variables, are estimated in Stata 15 [72]. Table 1 shows the descriptive statistics of the variables used in the models.
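The spatial lag described above, the average value over a neighborhood's immediately adjacent neighbors, can be sketched in a few lines. The study computed it in GeoDa 1.14; the Python below is only an illustration. The adjacency lists, tweet counts, and function name are hypothetical, and the text does not say which contiguity rule defines "immediately adjacent" or how a unit without neighbors is treated.

```python
def spatial_lag(values, neighbors):
    """Average each unit's immediately adjacent neighbors' values.

    values:    {unit: number}
    neighbors: {unit: list of adjacent units}
    Units without neighbors get None (an assumption made for this sketch).
    """
    lag = {}
    for unit in values:
        adjacent = neighbors.get(unit, [])
        if adjacent:
            lag[unit] = sum(values[n] for n in adjacent) / len(adjacent)
        else:
            lag[unit] = None
    return lag

# Hypothetical tweet counts for four neighborhoods arranged in a row: A-B-C-D.
tweets = {"A": 100, "B": 400, "C": 250, "D": 50}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

tweet_lag = spatial_lag(tweets, adjacency)
for unit, value in tweet_lag.items():
    print(unit, value)
```

Because each neighbor receives the same weight, this corresponds to a row-standardized binary contiguity matrix. Entering `tweet_lag` alongside `tweets` in a regression is what lets tweets in surrounding neighborhoods explain thefts in the focal one.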

Results
Three negative binomial models are first estimated with different independent variables: the Census population (Model 1), tweets count (Model 2), and both the Census population and tweets count (Model 3) (Table 2). The standardized coefficient (β) is calculated by multiplying the unstandardized coefficient by the ratio of the standard deviation of the independent variable to that of the dependent variable [73,74]. The result of Model 1 shows that, after controlling for the necessary socio-economic variables and crime generators/attractors, the Census population has a significantly positive influence on thefts (β = 0.111, p-value = 0.002). This is in line with common sense and previous studies, as crime is a function of population distribution [4,75,76]. Model 2 replaces the Census population with the measure of the ambient population, tweets count. The result is promising: the count of tweets is also positively related to thefts at the 0.01 significance level (β = 0.059, p-value = 0.008).
Thus, the tweets count is a viable ambient population measure for theft crime analysis, as suggested by earlier studies [8,11,14,15,35]. Model 3 treats tweets count as a complementary index to the Census population, so both the Census population and tweets are included in this model. The Census population has a statistically significant positive effect on thefts (β = 0.095, p-value = 0.018), while tweets, acting as a complementary ambient population index, do not show a statistically significant effect (p-value = 0.172). The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are used to compare the models in terms of fit and complexity. Lower AIC/BIC values indicate a better model fit [77,78]. Among these three models, Model 1 has the lowest AIC (580.033) and BIC (601.065), while Model 2's criteria are slightly larger (AIC = 582.697; BIC = 603.729). When analyzing thefts, the model composed of the Census population and control variables (Model 1) therefore has the best fit, but the model composed of tweets and control variables (Model 2) has a comparable fit (less than 1% difference). In summary, tweets should not be used as a complementary index to the Census population; that is, the Census population and tweets should not be included in the same statistical model when analyzing theft crime patterns. However, tweets could be used as a replacement for the Census population, as they are supposed to show the ambient population distribution. Nevertheless, the fit of the tweets model (Model 2) is not necessarily better than that of the Census population model (Model 1).
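The "less than 1% difference" between Models 1 and 2 can be checked directly from the reported criteria. The sketch below also backs out the number of estimated parameters implied by the gap between BIC and AIC. It assumes the textbook definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L with n = 50 neighborhoods; the function names are hypothetical.

```python
import math

def relative_gap(a, b):
    """Percentage by which b exceeds a."""
    return (b - a) / a * 100.0

def implied_parameters(aic, bic, n):
    """Solve BIC - AIC = k * (ln(n) - 2) for k.

    Assumes AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L.
    """
    return (bic - aic) / (math.log(n) - 2.0)

n = 50  # neighborhoods, the unit of analysis
model1 = {"aic": 580.033, "bic": 601.065}  # Census population + controls
model2 = {"aic": 582.697, "bic": 603.729}  # tweets count + controls

aic_gap = relative_gap(model1["aic"], model2["aic"])
bic_gap = relative_gap(model1["bic"], model2["bic"])
k1 = implied_parameters(model1["aic"], model1["bic"], n)
print(f"AIC gap: {aic_gap:.2f}%  BIC gap: {bic_gap:.2f}%  implied k: {k1:.2f}")
```

Both gaps come out below half a percent. Under the stated definitions the implied parameter count for Model 1 is about 11, which would be consistent with one population measure, eight controls, an intercept, and a dispersion parameter, although the text does not report the count.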
To further assess the spillover effect of tweets count on thefts, three additional negative binomial models are estimated with different independent variables and their spatial lags: the Census population with its spatial lag operator (Model 4), tweets count with its spatial lag operator (Model 5), and both the Census population and tweets count with their spatial lag operators (Model 6) (Table 2). Model 4 is composed of the Census population, its spatial lag operator, and the control variables. Similar to the result of Model 1, the Census population still has a significantly positive influence on thefts (β = 0.095, p-value = 0.017), while its spatial lag operator does not (β = 0.038, p-value = 0.181). This is in line with our hypothesis that the residential population cannot store mobility information and therefore does not show a significant spillover effect on thefts. Model 5 tells a different story: tweets count retains a statistically significant positive effect on thefts (β = 0.073, p-value = 0.002). Meanwhile, its spatial lag operator also shows a significantly positive effect on thefts (β = 0.113, p-value < 0.001). The spatial lag operator of tweets has an even more significant (p-value < 0.001 vs. p-value = 0.002) and stronger (β = 0.113 vs. β = 0.073) influence on thefts than tweets count itself. This result supports our hypothesis that tweets count, as a measure of the ambient population, can capture the non-residential population and has a significant spillover effect on theft crimes. Model 6 is composed of the Census population, tweets count, and their spatial lag operators. In this model, the Census population variable is not significant (p-value = 0.194), while the effect of tweets is marginally significant (β = 0.057, p-value = 0.048).
This also supports the findings from Models 2 and 3: tweets as a measure of the ambient population should be considered a replacement for the Census population rather than a complement. Additionally, among all six models, Model 5 has the lowest AIC (566.890) and BIC (589.834). The count of crime attractors/generators is significant with a positive coefficient in every model, indicating a strong relationship between thefts and crime attractors/generators. The housing rental rate is significant in Models 1, 3, and 4 with positive coefficients. The poverty rate is significant in Models 5 and 6 with negative coefficients.
The check of the model residuals' spatial autocorrelation indicates that adding the spatial lag makes the model residuals randomly distributed across space, which further suggests the importance of the spillover effect of the population measure. Such evidence further confirms that the tweet-based ambient population and its spatial lag outperform the residential population in modeling theft crime. Thus, the answers to the research questions are: (1) tweets count as a measure of the ambient population should be considered a replacement for the residential population in theft crime models; (2) the spillover effect of tweets count as a measure of the ambient population is significant in theft crime pattern analysis, whereas the Census residential population does not show any significant spillover effect on theft crimes; (3) tweets count alone as an ambient population measure does not necessarily explain theft patterns better than the Census population; however, the model composed of tweets and its spatial lag has a better fit than that of the Census population. Thus, tweets can indeed be used as a viable measure of the ambient population that can function as a replacement for the Census population in crime analyses, as previous research has suggested [15,33,35,36,41,42], and more importantly, the combination of tweets and its spatial lag operator outperforms the residential population in modeling crime.
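The text does not name the statistic used to check the residuals' spatial autocorrelation. One common choice is global Moran's I, sketched below with binary adjacency weights. The function name, the toy values, and the adjacency lists are hypothetical and are not taken from the study.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary adjacency weights.

    values:    {unit: number}, for example model residuals
    neighbors: {unit: list of adjacent units}, assumed symmetric
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {unit: value - mean for unit, value in values.items()}
    # Cross-products of deviations over every adjacent (i, j) pair.
    cross = sum(dev[u] * dev[v] for u in values for v in neighbors.get(u, []))
    weight_sum = sum(len(neighbors.get(u, [])) for u in values)
    squared = sum(d * d for d in dev.values())
    return (n / weight_sum) * (cross / squared)

# Four hypothetical units arranged in a row: A-B-C-D.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
trending = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}       # similar neighbors
alternating = {"A": 1.0, "B": -1.0, "C": 1.0, "D": -1.0}  # dissimilar neighbors

print(morans_i(trending, chain))
print(morans_i(alternating, chain))
```

Positive values indicate that neighboring units carry similar values and negative values indicate the opposite. Residuals that are randomly distributed across space should give a value near the expectation under no autocorrelation, -1/(n - 1); significance is usually judged with a permutation test, which this sketch omits.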

Discussion and Conclusions
This study collects all the searchable public geotagged tweets in Cincinnati and assesses their relationship with crime patterns, with the necessary socio-economic and crime generator/attractor variables controlled. Results of the negative binomial models indicate that tweets can be used as a measurement of the ambient population for crime analysis. This is highly consistent with the findings of previous studies [15,33,35,36,79]. Another highlight of this study is the successful detection of the spillover effect of tweets. That is, crimes in a neighborhood can be explained by tweets in its surrounding neighborhoods. Essentially, tweets capture mobility information, as revealed by a distance pattern in which the number of tweets declines from the main anchor locations of Twitter users' daily activities to more distant places. In contrast, the Census population cannot store mobility information; thus, such a spillover effect does not exist for the Census population. The ability of geotagged tweets to capture the mobility of Twitter users [8,14,37] makes the tweets-derived ambient population superior to the Census population in representing the dynamic distribution of the population. In order to alleviate the potential bias caused by the modifiable areal unit problem (MAUP) [80,81], the models are also tested at both the Census block group and Census tract levels. The results show that the spillover effect of the tweets-derived ambient population also consistently exists at these finer levels.
Tweets-derived ambient population has significantly higher spatio-temporal resolution than the Census population, which makes it a better indicator of the dynamic population distribution [13,15,76]. While an individual is counted in only a single neighborhood in the Census data, that individual can post tweets not only in the home neighborhood but also in other neighborhoods he or she frequents. Thus, tweets can store mobility information that is not available in the Census residential population. It should be acknowledged that commuting flows are available in the form of an origin-destination (OD) matrix in some countries. However, people frequent additional anchor locations besides homes and offices, such as their favorite restaurants and grocery stores. Flows related to these additional anchor points are not available in the OD matrix. Tweets-derived ambient population may capture this additional mobility information. A high spatial concentration of tweets may indicate the clustering of a large number of people, for example at sports games, music festivals, or other unusual events that can potentially affect public safety. Law enforcement agencies can use information from social media such as tweets to detect potential problems. One example is Raven911, created by the Ohio-Kentucky-Indiana Regional Council of Governments (OKI). This internet-based mapping system is used by first responders during emergency situations, such as inclement weather, threats of fire, chemical leaks and even terrorist threats [82,83]. Officers can use this in-house mapping system to capture real-time social media posts in a small area (e.g., several blocks) and monitor the situation to avoid potential harm to public safety. Similar applications are also seen in other places [84][85][86][87]. The spillover effect detected in this study can help advance the developing framework of these applications: monitoring a small area is useful and easy; however, because the spillover effect exists, the ability to detect emerging trends in nearby areas is needed as well.
We acknowledge the non-representativeness of the Twitter user composition compared with the Census population. Previous studies have shown that Twitter users are disproportionately young [15,88–95], African American [89,90,94], and urban residents [15,89,92–95]. However, it has long been recognized that the youth are disproportionately likely to be involved in criminal activities, either as offenders [96,97] or victims [97][98][99]. In other words, young people are more likely to tweet, and young people are also more likely to be involved in crime. In this sense, given that the young population is controlled in the statistical models, the non-representativeness of tweets should not introduce an unacceptable bias. Moreover, this skewed user composition can actually help explain the spillover effect of tweets on crimes: tweets tend to capture the dynamic distribution of young people [88][89][90][91] and crimes tend to happen where young people are [96][97][98][99]. Another limitation of tweets is that only 4.2% of all Twitter users typically decide to share their location when posting [100]. Also, a person might share the location of tweets only at selected locations. In addition to tweets, posts on other social networking sites such as Facebook may provide more "comprehensive" coverage of the population, since Facebook is the largest global social network [101][102][103]. Unfortunately, Facebook data containing detailed location information are not publicly available. These limitations could be sources of bias. However, the spatial coverage of the geotagged tweets matches that of the crimes in the study area (Figures 2 and 3). In addition, the model results underscore the reliability of the tweets count as a measure of the ambient population, as suggested by earlier studies [15,33,35,36,41,79,104,105].
In conclusion, analysis of a yearlong tweets dataset, covering all searchable public geotagged tweets in Cincinnati, confirms the hypothesized spillover effect of tweets, as a measure of the ambient population, on theft crimes. Results of negative binomial regression models composed of tweets count, Census population, their spatial lags, the necessary socio-economic variables, and the crime attractors/generators lead to three major findings: (1) tweets count is a viable replacement for the Census population in spatial theft analysis; (2) tweets count as a measure of the ambient population shows a significant spillover effect on thefts, while such a spillover effect does not exist for the Census population; (3) the combination of tweets and its spatial lag outperforms the Census population in theft crime analyses. Thus, the spillover effect of tweets as a measure of the ambient population should not be overlooked in crime analyses. This finding may be applicable to other social media data as well. The spillover effect may also be seen in other ambient population measures, such as taxi ridership and cellphone user locations. Further, routine-activity-related research in areas such as health and safety may benefit from this study.