Solving Competitive Location Problems with Social Media Data Based on Customers' Local Sensitivities

Competitive location problems (CLPs) are a crucial business concern. Evaluating customers’ sensitivities to different facility attractions (such as distance and business area) is the premise for solving a CLP. Currently, the development of location-based services facilitates the use of location data for sensitivity evaluations. Most studies based on location data assumed the customers’ sensitivities to be global and constant over space. In this paper, we proposed a new method of using social media data to solve competitive location problems based on the evaluation of customers’ local sensitivities. Regular units were first designed to spatially aggregate social media data to extract samples with uniform spatial distribution. Then, geographically weighted regression (GWR) and the Huff model were combined to evaluate local sensitivities. By applying the evaluation results, the captures for different feasible locations were calculated, and the optimal location for a new retail facility could be determined. In our study, the five largest retail agglomerations in Beijing were taken as test cases, and a possible new retail agglomeration was located. The results of our study can help people have a better understanding of the spatial variation of customers’ local sensitivities. In addition, our results indicate that our method can solve competitive location problems in a cost-effective way.


Introduction
In most real situations, location decisions must account for the competition between retail facilities; such decisions are known as competitive location problems [1][2][3][4]. The aim of competitive location is to place a new retail facility or agglomeration at the location that maximizes its capture. Evaluating customers' sensitivities to different facility attractions (such as distance and business area) is the premise for solving a competitive location problem [5,6]. Based on the evaluation results, the optimal location for a new retail facility, that is, the one that provides the largest capture, can be determined.
Traditionally, data for evaluating customers' sensitivities have been obtained mainly from surveys and questionnaires. By investigating customers, much customer-related information can be collected, including home locations and visitation frequencies for given retail facilities. The information obtained is relatively complete and accurate. However, the methods of collecting traditional data (such as surveys) are labor intensive and time consuming [7]. Additionally, the spatial distribution of traditional data is uneven, and the data size is limited [8,9]. Because of these disadvantages, the accuracy of sensitivity evaluations using traditional data is relatively low [10,11]. Other data are therefore needed to solve competitive location problems, and location data might provide a solution [10,11].
With the development of location-based services, location data (such as mobile phone location data, taxi trajectory data and social media data) provide new opportunities for evaluating customers' sensitivities to different attractions [10][11][12][13][14]. Compared with traditional data, location data are more widely distributed, and the data size is much larger. Lu et al. [10] designed an experiment for evaluating customers' sensitivities with mobile phone location data and revealed the effects of sample location. Yue et al. [11] applied taxi trajectory data to sensitivity evaluations and delimited the spatial distribution of customers for target retail agglomerations. Based on social media data, Qu et al. [12] and Hu et al. [13] discussed how distance influences customers' visitation behavior. By using the Huff model, Wang et al. proposed an effective method to extract samples from social media data that are suitable for delimiting trade areas [14]. In their study, Wang et al. quantitatively investigated customers' global sensitivities to distance and business area. All of these studies assumed that customers' sensitivities are global and spatially homogeneous. However, owing to local differences in sociodemographic characteristics (such as the density and the income of the population), customers' sensitivities are spatially heterogeneous. To date, no studies exist regarding how to accurately use location data to evaluate customers' local sensitivities to facility attractions.
In this paper, we propose a new method for using social media data to solve competitive location problems by accurately evaluating customers' local sensitivities. Based on the proposed method, we try to address the following research questions: (1) What are the characteristics of the spatial distribution of customers' local sensitivities? (2) To what extent can the method that combines the Huff model and geographically weighted regression (GWR) evaluate customers' local sensitivities at a high spatial resolution? (3) Can social media samples be a reliable data source for the evaluation of customers' local sensitivities?
Our method includes three main steps: sample extraction, local sensitivity evaluation and capture estimation. Regular spatial units were first designed to extract samples with uniform spatial distribution by spatially aggregating social media data. Then, the Huff model and GWR were combined to evaluate the customers' local sensitivities. The Huff model is one of the most widely used competitive location models, and the sensitivity parameters in this model were used to represent the customers' sensitivities [15,16]. Finally, through comparative analysis of the local and global sensitivities, suitable evaluation results were obtained for capture estimation. Based on the capture estimation, we took the feasible area with the largest capture as the optimal area for a new retail facility. The contributions of our study are twofold. First, the results of our study can help people better understand the spatial variation of customers' sensitivities. Second, our study provides a cost-effective way to evaluate customers' local sensitivities and solve competitive location problems with social media data.

Sina Weibo
Sina Weibo is one of the largest social media services in China and is considered to be the "Chinese Twitter" [17]. As of March 2018, the number of active daily social media users had reached 184 million [18]. On the Sina Weibo platform, users can contact each other and post messages called "microblogs". A microblog can take the form of pictures, webpage links, video links or text with a 140-Chinese-character limit. With the development of location services, a location can also be appended to a microblog. In addition, Sina Weibo provides a set of application programming interfaces (APIs) for collecting microblogs, comments and the public information of users. In this study, we collected geotagged microblogs within a given time period and spatial area by applying the Sina API named "place/nearby_timeline".

Competitive Location Approach
Many approaches have been proposed to solve competitive location problems [6]. These approaches range from the simple, such as the proximity model, to the sophisticated, such as the Huff model. All the approaches except the proximity model require a large number of samples to evaluate the customers' sensitivities to facility attractions [19].
The proximity model was first proposed by Hotelling in 1929 [20]. This model considers the location of two competitive facilities on a segment based on the assumption that distance is the only facility attraction. If one facility is already located on a segment, the location of this facility divides the segment into two parts. A new facility can be located on the longer part of the segment. This approach is not widely applied since it ignores other facility attractions (such as business area) [21].
To overcome the disadvantages of the proximity model, the deterministic utility approach was introduced to solve competitive location problems [22]. This approach first requires many samples to estimate the utility function parameters that represent the customers' sensitivities. Then, the utilities can be calculated by using the estimated parameters. Last, the approach transforms the utility into a distance markup, and the break-even distance is obtained. The break-even distance refers to the maximum distance that a customer is willing to accept to visit a farther facility. A new facility can be located within the break-even distance. One problem with this approach is the assumption that all customers in the same spatial area are willing to visit the same facility [23].
The random utility approach can be considered to be an extension of the deterministic utility approach [24]. The random utility approach applies the multivariate normal distribution to measure the utilities of competitive facilities. Based on the utilities, the probability that customers visit the target facility is calculated. After calculating the probabilities, the captures for new facilities and the optimal location can be obtained. This approach uses the random distribution of the utility functions to overcome the problem of the deterministic utility approach [25]. The disadvantage of the random utility approach is that the utility decreases slowly for small distances and sharply for large distances [26].
The Huff model is one of the most widely used approaches in the field of competitive location studies [15]. This model assumes that customers are sensitive to the business area of a facility and to the distance to it [27]. The customers' sensitivities are represented by the sensitivity parameters of the model [28]. The Huff model formula is:

$$P_{ij} = \frac{B_j^{\alpha} D_{ij}^{\lambda}}{\sum_{k=1}^{n} B_k^{\alpha} D_{ik}^{\lambda}} \qquad (1)$$

where $P_{ij}$ is the probability that customers located in spatial area i visit the facility or agglomeration j, $B_j$ is the business area of the retail facility or agglomeration j, $D_{ij}$ is the distance between the spatial area i and the retail facility or agglomeration j, n is the number of competitive facilities within the study area, and $\alpha$ and $\lambda$ are the sensitivity parameters of business area and distance, respectively. These two sensitivity parameters were originally considered to be global and were defined as 1 and −2. Because the customers' sensitivities are spatially heterogeneous, the sensitivity parameters are in fact local. The Huff model with local parameters can be expressed as follows:

$$P_{ij} = \frac{B_j^{\alpha_i} D_{ij}^{\lambda_i}}{\sum_{k=1}^{n} B_k^{\alpha_i} D_{ik}^{\lambda_i}} \qquad (2)$$

where $\alpha_i$ and $\lambda_i$ are the local sensitivity parameters in spatial area i. Compared with other methods, the Huff model considers a relatively complete set of attractions, and its formula is more reasonable. Therefore, we applied the Huff model to solve the competitive location problem in this study.
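As an illustration of the Huff model described above, the following minimal Python sketch computes visitation probabilities as the ratio of one facility's attraction (business area raised to α, multiplied by distance raised to λ) to the sum of attractions over all competing facilities. The function name `huff_probabilities` and the array layout are assumptions made for this example and are not taken from the study; passing scalars for `alpha` and `lam` gives the global model, and passing one value per spatial area gives the local model.

```python
import numpy as np

def huff_probabilities(business_area, distance, alpha=1.0, lam=-2.0):
    """Return P[i, j], the probability that area i visits facility j.

    business_area: shape (n_facilities,), business area of each facility
    distance:      shape (n_areas, n_facilities), strictly positive distances
    alpha, lam:    scalars (global parameters) or arrays of shape (n_areas,)
                   (local parameters); the defaults are the classic 1 and -2
    """
    b = np.asarray(business_area, dtype=float)[np.newaxis, :]
    d = np.asarray(distance, dtype=float)
    # Reshape the parameters to column vectors so that each area (row)
    # can carry its own alpha_i and lambda_i.
    a = np.reshape(np.asarray(alpha, dtype=float), (-1, 1))
    l = np.reshape(np.asarray(lam, dtype=float), (-1, 1))
    attraction = b ** a * d ** l
    # Normalize each row so that the probabilities of one area sum to 1.
    return attraction / attraction.sum(axis=1, keepdims=True)
```

With two facilities of equal business area at distances 1 and 2 and the default parameters, the nearer facility receives four times the attraction of the farther one.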

Study Area
The area surrounded by the fifth ring road in Beijing is taken to be the study area. Beijing is the capital of China and is the second largest metropolis in China. With the development of this metropolis, many retail agglomerations formed. The five largest retail agglomerations in Beijing were taken as test cases, and the location for a new retail agglomeration was determined in this study. The location of each retail agglomeration and the study area are shown in Figure 1. "Z", "W", "G", "X" and "C" represent the retail agglomerations Zhongguancun, Wangfujing, Guomao, Xidan and Chaowai, respectively. Each agglomeration has a relatively convenient traffic pattern and can attract a large number of customers every day.

Data Collection and Preprocessing
In this study, we collected Sina Weibo data posted within the study area by using the API provided by the Sina corporation. The Sina API is similar to the Twitter API. Both APIs return no more than 1% of all messages and can collect geotagged messages posted within a circle with a given center and radius [29,30]. To our knowledge, there are also some differences between the two APIs. With the Sina API, we can set the start and end times of the microblogs that we want to collect, and the maximum time range of the geotagged microblogs that can be collected is 30 days. By contrast, the Twitter API requires tweet identifiers as input rather than start and end times, and the maximum time range of the geotagged tweets that can be collected is 7 days.
A set of 16,682,330 geotagged microblogs posted between 1 January 2014 and 28 February 2015 was collected. Each microblog in our dataset contains more than 50 attributes. These attributes reveal detailed information related to the microblog and its publisher. Data samples with some important attributes are shown in Table 1. The attributes in Table 1 are as follows: (1) "id", "created_at", "text" and "user_id" refer to the identification, posting time, text and user identification of the microblog, respectively; (2) "geo" refers to the posting location; (3) "retweet_status" indicates whether the microblog is original: "1" means that the microblog reposts (retweets) another microblog, and "0" means that it is original; (4) "POI_id" and "POI_title" refer to the identification and name of the POI at which the user checked in. These two attributes are null in some microblogs because some users post microblogs without checking in at any POI; (5) "source" refers to the name of the application or phone model that the user applied to post the microblog.
To filter out the noise and outliers in the social media dataset, the microblogs were preprocessed. The noise mainly refers to advertisements and to microblogs that come from non-human sources, namely, bots [31][32][33]. Compared to microblogs without location information, geotagged microblogs contain less noise and are more reliable. This is because geotagged microblogs on the Sina Weibo platform are all original, whereas a large amount of noise consists of reposted microblogs (similar to retweets). Samples of noise are shown in Table 1. The noise among 100,000 randomly selected microblogs was first manually identified by members of our research group. Then, by analyzing the attributes of the noise, two findings were obtained: (1) most microblogs with particular symbols in their texts, such as "【】", were advertisements; (2) most microblogs posted by bots have a particular "source", such as "unapproved application" and "PP time machine". By filtering out the microblogs with these particular symbols and "source" values, 16,669,258 microblogs were retained. Detailed information about the changes to our dataset is shown in Table 2.
After filtering out the noise in the dataset, we removed the outliers. Some users may post a large number of microblogs with the same location information in a short time. These microblogs would influence the final results of our study and can be considered outliers [34]. Based on the study of Rzeszewski et al. [34], we restricted geotagged microblogs to one location per user in our case. Specifically, no matter how many microblogs with the same location information one user posted, we retained only one microblog for that user in one week. Finally, as shown in Table 2, 16,664,073 geotagged microblogs posted by 2,428,294 users were retained for further analysis.
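The two preprocessing rules described above (noise filtering by text symbols and "source", then keeping one microblog per user and location per week) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the dictionary layout of a microblog, the helper names `is_noise` and `filter_posts`, and the use of ISO calendar weeks to define "one week" are all assumptions made for the example. The symbol and source lists contain only the examples named in the text.

```python
from datetime import datetime

# Examples named in the text; a real filter would likely use longer lists.
AD_SYMBOLS = ("【", "】")
BOT_SOURCES = {"unapproved application", "PP time machine"}

def is_noise(post):
    """Flag advertisements (by text symbols) and bot posts (by source)."""
    if post["source"] in BOT_SOURCES:
        return True
    return any(symbol in post["text"] for symbol in AD_SYMBOLS)

def filter_posts(posts):
    """Drop noise, then keep one post per user, location and week.

    Each post is a dict with the keys "user_id", "geo", "text", "source"
    and "created_at" (a datetime). Weeks are ISO calendar weeks, which is
    an assumption of this sketch.
    """
    seen = set()
    kept = []
    for post in posts:
        if is_noise(post):
            continue
        year, week, _ = post["created_at"].isocalendar()
        key = (post["user_id"], post["geo"], year, week)
        if key in seen:
            continue  # same user, same location, same week: outlier
        seen.add(key)
        kept.append(post)
    return kept
```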

Method
In this section, we detail a new method of using social media data to solve competitive location problems by accurately evaluating customers' local sensitivities to facility attractions. The framework of our method is shown in Figure 2. First, we extracted the home location from the geotagged social media data for each user. To overcome the disadvantages of traditional samples (such as uneven distribution and limited data size), samples with uniform spatial distribution were extracted by spatially aggregating the home locations of users. Then, based on the samples, GWR and the Huff model were combined to evaluate the customers' local sensitivities. Last, the captures for different feasible locations were estimated, and the optimal location for a new retail agglomeration was determined.
ISPRS Int. J. Geo-Inf. 2019, 8, x FOR PEER REVIEW

Sample Extraction
To extract samples with uniform spatial distribution, regular spatial units were applied to spatially aggregate the social media data. Based on Equation (2), there are three types of sample attributes: the distance between the home location and the retail agglomeration; the visitation probability for the retail agglomeration; and the business area of the retail agglomeration. The method for calculating the attributes for each aggregated sample is discussed next.

Home Location Extraction
Extracting the home locations of the users who are attracted by retail agglomerations is the basis of calculating the distance, which is an important sample attribute. In this study, we first identified the attracted users. Then, the home location of each attracted user was extracted. Because the business hours of most retail facilities are from 9:00 AM to 10:00 PM [11], the users who posted microblogs while they were located at the retail agglomerations during this time period were identified as attracted users. A total of 87,171 attracted users were identified from our dataset. By applying the method proposed by Qu et al. [12], we then extracted the home location of each attracted user. Finally, the home locations of 31,382 attracted users were successfully obtained. The extraction results show that a large number of Sina Weibo users posted too few geotagged microblogs for their home locations to be extracted. A similar finding was reported by Rzeszewski et al. [34].

Spatial Aggregation
Because of their uneven distribution and limited data size, traditional samples cannot be applied to evaluate the local sensitivity accurately [10,11]. To provide reliable data support for the local sensitivity evaluation, we applied the method proposed by Wang et al., which is described in detail in their study, to spatially aggregate social media data and calculate sample attributes [14]. Based on this method, regular 600 m × 600 m grids were designed to spatially aggregate the home locations of the attracted users. Through spatial aggregation, all grids were retained and a set of 1411 aggregated samples was obtained. These samples are uniformly distributed and can reflect the overall visitation behavior of attracted users in each spatial unit. Therefore, compared with traditional samples, aggregated samples are more suitable for the local sensitivity evaluation in each unit.
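The aggregation step can be illustrated with a short sketch that bins home locations into 600 m square cells and computes, for each cell, the share of attracted users who visited each agglomeration. This is only one plausible reading of the procedure: the details of the method of Wang et al. [14] are not given here, so the use of visit shares as visitation probabilities, the projected metre coordinates, and the names `grid_cell` and `aggregate_visits` are assumptions of this example.

```python
from collections import Counter

CELL_SIZE_M = 600.0  # grid size stated in the text

def grid_cell(x, y, origin=(0.0, 0.0), cell=CELL_SIZE_M):
    """Map projected coordinates (in metres) to integer grid indices."""
    return (int((x - origin[0]) // cell), int((y - origin[1]) // cell))

def aggregate_visits(records, origin=(0.0, 0.0)):
    """Aggregate (home_x, home_y, agglomeration) records per grid cell.

    Returns {cell: {agglomeration: share of the cell's attracted users}}.
    Treating this share as the visitation probability is an assumption
    of this sketch.
    """
    counts = {}
    for x, y, agglomeration in records:
        cell = grid_cell(x, y, origin)
        counts.setdefault(cell, Counter())[agglomeration] += 1
    shares = {}
    for cell, counter in counts.items():
        total = sum(counter.values())
        shares[cell] = {name: n / total for name, n in counter.items()}
    return shares
```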

Local Sensitivity Evaluation
Evaluating customers' sensitivities is the premise of calculating the capture of a new retail facility or agglomeration [5,6]. Based on the samples extracted, the Huff model and the GWR method were applied to evaluate the customers' local sensitivities. GWR was proposed by Brunsdon et al. [35] and Fotheringham et al. [36] and assumes that nearby locations have similar values. GWR is an effective method of evaluating the spatial variation in the relationships between variables across the entire space [37,38]. Therefore, the original formula of GWR was suitable for evaluating local sensitivities in this study. The formula is expressed as:

$$y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i \qquad (3)$$

where $y_i$ is the dependent variable, $x_{ik}$ is the k-th independent variable, p is the number of independent variables, $(u_i, v_i)$ are the coordinates of spatial unit i, $\varepsilon_i$ is the random error, and $\beta_k(u_i, v_i)$ is the regression parameter in spatial unit i and is a function of the coordinates $(u_i, v_i)$.
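Because GWR can only deal with linear models, the Huff model with local sensitivity parameters (Equation (2)) was transformed into a linear model by using Nakanishi and Cooper's transformation [39]. The transformed Huff model is expressed as:

$$\ln\left(\frac{P_{ij}}{\tilde{P}_i}\right) = \alpha_i \ln\left(\frac{S_j}{\tilde{S}}\right) + \lambda_i \ln\left(\frac{D_{ij}}{\tilde{D}_i}\right) \qquad (4)$$

where $P_{ij}$ is the probability that attracted users located in unit i visit the retail agglomeration j, $S_j$ is the business area of retail agglomeration j (written $B_j$ in Equation (2)), $\tilde{P}_i$, $\tilde{S}$ and $\tilde{D}_i$ are the geometric means of $P_{ij}$, $S_j$ and $D_{ij}$, respectively, and $\alpha_i$ and $\lambda_i$ are the local sensitivity parameters of business area and distance in spatial unit i, respectively. The local sensitivity parameters were treated as the regression parameters in Equation (3). By combining Equation (3) with Equation (4), the sensitivity parameters in spatial unit i were estimated by the following formula:

$$\hat{\beta}(u_i, v_i) = \left(X^{T} W_i X\right)^{-1} X^{T} W_i\, y \qquad (5)$$

where $\hat{\beta}(u_i, v_i)$ collects the estimates of $\alpha_i$ and $\lambda_i$, and X and y are the matrices of the observed independent and dependent variables, respectively. Different spatial units have different impacts on the evaluation of the target unit i, and these impacts are quantified in the weight matrix $W_i$. The weight matrix is as follows:

$$W_i = \begin{pmatrix} w_{i1} & 0 & \cdots & 0 \\ 0 & w_{i2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_{in} \end{pmatrix} \qquad (6)$$

where $w_{in}$ represents the weight value between unit n and the target unit i.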
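The log-centering behind the Nakanishi and Cooper transformation can be checked numerically: dividing each value by the geometric mean of its group and taking the logarithm is the same as subtracting the mean of the logarithms. The helper below is a minimal sketch of that step; the name `log_center` is hypothetical, and the input is assumed to be strictly positive (zero probabilities would have to be handled before the logarithm is taken).

```python
import numpy as np

def log_center(values, axis=-1):
    """Return ln(value / geometric mean) along the given axis.

    Because ln(geometric mean) equals the mean of the logs, this is
    computed as the logs minus their mean. All values must be > 0.
    """
    logs = np.log(np.asarray(values, dtype=float))
    return logs - logs.mean(axis=axis, keepdims=True)
```

Applied row by row to the visitation probabilities and distances of one spatial unit, and to the vector of business areas, this yields the dependent variable and the two covariates of the linearized model.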
Here, the weighting scheme $W_i$ is a distance-decay function with a "bell" shape. Many functions can be used for the weighting scheme. According to Fotheringham [38], the bi-square function is computationally more efficient than many other functions. Therefore, a bi-square function is applied in this case. The bi-square function approximates a Gaussian function within the bandwidth and can be expressed as follows:

$$w_{ij} = \begin{cases} \left[1 - \left(d_{ij}/b\right)^2\right]^2, & d_{ij} < b \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $d_{ij}$ and $w_{ij}$ are the distance and weight between units i and j, respectively, and the bandwidth b is the key controlling parameter and is used to exclude the units that are farther away than the distance threshold. Specifically, the bandwidth determines the number of nearby units that are used for evaluating the local sensitivity parameters in the target unit [40].
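To make the local estimation concrete, the sketch below combines a bi-square kernel (weight [1 − (d/b)²]² inside the bandwidth b and zero outside) with the weighted least squares estimator (XᵀWX)⁻¹XᵀWy for a single target unit. It is a simplified illustration using a fixed distance bandwidth; the function names and argument layout are assumptions of this example, and established GWR software should be preferred for real analyses.

```python
import numpy as np

def bisquare_weights(dist, bandwidth):
    """Bi-square kernel: [1 - (d/b)^2]^2 if d < b, otherwise 0."""
    d = np.asarray(dist, dtype=float)
    w = (1.0 - (d / bandwidth) ** 2) ** 2
    return np.where(d < bandwidth, w, 0.0)

def local_estimate(X, y, dist_to_target, bandwidth):
    """Weighted least squares estimate for one target unit.

    X: (n_samples, n_covariates) covariates, e.g. log-centered business
       area and distance; y: (n_samples,) log-centered probabilities;
    dist_to_target: (n_samples,) distance from each sample's unit to
       the target unit.
    """
    w = bisquare_weights(dist_to_target, bandwidth)
    Xw = X * w[:, np.newaxis]  # equals W_i X without building diag(w)
    # Solve (X^T W X) beta = X^T W y rather than inverting explicitly.
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```

Repeating `local_estimate` with each grid cell as the target yields one pair of local parameters per cell.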
Finding the optimal bandwidth is an important step of the local evaluation. The corrected Akaike Information Criterion (AICc), based on the criterion first proposed by Akaike et al., can be used to optimize the bandwidth [41]. Compared to many other indices, the formula of AICc is simpler and can be applied to find the optimal number of neighbors more effectively [40,41]. Therefore, AICc is introduced in this case. The formula of AICc is defined as follows:

$$\mathrm{AICc} = 2n \ln(\hat{\sigma}) + n \ln(2\pi) + n\, \frac{n + \mathrm{tr}(S)}{n - 2 - \mathrm{tr}(S)} \qquad (8)$$

where n is the number of spatial units, $\hat{\sigma}$ is the estimated standard deviation of the error term, and tr(S) is the trace of the hat matrix S. Lower AICc values represent a more suitable bandwidth and better model performance. Through an iterative optimization process, a best-fit bandwidth can be determined by minimizing the value of AICc [42,43].
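A small numerical sketch of the bandwidth selection follows. It assumes the form of AICc commonly used in the GWR literature, 2n ln(σ̂) + n ln(2π) + n(n + tr(S))/(n − 2 − tr(S)), which matches the symbols defined in the text, and it replaces the iterative optimization with a plain search over candidate bandwidths. The function names and the `score` callback are hypothetical.

```python
import math

def aicc(n, sigma_hat, trace_s):
    """Corrected AIC for a GWR fit (form commonly used in the GWR literature).

    n: number of spatial units; sigma_hat: estimated standard deviation
    of the error term; trace_s: trace of the hat matrix S.
    """
    penalty = n * (n + trace_s) / (n - 2 - trace_s)
    return 2 * n * math.log(sigma_hat) + n * math.log(2 * math.pi) + penalty

def select_bandwidth(candidates, score):
    """Return the candidate bandwidth with the lowest score.

    score: function mapping a bandwidth to its AICc; in practice it
    would refit the GWR model for each bandwidth. A grid search is a
    simple stand-in for the iterative optimization in the text.
    """
    return min(candidates, key=score)
```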
In addition to the AICc, the coefficient of determination $R^2$ was also applied in our case for estimating the accuracy of the local sensitivity parameter evaluation. $R^2$ provides a measurement of how well observed outcomes can be replicated by the model. $R^2$ is calculated by the following formula:

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} \qquad (9)$$

where $y_i$, $\hat{y}_i$ and $\bar{y}$ are the observed, estimated and average values of the visitation probability, respectively. Higher values of $R^2$ indicate that a larger proportion of the total variation of the outcomes can be explained by the GWR method.
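For completeness, the coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares) can be computed in a few lines. The function name `r_squared` is an assumption of this example.

```python
def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, estimated))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives 1, and a model that always predicts the mean of the observations gives 0.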
The collinearity among the covariates is a problem that should be considered in the GWR model [44]. Local collinearity may appear when the weight values in nearby units are high and the sample sizes in these units are low. Local variance inflation factors (VIFs) and local condition numbers (local-CNs) were applied to detect the existence of local collinearity. As a general rule proposed by Belsley et al. [45], collinearity may exist for local-CNs that are greater than 30 or VIFs that are greater than 10. In addition to testing the local collinearity problem, the significance of each evaluated sensitivity parameter was checked using the t-test.
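The two diagnostics can be sketched for the two-covariate case considered here. With two covariates, the VIF reduces to 1/(1 − r²), where r is the (kernel-weighted) correlation between them, and the condition number is taken from the weighted design matrix after scaling its columns to unit length. This is one common way of computing local diagnostics and is not necessarily the exact procedure used in the study; the function names are hypothetical.

```python
import numpy as np

def local_vif(x1, x2, w):
    """VIF for two covariates from their weighted correlation."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    w = np.asarray(w, dtype=float) / np.sum(w)
    m1, m2 = np.sum(w * x1), np.sum(w * x2)
    cov = np.sum(w * (x1 - m1) * (x2 - m2))
    var1 = np.sum(w * (x1 - m1) ** 2)
    var2 = np.sum(w * (x2 - m2) ** 2)
    r2 = cov ** 2 / (var1 * var2)
    return 1.0 / (1.0 - r2)

def local_condition_number(X, w):
    """Condition number of the weighted, column-scaled design matrix."""
    Xw = np.asarray(X, dtype=float) * np.sqrt(np.asarray(w, dtype=float))[:, np.newaxis]
    Xw = Xw / np.linalg.norm(Xw, axis=0)  # scale columns to unit length
    return np.linalg.cond(Xw)

def has_local_collinearity(vif, condition_number):
    """Rule of thumb cited in the text: VIF > 10 or condition number > 30."""
    return vif > 10 or condition_number > 30
```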

Capture Estimation
By using the evaluation results of customers' local sensitivities, the captures for feasible locations were estimated to determine the optimal location for a new retail agglomeration. Three steps were included in this process: (1) feasible location identification; (2) visitation probability calculation; and (3) capture estimation. To identify the feasible locations, the areas with important infrastructure, scenic spots and government buildings were first removed based on a land use map. Then, the remaining areas were divided into plots. Each plot was geographically represented by its geometric center and could accommodate a new agglomeration with the maximum business area of 80,000 m². From these plots, we selected three suitable plots as feasible locations A, B and C, as shown in Figure 3.
The visitation probabilities for new retail agglomerations at the feasible locations were then calculated. If a new retail agglomeration is placed at the feasible location F, the probability that customers located in unit i visit the new retail agglomeration can be calculated by:

$$P_{iF} = \frac{B^{\alpha_i} D_{iF}^{\lambda_i}}{B^{\alpha_i} D_{iF}^{\lambda_i} + \sum_{j=1}^{n} B_j^{\alpha_i} D_{ij}^{\lambda_i}} \qquad (10)$$

where $D_{iF}$ is the shortest network distance between unit i and feasible location F, B is the business area of the new retail agglomeration, n is the number of existing retail agglomerations, and $\alpha_i$ and $\lambda_i$ are the local sensitivity parameters in spatial unit i. If the sensitivity parameters are considered to be global over space, $\alpha_1 = \cdots = \alpha_i = \cdots = \alpha$ and $\lambda_1 = \cdots = \lambda_i = \cdots = \lambda$.
Based on the visitation probabilities, the capture of the new retail agglomeration can be calculated as follows:

$$C_F = \sum_{i=1}^{n} Y_i P_{iF} \qquad (11)$$

where $C_F$ is the capture of the new retail agglomeration at feasible location F, n is here the number of spatial units, and $Y_i$ is the buying power of unit i. Buying power in each spatial area can be replaced by the population [5,6]. In recent years, the results of certain studies indicated that geotagged social media data can be used to approximately represent relative population density [13,31,46]. Therefore, in our case, the number of home locations of social media users was treated as the relative buying power. The results of the capture calculation are presented and analyzed in the next section.
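The capture calculation can be sketched as follows: for every spatial unit, the attraction of the new agglomeration is divided by the sum of its own attraction and the attractions of the existing agglomerations, and the resulting probabilities are weighted by buying power and summed. The function names and array layout are assumptions of this example, and the capture symbol and inputs in the test values are illustrative only.

```python
import numpy as np

def new_site_probability(b_new, d_new, b_existing, d_existing, alpha, lam):
    """Probability that each unit visits a new agglomeration.

    b_new: business area of the new agglomeration (scalar)
    d_new: (n_units,) distance from each unit to the feasible location
    b_existing: (n_existing,) business areas of existing agglomerations
    d_existing: (n_units, n_existing) distances to existing agglomerations
    alpha, lam: (n_units,) local sensitivity parameters
    """
    u_new = b_new ** alpha * d_new ** lam
    u_old = b_existing[np.newaxis, :] ** alpha[:, np.newaxis] * d_existing ** lam[:, np.newaxis]
    return u_new / (u_new + u_old.sum(axis=1))

def capture(buying_power, probability):
    """Capture: buying power of each unit times its visit probability, summed."""
    return float(np.dot(buying_power, probability))
```

Evaluating `capture` for each feasible location and taking the largest value identifies the preferred location under these assumptions.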

Results and Analysis
The evaluation results are compared and analyzed in this section. Based on the analysis results, the capture was calculated to determine the optimal location for a new retail agglomeration.

Comparative Analysis of Evaluation Results
To obtain the customers' sensitivities with high accuracy, the evaluation results of local and global sensitivities were compared. Furthermore, the characteristics of the spatial distribution of the local sensitivities were investigated. Based on the sample set extracted from the geotagged social media data, customers' local sensitivities were evaluated by using GWR, and the global sensitivities were evaluated by using ordinary least squares (OLS), a method for estimating unknown parameters in a linear regression model [47]. The global and local evaluation results are shown in Table 3. Both global parameters are significant, with p-values < 0.001. To detect collinearity problems in the GWR, VIFs and local condition numbers (local-CN) were calculated. The VIFs varied from 1.0 to 8.59, and the local-CN values ranged from 2.28 to 9.35. Based on the general rule proposed by Belsley et al. [45], there are no local collinearity problems in the local evaluation. The local sensitivity parameters of business area (α_i) and distance (λ_i) are significant for 21.61% and 90.42% of the samples, respectively, which indicates that most customers tend to care more about distance than about business area.
The evaluation accuracy of customers' local sensitivities is higher than that of the global sensitivities. The R² and AICc values were used to assess the accuracy of the sensitivity evaluation. As shown in Table 3, the R² of the local model is 0.73, which is significantly higher than that of the global model. Additionally, the AICc of the local model is lower than that of the global model. These results indicate that customers' sensitivities in the real world tend to be local. The mean local parameters α_i and λ_i are 1.04 and −1.16, respectively, whereas the global parameters α and λ are 0.97 and −1.04, respectively. The differences between the local and global parameters suggest that the global evaluation may underestimate customers' sensitivities to business area and overestimate the distance parameter (the global λ is closer to zero than the mean local λ_i). Because of their higher accuracy, the local parameters were applied to determine the optimal location for a new retail agglomeration.
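As an illustration of the local evaluation described above, the following minimal sketch fits a weighted least-squares regression at a single location, which is the core step that GWR repeats for every spatial unit. It is not the implementation used in this study: the Gaussian kernel, the fixed bandwidth, the synthetic data and the names `gwr_local_fit`, `coords`, `X` and `y` are all assumptions, and the two regressors merely stand in for the log-transformed business area and distance terms of a linearized Huff model.

```python
import numpy as np

def gwr_local_fit(coords, X, y, point, bandwidth):
    """Weighted least-squares fit at one regression point (hypothetical helper).

    coords: (n, 2) sample coordinates in a projected system (meters assumed).
    X: (n, p) regressors; y: (n,) response.
    Returns the local coefficients [intercept, b_1, ..., b_p].
    """
    coords = np.asarray(coords, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Gaussian kernel: samples close to the regression point get larger weights.
    dist = np.linalg.norm(coords - np.asarray(point, dtype=float), axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    design = np.column_stack([np.ones(len(y)), X])
    xtw = design.T * weights  # scales each sample's column by its weight
    # Solve the weighted normal equations (X'WX) b = X'Wy.
    return np.linalg.solve(xtw @ design, xtw @ y)

# Synthetic, exactly linear data (illustrative values, not the study's data).
rng = np.random.default_rng(0)
coords = rng.uniform(0, 6000, size=(50, 2))
X = rng.normal(size=(50, 2))
y = 0.5 + 1.0 * X[:, 0] - 1.2 * X[:, 1]
beta = gwr_local_fit(coords, X, y, point=(3000, 3000), bandwidth=1500)
print(beta)
```

Because the synthetic response is exactly linear, the local fit recovers the generating coefficients whatever the weights are; with real samples, the coefficients vary from one regression point to another, which is what Figures 4 and 5 map.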
The spatial distributions of the local α_i and λ_i are presented in Figures 4 and 5. As shown in Figure 4, most spatial units with high values of local α_i (from 1.5 to 2.5) have relatively convenient transportation, and customers in these areas are more willing to visit retail agglomerations with large business areas. As shown in Figure 5, the spatial units with low absolute values of λ_i (from 0.0 to −1.0) are located in the northern part of the study area. High absolute values (from −2.8 to −2.0) were found in the units far from the existing retail agglomerations, which indicates that customers living far from retail agglomerations are more sensitive to distance.

Capture Analysis
Based on the evaluated sensitivity parameters, the captures for different feasible locations were calculated and analyzed to determine the optimal location for a new retail agglomeration. For each feasible area in Figure 3, the capture was calculated by using Equations (10) and (11). To analyze the effect of business area, the business area of the new retail agglomeration was set to 40,000, 60,000 and 80,000 m². Additionally, both local and global sensitivity parameters were applied in the capture estimation to reveal the differences between the local and global captures.
The captures for different feasible locations are shown in Table 4. There is a significant difference between the two types of capture. Compared with the local capture, the global capture was higher at location A and lower at locations B and C. The optimal location was determined on the basis of the local capture. Location A maximizes the local capture for a business area of 80,000 m², and location B maximizes the local captures for 40,000 and 60,000 m². Therefore, location A is the optimal location for a new retail agglomeration with a business area of 80,000 m², and location B is the optimal location for 40,000 or 60,000 m².
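The capture calculation can be illustrated with a short sketch. Equations (10) and (11) are not reproduced in this section, so the code assumes the standard Huff form, in which the probability that customers in unit i visit facility j is proportional to S_j^α_i · d_ij^λ_i, and the capture of the new facility is the demand-weighted sum of its visiting probabilities. The function name `huff_capture`, its arguments and the toy numbers are hypothetical and are not taken from Table 4.

```python
def huff_capture(demand, alpha, lam, areas, dists, new_area, new_dists):
    """Expected capture of a new facility under unit-specific Huff parameters.

    demand[i]    : demand weight of spatial unit i (e.g., number of customers)
    alpha[i]     : local business-area parameter of unit i
    lam[i]       : local distance parameter of unit i (negative)
    areas[j]     : business area of existing facility j
    dists[i][j]  : distance from unit i to existing facility j
    new_area     : business area of the candidate facility
    new_dists[i] : distance from unit i to the candidate facility
    """
    total = 0.0
    for i, weight in enumerate(demand):
        # Attraction of the candidate facility as seen from unit i.
        new_u = new_area ** alpha[i] * new_dists[i] ** lam[i]
        # Summed attraction of the existing competitors.
        old_u = sum(s ** alpha[i] * d ** lam[i] for s, d in zip(areas, dists[i]))
        total += weight * new_u / (new_u + old_u)
    return total

# Toy example: two units, two existing facilities, three candidate sizes.
demand = [120.0, 80.0]
alpha = [1.2, 0.9]
lam = [-1.0, -2.0]
areas = [50000.0, 70000.0]
dists = [[3.0, 6.0], [5.0, 2.0]]
for new_area in (40000.0, 60000.0, 80000.0):
    capture = huff_capture(demand, alpha, lam, areas, dists, new_area, [4.0, 4.0])
    print(new_area, round(capture, 2))
```

Passing the same α and λ for every unit reproduces the global case, so the local and global captures of a candidate location can be compared with the same function.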

Conclusions
The development of location-based services has provided considerable opportunities for applying geotagged social media data to locate new retail facilities and agglomerations. In this study, we proposed an improved method for using social media data to solve competitive location problems based on customers' local sensitivities. The results indicate that: (1) our method can locate a new retail agglomeration in a cost-effective way; (2) social media samples can be a reliable data source for the evaluation of customers' local sensitivities; (3) customers far from the existing retail agglomerations may be more sensitive to distance. Based on our study, decision makers can devise more effective strategies to attract different types of customers. For example, for customers who are very sensitive to distance, decision makers can provide more convenient transportation modes.
Most previous studies first extracted suitable samples from location data (such as mobile phone location data, taxi trajectory data and social media data) and then mainly applied the Huff model alone to evaluate customers' global sensitivities [10,11,14]. Our approach differs from these studies and consists of three parts: sample extraction, local sensitivity evaluation and capture estimation. In the local sensitivity evaluation, the Huff model was combined with GWR to evaluate the spatial distribution of customers' local sensitivities accurately and at a high spatial resolution. Based on the evaluation results, the optimal location for a new retail agglomeration can be determined. Our method can be applied to locate retail facilities with large business areas, or retail agglomerations, in areas where a large amount of location data is generated daily.
In future studies, more attention should be paid to alleviating the disadvantages of social media data, and the following challenges should be addressed:

1. Representativeness. Social media services are widely used among young people, so the age structure of social media users differs from that of the general population [18]. Therefore, social media data can only be used as an approximate representation of the population density and customers' behavior in the real world. Our research team will investigate the impact of the representativeness of social media data on competitive location problems.

2. Text. Text information is an important attribute of social media data. People can post text that expresses their feelings and opinions about a retail facility. Therefore, more factors that attract customers can be identified from the text. Based on text analysis, more facility attractions can be added to the competitive location models to further improve the accuracy of the evaluation of customers' sensitivities.

3. Modifiable areal unit problem (MAUP). In our case, 600 m × 600 m grids were applied to divide the study area based on previous studies. Different sizes of spatial units can generate different results, and the optimal size needs to be investigated. In the future, we will examine the effect of the size of spatial units in competitive location problems and identify the best-fit size.

4. Noise filtering. Based on a manual analysis, we investigated the characteristics of noise in the Sina Weibo dataset. The microblogs with particular symbols and "source" values were identified as noise and filtered out. Although this process can filter out noise effectively, it is very time consuming and labor intensive. Machine learning procedures need to be developed to remove noise.

5. Home location extraction. In this case, we applied the method proposed by Qu et al. for extracting the home locations of Sina Weibo users [12]. In the study of Qu et al., the home locations extracted from geotagged social media data were compared with the real homes. Although the accuracy of the proposed method was shown to be higher than that of many other methods in their study, the accuracy was not evaluated on our dataset. In future work, electronic questionnaires will be sent to Sina Weibo users, and the accuracy of this method will be further investigated.

6. Privacy issues. Social media data contain a large amount of personal information (such as registration locations, age, friends and attitudes). Most users are not aware that their posts can be publicly obtained on the Internet and used in published research. More studies are needed to explore the protection of the privacy of social media users and to provide guidance on developing academic ethical standards for the application of social media data.
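To make the spatial-unit question in item 3 concrete, the following sketch shows how the same set of points produces a different set of occupied units when the grid size changes. It is only an illustration under stated assumptions: coordinates are taken to be in a projected system with meter units, and the function name `aggregate_to_grid`, the grid origin and the sample points are hypothetical.

```python
import math
from collections import Counter

def aggregate_to_grid(points, cell_size=600.0, origin=(0.0, 0.0)):
    """Count points per square grid cell.

    points    : iterable of (x, y) coordinates in meters (assumed projected)
    cell_size : edge length of a grid cell in meters
    origin    : lower-left corner of the grid
    Returns a Counter mapping (column, row) to the number of points in that cell.
    """
    counts = Counter()
    for x, y in points:
        col = math.floor((x - origin[0]) / cell_size)
        row = math.floor((y - origin[1]) / cell_size)
        counts[(col, row)] += 1
    return counts

# Illustrative points only; they do not come from the Sina Weibo dataset.
points = [(10.0, 10.0), (599.0, 599.0), (600.0, 0.0), (1300.0, 700.0)]
coarse = aggregate_to_grid(points, cell_size=600.0)
fine = aggregate_to_grid(points, cell_size=300.0)
print(len(coarse), len(fine))
```

Repeating the sensitivity evaluation on samples aggregated with several values of `cell_size` is one simple way to examine how strongly the results depend on the chosen unit.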

Figure 1. Study area and the distribution of the retail agglomerations.

Figure 2. Framework for using social media data to solve competitive location problems.

Figure 3. Feasible areas for a new retail agglomeration in the study area.

Figure 4. Spatial distribution of local sensitivity parameters of business area.

Figure 5. Spatial distribution of local sensitivity parameters of distance.

Table 2. Detailed information about the changes to the dataset.

Table 3. Evaluation results of global and local sensitivity parameters.

Table 4. Global and local captures for different feasible locations.