A Hybrid Population Distribution Prediction Approach Integrating LSTM and CA Models with Micro-Spatiotemporal Granularity: A Case Study of Chongming District, Shanghai

Abstract: Studying population prediction under micro-spatiotemporal granularity is of great significance for modern and refined urban traffic management and emergency response to disasters. Due to the limitations of data collection methods, existing population studies are mostly based on census and statistical yearbook data. However, with the advent of techniques in this information age, newly emerging data sources with fine granularity and large sample sizes have provided rich materials and unique venues for population research. This article presents a new population prediction model with micro-spatiotemporal granularity based on the long short-term memory (LSTM) and cellular automata (CA) models. We aim at designing a hybrid data-driven model with good adaptability and scalability, which can be used in more refined population prediction. We not only integrate these two models, aiming to fully mine the spatiotemporal characteristics, but also propose a method that fuses multi-source geographic data. We tested its functionality using data from Chongming District, Shanghai, China. The results demonstrate that, among all scenarios, the model trained on three consecutive ordinary dates, with a granularity of one hour and incorporating road networks, achieves the best performance (0.905 mean absolute error) and generalization capability.


Introduction
Population distribution prediction refers to estimating the population of a specific geographic unit, taking into account the impact of natural geography and the socioeconomic environment and applying scientific methods (predictive models) to estimate population development in another temporal period [1]. However, traditional population prediction generally has extremely coarse spatiotemporal granularity. With the advent of information handling techniques, new data collection methods, data processing algorithms, and improvements in computing power have made population prediction with micro-spatiotemporal granularity possible. Since ancient times, population estimation and prediction have been essential in human society for policymaking and socioeconomic planning, such as urban and health care planning at different administration levels. Studies on population prediction can be traced back to the United Kingdom in 1695 [2]. After years of exploration and effort by mathematicians, statisticians, demographers, geographers, and other scholars, a series of models for population prediction has been proposed.
In general, population prediction models can be grouped into four categories, given the approaches they adopt: (1) mathematical models, (2) statistical models, (3) demographic models, …
This study uses mobile phone signaling data, an emerging type of population data different from that collected by traditional methods. Mobile phone data enable population prediction with micro-spatiotemporal granularity: the temporal resolution can be as fine as hours or even minutes, and the spatial resolution can be at a sub-1 km scale. Mobile phone signaling data have extensive spatial coverage, high data collection frequencies, long temporal spans, and a spatial grid as a regular statistical unit. Based on the literature, extremely few methods can handle population datasets with such spatiotemporal granularity. LSTM has the ability to solve long-term dependency problems, given its unique design structure. Its design makes it suitable for processing and predicting important events with long intervals and delays in time series, ideal for investigating population dynamics. CA is a dynamic grid model with discrete time, space, and state. Since its spatial interactions and time causality are local-focused, CA is often adopted to simulate the process of the spatiotemporal evolution of complex systems (e.g., the geographical environment). In addition, the concept of cells in CA is suitable for expressing the spatial distribution of the population.
Although there have been some efforts to integrate LSTM and CA models [21,22], these models lack the ability to integrate multi-source geographic information and, to the best of our knowledge, no effort has been made to investigate such integration in predicting the spatial distribution of population. In light of the background of population prediction with micro-spatiotemporal granularity and the unique characteristics of such data, we attempt to build a data-driven population prediction framework by integrating two basic models: LSTM and CA.

Overview of the Study Area
This study uses data from Chongming District (extending from 121°09′30″ E to 121°54′00″ E and from 31°27′00″ N to 31°51′15″ N), a municipal district in Shanghai, China. It borders the Yangtze River to the west and the East China Sea to the east. On the northern side, it borders Haimen City and Qidong City. Towards the south, Chongming District faces three other districts: Pudong, Jiading, and Baoshan (Figure 1). With a total area of 1413 square kilometers, Chongming District is mainly composed of islands, including Chongming Island, Hengsha Island, and Changxing Island. Among these, Chongming Island is the third-largest Chinese island and the world's largest estuary alluvial island [23].
Located at the midpoint of the Chinese coastline, Chongming District sits at the estuary of the Yangtze River, the longest river in China. The terrain of Chongming Island is largely flat, and the elevation of more than 90% of the area is between 3.21 m and 4.20 m (Wusong elevation). According to the 2018 Shanghai Statistical Yearbook, by the end of 2017 the registered population of Chongming District was 675,900, while permanent residents numbered 694,600, including 141,900 migrant residents. In the whole district, the population density is 586 people per square kilometer. The natural population growth rate was −0.51%. In 2017, according to the Statistical Communique of National Economic and Social Development of Chongming District, the per capita disposable income of all residents in the region was 5654 US dollars, and the per capita disposable income of rural residents was 3930 US dollars. The annual GDP was 5.14 billion US dollars, of which the tertiary industry grew by 12.8% over the previous year, reaching a total of 2.66 billion US dollars, equivalent to 51.8% of the region's GDP.

Data Characteristics
The data used in this study are mobile phone signaling data collected by China Unicom, covering the entire Chongming District, Shanghai, with a temporal span from 27 September 2018 to 10 October 2018. As mobile phone signaling data involve sensitive and private information, the actual data have been aggregated to 10 min as the temporal unit and to regular grids of 250 m by 250 m. We were not involved in the data collection and aggregation process, but we can confirm that the data we retrieved have been anonymized. The dataset precisely reflects the number of mobile services, thus serving as an ideal proxy for population distribution. The file size is about 871 MB, with a total of 21,833,490 lines of records. The original data are in text format, with each line storing one record and fields separated by "|" (Figure 2). There are four attribute fields (Table 1).

The original data contain collected time, latitude, and longitude information. Since the integrated model requires serialization characteristics for the input data, we re-ranked the data according to the timestamp and divided them into individual files by day. Finally, 14 text files were obtained, ranging from 27 September 2018 to 10 October 2018, sorted by timestamp. Thus, data of different longitudes and latitudes at the same time are gathered together. It should be noted that all data are referenced to the World Geodetic System 1984 (WGS-84), and their spatial sampling accuracy (250 m × 250 m) is determined during collection. Since it is difficult to determine spatial characteristics of data in text format, we used the Geospatial Data Abstraction Library (GDAL) to convert these text files into raster format to facilitate visual representation.
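As a minimal sketch of this aggregation step, the snippet below bins pipe-delimited records onto a regular grid in pure NumPy. The field order time|lon|lat|count, the cell size in degrees, and the helper name are illustrative assumptions, not the actual schema (given in Table 1) or the GDAL-based pipeline used in the study:

```python
import numpy as np

def aggregate_records(lines, lon0, lat0, cell=0.0025, shape=(4, 4)):
    """Aggregate pipe-delimited signaling records onto a regular grid.
    NOTE: the field order time|lon|lat|count is a hypothetical example;
    `cell` roughly approximates 250 m in degrees of longitude. Proper
    georeferencing would be done with GDAL in practice."""
    grid = np.zeros(shape)
    for line in lines:
        t, lon, lat, count = line.strip().split("|")
        col = int(round((float(lon) - lon0) / cell))
        row = int(round((float(lat) - lat0) / cell))
        grid[row, col] += int(count)
    return grid

# two toy records falling into adjacent 250 m cells
records = ["2018-09-27 08:00|121.40|31.60|12",
           "2018-09-27 08:00|121.4025|31.60|7"]
g = aggregate_records(records, lon0=121.40, lat0=31.60)
```

The resulting array can then be written out as a raster band for visual inspection.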

Overview of the LSTM and CA Models
In this study, the temporal period for data collection is considerably long. Compared with classic time series models, LSTM can benefit the extraction of long-term features in the temporal dimension and the extraction of those features in an unsupervised manner, largely increasing its utility and generalizability. The internal structure of the neural networks is presented in Figure 4. $X_t$ denotes the original input in the current time step. $C_{t-1}$ and $C_t$ represent the memory units of the previous time step and the current time step, respectively. $H_{t-1}$ and $H_t$ represent the hidden state of the previous time step and the current time step, respectively.

As mentioned above, to mitigate the problem of short-term memory, LSTM constructs a special structure inside the networks with memory cells, which can pass the information along the sequence, and the structure can be understood as the "memory" of the networks. It includes three gates: forget gate, input gate, and output gate.

Given time step $t$, the calculating process of the forget gate, input gate, and output gate can be presented as follows:

$$f_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f)$$
$$i_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i)$$
$$o_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o)$$

where $f_t, i_t, o_t \in \mathbb{R}^{n \times h}$ ($n$ is the number of samples and $h$ is the number of hidden units) are the forget gate, input gate, and output gate, respectively. The current time step input is $X_t \in \mathbb{R}^{n \times d}$ ($d$ is the number of features), and the hidden state of the previous time step is $H_{t-1} \in \mathbb{R}^{n \times h}$. $W_{xf}, W_{xi}, W_{xo} \in \mathbb{R}^{d \times h}$ and $W_{hf}, W_{hi}, W_{ho} \in \mathbb{R}^{h \times h}$ are the weight parameters, and $b_f, b_i, b_o \in \mathbb{R}^{1 \times h}$ denote the bias parameters.

At time step $t$, the candidate memory cell $\tilde{C}_t \in \mathbb{R}^{n \times h}$ is calculated as:

$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$$

where $W_{xc} \in \mathbb{R}^{d \times h}$ and $W_{hc} \in \mathbb{R}^{h \times h}$ are the weight parameters, and $b_c \in \mathbb{R}^{1 \times h}$ is the bias parameter. The update of the memory cells is based on the memory cells passed in the previous time step. The combination of the forget gate and the input gate is able to control the flow of internal information:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where the forget gate $f_t$ determines how much information is retained from the memory cell $C_{t-1} \in \mathbb{R}^{n \times h}$, and the input gate $i_t$ determines how much information is added from the candidate memory cell $\tilde{C}_t$.

The hidden state is realized by controlling the flow of internal information of the memory cell through the output gate. The calculating process of the hidden state is as follows:

$$H_t = o_t \odot \tanh(C_t)$$

With the gates mentioned above, LSTM can effectively extract long-term temporal patterns in mobile phone signaling data.
A CA system contains four basic elements: cells, states, neighborhoods, and conversion rules. Cells are the basic objects (or units) of a CA system. Space formed by all cells is called cell space. States of cells (generally discrete) have different choices based on different scenarios. However, the population data in this study are continuous. Conversion rules act on local scopes, and neighboring cells need to be defined to form neighborhoods. The conversion rule is a function that determines the state of the next moment according to the current state of the cell and the states of the neighboring cells. Therefore, this study aims to determine the neighborhood and extract the conversion rules effectively.
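As a toy illustration of these four elements, the sketch below applies a conversion rule over each cell's Moore (R = 1) neighborhood on a continuous-state grid. The mean rule and the zero-padded boundary are purely illustrative assumptions; in this study the conversion rules are extracted by the LSTM rather than specified by hand:

```python
import numpy as np

def ca_step(grid, rule):
    """One CA update: apply `rule` to each cell's Moore (R=1) neighborhood.
    Edges are zero-padded; states are continuous, matching the population
    counts in this study rather than classic discrete CA states."""
    padded = np.pad(grid, 1, mode="constant")
    out = np.empty_like(grid, dtype=float)
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            neigh = padded[r:r + 3, c:c + 3]  # 3x3 Moore neighborhood
            out[r, c] = rule(neigh)
    return out

# illustrative conversion rule: the neighborhood mean
grid = np.arange(9, dtype=float).reshape(3, 3)
nxt = ca_step(grid, lambda n: n.mean())
```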

Model Integration
The integrated model we propose is a data-driven model that can adapt to the spatiotemporal characteristics of mobile phone signaling data. Due to the high collection frequency and long acquisition period, classic time series methods such as ARIMA may not be feasible, as they cannot analyze multi-dimensional data and require data that are stationary, or stationary after differencing. LSTM benefits the extraction of both short-term and long-term temporal features. In addition, the population data are collected at the grid level, similar to the concept of cells in CA. Moreover, the neighborhood and the extraction algorithms of conversion rules in CA have good scalability. Therefore, we integrate these two models to build a population prediction model with micro-spatiotemporal granularity that can fully adapt to the characteristics of mobile phone signaling data. In the process of integrating the LSTM and CA models, we consider two main aspects. First, we need to take advantage of the respective strengths of LSTM in sequence data and CA in spatial simulation. Second, we need to ensure the generalizability of the model. From a macro perspective, the CA model is the upper-level framework of the integrated model, which defines the interaction between the cells in the simulation space through varying definitions of neighborhoods. The LSTM can be seen as an underlying algorithm for automatically extracting conversion rules in CA. In this study, LSTM handles sequential data with the input of a three-dimensional tensor that includes samples (corresponding to the grids), timesteps, and features. Relying on the upper-level CA model, features can incorporate geographical information from various neighboring scenarios. The output of the CA-LSTM-integrated model is a two-dimensional tensor that includes samples (grids) and the corresponding population counts (in this study, mobile user counts).
In this integration framework, CA and LSTM, respectively, handle information from spatial and temporal perspectives. Their integration benefits efficient spatiotemporal data fusion that leads to accurate population prediction. Specifically, there are two considerations: one is to use the cells in the CA to represent the grid. Different definitions of the neighborhood can incorporate more spatial information into the model. The second is to take advantage of the unique mechanism of LSTM for extracting complex time series features in population data. Figure 5 presents the schematic diagram of the integration between LSTM and CA.
In Figure 5, X denotes the original input, while X′ denotes the model input after temporal selection and neighborhood construction. Different uses of t represent different timestamps. C t−1 , C t , H t−1 , and H t are the intermediate inputs and outputs of the LSTM unit during the training process, and y represents the model output.
In general, the integrated model first selects the time series of the original data, then builds features in the attribute dimension according to the definition of the neighborhood in CA. Further, structurally transformed data are fed into the LSTM for training. The calculation at each time step generates new memory cells and hidden states. At the end of the entire training process, LSTM learns a set of parameters inside the model that can be used for later predictions.
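The data preparation described above can be sketched as follows, assuming hypothetical array shapes: per-grid time series are turned into the (samples, timesteps, features) tensor and next-step targets that the LSTM consumes:

```python
import numpy as np

def build_sequences(series, timesteps):
    """Turn per-grid time series into (samples, timesteps, features) windows
    plus next-step targets, matching the tensor shapes described in the text.
    series: (n_grids, n_times, n_features) array; feature 0 is the count."""
    X, y = [], []
    n_grids, n_times, _ = series.shape
    for g in range(n_grids):
        for t in range(n_times - timesteps):
            X.append(series[g, t:t + timesteps])   # one training window
            y.append(series[g, t + timesteps, 0])  # next-step population
    return np.array(X), np.array(y)

# toy data: 2 grids, 10 time steps, 1 feature (population count)
series = np.arange(20, dtype=float).reshape(2, 10, 1)
X, y = build_sequences(series, timesteps=3)
```

With extended neighborhoods, the feature axis would hold the neighboring cells' values instead of a single count.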
Another consideration of this model integration is whether the model itself has sufficient scalability and universality. The structural requirement of the integrated model for the input data is a three-dimensional tensor of the form (sample, timestep, feature), where the timestep corresponds to the information on the time series and the feature corresponds to the definition of the neighborhood in the CA. The training of the integrated model is data-driven, and the information in the temporal and spatial dimensions is implicit in the input data. As no additional controls are required besides the adjustment of the input data, the integrated model has strong scalability and generalizability. In addition, given that internal parameters have acceptable initialization settings, manual adjustments are not required, which demonstrates the robustness and automaticity of the proposed model.

Spatial Extension
The spatial extension of the model in this paper is achieved via the perspective of neighborhood construction. In the experiment, we select three different neighborhoods: the special neighborhood of the central cell, the Moore neighborhood (R = 1), and the extended Moore neighborhood (R = 1) by extracting cell weights based on road networks.
The first neighborhood construction is relatively simple, that is, the choice of the central cell while ignoring its neighborhoods. The Moore neighborhood (R = 1) is a classic method of defining neighborhood, as shown in Figure 6.
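A possible sketch of assembling Moore (R = 1) neighborhood features for each cell is shown below; the zero padding at the grid boundary is an assumption for illustration, as the paper does not state its boundary handling:

```python
import numpy as np

def moore_features(grid, r=1):
    """For each cell, collect the flattened Moore neighborhood (R=1 gives a
    9-dimensional feature vector) to serve as the 'features' axis of the
    LSTM input tensor. Edge cells are zero-padded."""
    padded = np.pad(grid, r, mode="constant")
    rows, cols = grid.shape
    feats = np.empty((rows, cols, (2 * r + 1) ** 2))
    for i in range(rows):
        for j in range(cols):
            feats[i, j] = padded[i:i + 2 * r + 1, j:j + 2 * r + 1].ravel()
    return feats

grid = np.arange(9, dtype=float).reshape(3, 3)
F = moore_features(grid)  # one 9-dimensional feature vector per cell
```

The first (central-cell-only) neighborhood corresponds to keeping just the middle element of each feature vector.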
The combination of the Moore neighborhood (R = 1) and an algorithm for extracting cell weights based on road networks is another way to build neighborhoods. This paper draws on the idea of a "neighborhood decaying with distance" [24–26] from previous studies and uses road network data to ensure that proper spatial information is considered. Figure 7 presents the road networks in Chongming District.

The method of constructing neighborhoods from road networks is based on the assumption that population movements are affected by surrounding areas of interest (such as roads, tourist areas, etc.). With the consideration of road elements, people in a random state have a tendency to move towards the road, as the cells closer to the road have a greater impact on the central cell. To ensure that the total impact weight of all cells is 1, the weights are normalized via the following formula:

$$W_i = \frac{e^{-D_i}}{\sum_{j} e^{-D_j}}$$

where $W_i$ represents the weight of the $i$-th cell, and $D_i$ represents the distance of the $i$-th cell from the nearest road. Given the distance to the nearest road, the weights are first calculated considering an exponential decaying effect, then normalized after the summation.

With the use of the Moore neighborhood (R = 1), road data are integrated into the model, as cell weights of the neighborhood are calculated based on the road network. Such information affects the extraction of spatial features during model training.
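Assuming the exponential-decay form described above, the weight computation can be sketched as follows (the distance values are illustrative, not from the study's road data):

```python
import numpy as np

def road_weights(distances):
    """Normalized neighborhood weights from distances to the nearest road:
    W_i = exp(-D_i) / sum_j exp(-D_j), so all weights sum to 1 and cells
    closer to a road receive larger weights."""
    w = np.exp(-np.asarray(distances, dtype=float))
    return w / w.sum()

# toy distances (in grid units) of the 9 Moore-neighborhood cells to a road
W = road_weights([0.0, 1.0, 2.0, 1.0, 0.5, 1.5, 2.0, 1.0, 0.5])
```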

Experiments and Results
In the first experiment, we selected the central cell without considering its neighborhood. Since the period of the data covers the Chinese National Day (1 October) and ordinary dates, the training was conducted separately to investigate the heterogeneity in the temporal pattern of population changes. In addition, the minimum time granularity of the data (10 min) and different time granularities (i.e., 1 h) were used to show the impact on the generalization effect of the different models. We also considered combining data from multiple days to reveal the longer-term characteristics in the time series. An additional experiment was conducted by applying extended neighborhoods that aim to include spatial information in the model. Figure 8 presents the specific training method.

In total, we trained six models, in which the first four models were constructed based on a single temporal dimension, and the other two were composite models that integrated spatiotemporal perspectives. For purposes of evaluating the generalization effect of the model, we used two evaluation methods: an absolute evaluation and a relative evaluation. The absolute evaluation applies the mean absolute error (MAE) as the evaluation metric:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - f(x_i;\theta)\right|$$

where $y_i$ is the true value, $\theta$ is the training weights, and $f(x_i;\theta)$ is the model output.
Relative evaluation is a baseline assessment. Before training the model, an ideal model based on common sense was established as a reasonable reference point. After comparing the trained model with this baseline model, the relative quality of the model can be judged. Since our research problem targets time series population prediction, the simplest model assumption is that the predicted results at the current time step in the test set are the same as the values at the same time in the training set. In this study, the division of the training set and testing set is unique, given the fact that our model has spatial and temporal dependencies. Therefore, unlike the traditional method of splitting the dataset in a fixed ratio, we train with data from one period of time and predict with data from another period. All implementations in this study are completed in the Python programming language, specifically using the TensorFlow and Keras libraries. In this study, the hyperparameters for all models are the same. We use ReLU as the activation function with the Adam algorithm as the optimizer (default settings).
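The two evaluations can be sketched with toy numbers (illustrative hourly counts, not the study's data): the baseline predicts that the test day repeats the training day's value at the same hour, and MAE measures the absolute error of any prediction:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, the absolute evaluation metric in the text."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# relative (baseline) evaluation: the naive same-time-of-day prediction
train_day = np.array([3.0, 5.0, 9.0, 6.0])  # toy hourly counts, day 1
test_day  = np.array([4.0, 5.0, 8.0, 7.0])  # toy hourly counts, day 2
baseline_error = mae(test_day, train_day)
```

A trained model is judged relatively better when its test-set MAE falls below this baseline error.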

Time Dimension
Neural network training requires a balance between optimization and generalization. One of the focuses of this study is to improve the generalizability of the model, as the application of population prediction models requires a certain degree of robustness. Since neural networks have a strong ability in data fitting, the problem of overfitting often occurs during training. We added an L2 regularization term to the loss function when designing the model to mitigate this problem. Another possible solution is to perform manual adjustment by regulating the number of learning parameters. The number of learning parameters determines the number of features that a model can learn, and the larger the number, the more complex the model. A small number of parameters can lead to underfitting, while a high number can lead to overfitting, which greatly affects the model's generalization ability. Therefore, the adjustment of hyperparameters in studies should primarily target this parameter. Although there are many hyperparameters in the deep learning framework (e.g., batch size and number of epochs), only the number of learning parameters remained adjustable in this experiment, with the other hyperparameters fixed to ensure comparability.
Four models were trained using, respectively, 1 h granularity data from 27 September 2018 (Model 1), 1 h granularity data from 1 October 2018 (Model 2), 10 min granularity data from 27 September 2018 (Model 3), and 1 h granularity data from 27-29 September 2018 (Model 4). The best numbers of learning parameters are set as 10, 10, 18, and 22, respectively, and the baseline errors are 1.245, 1.636, 1.245, and 1.871, respectively. The generalization performance on the test set is shown in Tables 2-5, and Figure 9 presents the generalization error plot of the four models on the test set. The baseline error is calculated from the baseline assessment, while the test error is derived from the predictions of the trained model.

According to Figure 9, as the amount of data increases and the time granularity decreases, the number of optimal learning parameters of the model increases, suggesting that the model needs more parameters to express such characteristics in the time dimension. Comparing Model 1 and Model 2, we found a certain periodic feature in the single-day population data, expressed differently on the ordinary date and on National Day. Despite this discrepancy, the similarity of the within-day population dynamics is notable, and the periodic pattern is expressed more clearly on ordinary dates. Comparing Model 1 and Model 3, we found that a smaller time granularity does not necessarily perform better: the generalization error at 1 h granularity is lower than that at 10 min granularity. On certain occasions, specific characteristics only emerge at a specific granularity, and data at an overly fine time granularity may lose key features. In addition, apparent correlations can arise even in random data; some scholars have found that correlations can be detected even in pure noise [21].
When comparing Model 1 and Model 4, we found that the multi-day data generally contain richer information and have hidden features on the long-term sequence. Given its great performance, the subsequent model training is based on Model 4.
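For concreteness, the difference between the granularity settings can be seen in how a day of data is sliced into supervised training windows for the LSTM (the window length of 6 below is an illustrative choice, not the paper's setting):

```python
import numpy as np

def make_windows(series, n_timesteps):
    """Slice a single cell's population series into supervised samples:
    each input is `n_timesteps` consecutive observations and the target
    is the next observation."""
    X, y = [], []
    for i in range(len(series) - n_timesteps):
        X.append(series[i:i + n_timesteps])
        y.append(series[i + n_timesteps])
    return np.array(X), np.array(y)

hourly = np.arange(24)    # one day at 1 h granularity: 24 steps
ten_min = np.arange(144)  # the same day at 10 min granularity: 144 steps
Xh, yh = make_windows(hourly, 6)
Xm, ym = make_windows(ten_min, 6)
print(Xh.shape, Xm.shape)  # (18, 6) (138, 6)
```

The 10 min series yields far more samples from the same day, which helps explain why Model 3 needs more learning parameters, even though, as noted above, the finer slicing does not guarantee lower generalization error.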

Spatiotemporal Dimension
The above four models consider only the time dimension and do not adopt the concept of neighborhood. In this experiment, we incorporate spatial attribute information into the model from two perspectives: (1) the classic Moore neighborhood (R = 1) (Model 5), and (2) the extended Moore neighborhood (R = 1) combined with cell weights extracted from road networks (Model 6). The best numbers of learning parameters are set as 30 and 35, respectively, and the baseline errors are both 1.871. The generalization errors on the test set are shown in Tables 6 and 7, and Figure 10 presents the errors for all six models.
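A minimal sketch of how a Moore neighborhood (R = 1) can be flattened into per-cell feature vectors is shown below; the zero-padding of the grid border is our assumption for illustration, not necessarily the authors' treatment of boundary cells:

```python
import numpy as np

def moore_features(grid, r=1):
    """For every cell in a 2-D population grid, collect the values of the
    Moore neighborhood of radius `r` (the cell plus its 8 neighbors when
    r = 1) as one feature vector, padding the border with zeros."""
    padded = np.pad(grid, r, mode="constant")
    h, w = grid.shape
    feats = np.empty((h * w, (2 * r + 1) ** 2))
    for i in range(h):
        for j in range(w):
            feats[i * w + j] = padded[i:i + 2 * r + 1,
                                      j:j + 2 * r + 1].ravel()
    return feats

grid = np.arange(9).reshape(3, 3)
print(moore_features(grid).shape)  # (9, 9): 9 cells x 9 neighborhood values
```

Each row can then serve as the per-timestep feature vector of one cell, so the LSTM sees a cell's own history together with that of its neighbors.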

We observed that the quality of the models is further improved with the consideration of spatial information. The spatial assumption of the Moore neighborhood is spatially contiguous, which limits model performance to a certain degree. In comparison, Model 6 fuses road network information, allowing more detailed spatial information to be introduced into the model, thereby further improving the quality of the model. However, the introduction of road network information leads to increased model complexity.
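The road-network weighting step might be sketched as below. The weight matrix here is hypothetical; in the paper, cell weights are extracted from the actual road networks of the study area:

```python
import numpy as np

def weighted_neighbourhood(pop_patch, road_weights):
    """Illustrative weighting: scale each neighborhood cell's population
    by a road-derived weight before feeding the patch to the model.
    `road_weights` stands in for whatever road-network measure (e.g.,
    road length per cell) the weights are extracted from."""
    w = road_weights / road_weights.sum()  # normalize weights to sum to 1
    return pop_patch * w

patch = np.array([[2., 4., 1.],
                  [3., 5., 2.],
                  [1., 2., 3.]])  # 3x3 Moore neighborhood populations
roads = np.array([[0., 2., 0.],
                  [1., 4., 1.],
                  [0., 2., 0.]])  # hypothetical road length per cell
print(weighted_neighbourhood(patch, roads).sum())  # ≈ 3.7
```

Cells with no road connection contribute nothing to the feature vector, while well-connected neighbors are emphasized, which matches the intuition that population flows follow road networks rather than mere adjacency.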

Discussion
Population distribution prediction has long been a hot research topic in geography and other spatial science-related fields. Better knowledge of population distribution is expected to benefit a wide range of domains, including smart cities [27], regional planning [28,29], transportation management [29], and disaster mitigation [30], to list a few. After reviewing traditional population distribution prediction models, we notice that traditional methods, relying largely on structured censuses, tend to have coarse spatiotemporal granularity. In recent years, advances in information handling techniques, the introduction of new data processing approaches, and improved computational power have led to the popularity of population distribution prediction with fine spatiotemporal granularity. Entering the Big Data era, scholars have started to resort to machine learning and automated algorithms, aiming to address the challenges in fine-scale population modeling studies.
LSTM and CA are popular approaches that address temporal and spatial problems, respectively. By design, LSTM excels at predicting events with long intervals and delays in time series, making it an ideal algorithm for population distribution prediction. CA, on the other hand, is capable of simulating spatiotemporal evolution. These complementary features motivated us to explore their integration, with CA as the upper-level framework and LSTM as the underlying algorithm that derives the conversion rules. Taking advantage of the mobile phone signaling data collected by China Unicom, we tested the proposed CA-LSTM-integrated model in predicting the population distribution of Chongming District, Shanghai, under different time granularities, timesteps, and neighborhood scenarios. The results suggest that models with finer time granularity do not guarantee better performance, as data at finer granularity may lose key features, posing challenges for the models during prediction. We also notice that an increase in data input benefits the accuracy of the prediction, which is expected because deep learning models rely on data to establish a robust mapping between the input and the output. As for neighborhood scenarios, models that consider spatial information outperform those that do not, which confirms the important role of spatial dependency in population distribution prediction. In addition, models whose spatial information is characterized by road networks outperform those whose spatial information is characterized by the spatially contiguous Moore neighborhood. This result suggests that population dynamics are more intertwined with road network settings than with mere spatial closeness.
Further efforts can be made to involve other spatial scenarios in population distribution prediction, aiming to better capture the factors that drive population movement.
Unfortunately, the mobile phone signaling data we have cover only 14 days, which, to some degree, limits our investigation. In addition, cellphone counts capture only a certain spectrum of the population: people who do not have access to mobile devices, or who do not carry devices outside, are underrepresented. In future work, we plan to explore the capability of the proposed model in handling longer time periods and different time granularities. The performance of different granularities also needs to be tested with more case studies. We further intend to integrate more geographic information to investigate potential improvements in the model's performance.

Conclusions
Although some efforts exist that integrate LSTM and CA models, these models lack the ability to integrate multi-source geographic information and, to the best of our knowledge, no effort has been made to investigate such integration in predicting the spatial distribution of population. Taking mobile phone signaling data as input and Chongming District of China as the study case, we proposed a model that integrates LSTM and CA to extract population dynamics along spatiotemporal dimensions. The integrated model is constructed in an unsupervised manner, with great scalability and the capability of fusing multi-source geographic data. We also applied different time granularities and varying methods of constructing spatial neighborhoods.
Despite the great performance of the proposed model, challenges remain. In this case study, we observed that the generalization abilities of the trained models differ greatly, indicating that the training method should depend on the research data type and scenario. That is, different training data can lead to different best parameter settings, and the internal parameters of the model should be adjusted appropriately to achieve optimal performance. The best model obtained in this paper is trained on three consecutive ordinary days at one-hour granularity, using the extended Moore neighborhood (R = 1) with cell weights extracted from road networks; its testing error (MAE) over all cells is 0.905. The generality and robustness of the integrated model can be further improved by including more data sources, and, benefiting from the proposed method of neighborhood construction, other data sources can be easily incorporated into our model.