# QUADRIVEN: A Framework for Qualitative Taxi Demand Prediction Based on Time-Variant Online Social Network Data Analysis


## Abstract


## 1. Introduction

- Most proposals treat taxi demand prediction as a regression problem and therefore report quantitative outcomes (e.g., the expected number of pick-ups in a certain area of the city). However, a raw count may not be semantically meaningful in certain scenarios, as it does not by itself refer to a particular contextual situation.
- Some proposals focus on anticipating taxi demand peaks, that is, areas where the number of pick-ups is expected to be much higher than in a normal situation. Nevertheless, few proposals are able to report a drop in demand, even though this information may be just as valuable for operators [8,9].
- Current solutions usually rely on data generated by the taxi service itself (e.g., GPS traces or pick-up and drop-off details) to build the prediction models. This severely limits the scalability of the solutions, as they can only operate in cities whose taxi services are capable of generating and capturing the data required by the models.

## 2. The QUADRIVEN Framework

#### 2.1. Prediction Problem Statement

#### 2.2. Data Description

#### 2.2.1. Region Partitioning

#### 2.2.2. Required Datasets

#### OSN Data

#### Original Taxi Demand Record Data

#### Meteorological Data

#### 2.3. Correlational Study

#### 2.4. Calculation of the Number of Active Users

#### Home User Filtering

**Algorithm 1.** Pseudo-code of the z-score calculation for OSN count data, including home-user data removal.

#### 2.5. Calculation of the Taxi Demand Quantiles

- Firstly, we aggregated the records per taxi zone and hour for each of the dates in the 22-month period. Let us define $t{p}_{d}^{rh}$ as the number of taxi pick-ups at region r at hour h on day d.
- Next, we created a set comprising all the $t{p}_{d}^{rh}$ values for every region and hour. This gave rise to $r\times h$ stratified sets ${\mathcal{TP}}^{rh}$. For example, the set ${\mathcal{TP}}^{4,9}$ comprised all the count values $t{p}_{d}^{4,9}$, i.e., the number of pick-ups at Region #4 at 9:00 a.m. for all the dates d of the original dataset.
- Then, we calculated the lower (${\mathcal{Q}}_{1}^{rh}$) and upper (${\mathcal{Q}}_{3}^{rh}$) quartiles for each ${\mathcal{TP}}^{rh}$ set. Note that these quartiles were calculated for each particular region at a single hour of the day, because the taxi demand profile varied considerably depending on the target region and the hour of the day, as Figure 11 shows. This way, the obtained quartiles represent the low and high boundaries of the taxi demand behavior in a region regardless of seasonal patterns.
- Finally, we mapped each $t{p}_{d}^{rh}$ value to its corresponding quartile range (low, middle, high), defined as $\langle t{r}_{low}^{hr},t{r}_{middle}^{hr},t{r}_{high}^{hr}\rangle $, by means of the following if-then rules:
  - If $t{p}_{d}^{rh}\le {\mathcal{Q}}_{1}^{rh}$, then the assigned label is $t{r}_{low}^{hr}$.
  - If $t{p}_{d}^{rh}>{\mathcal{Q}}_{1}^{rh}\phantom{\rule{4pt}{0ex}}\mathrm{and}\phantom{\rule{4pt}{0ex}}t{p}_{d}^{rh}\le {\mathcal{Q}}_{3}^{rh}$, then the assigned label is $t{r}_{middle}^{hr}$.
  - If $t{p}_{d}^{rh}>{\mathcal{Q}}_{3}^{rh}$, then the assigned label is $t{r}_{high}^{hr}$.
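
As an illustration, the four steps above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the function name `label_pickups`, the `(day, region, hour, pickups)` record layout, the plain string labels, and the use of the inclusive method of `statistics.quantiles` are all assumptions made for the example, since the text does not state which quartile estimator was used.

```python
from collections import defaultdict
from statistics import quantiles


def label_pickups(records):
    """Map each pick-up count to 'low', 'middle' or 'high'.

    `records` is an iterable of (day, region, hour, pickups) tuples
    (a hypothetical layout chosen for this sketch).
    """
    records = list(records)

    # Build the stratified sets TP^{rh}: one list of counts per (region, hour).
    strata = defaultdict(list)
    for _day, region, hour, pickups in records:
        strata[(region, hour)].append(pickups)

    # Lower (Q1) and upper (Q3) quartiles of every stratified set.
    # Assumption: inclusive quartile estimator; each set needs >= 2 values.
    bounds = {}
    for key, counts in strata.items():
        q1, _median, q3 = quantiles(counts, n=4, method="inclusive")
        bounds[key] = (q1, q3)

    # Apply the three if-then rules to every record.
    labels = {}
    for day, region, hour, pickups in records:
        q1, q3 = bounds[(region, hour)]
        if pickups <= q1:
            labels[(day, region, hour)] = "low"
        elif pickups <= q3:
            labels[(day, region, hour)] = "middle"
        else:
            labels[(day, region, hour)] = "high"
    return labels
```

Because the quartiles are computed per (region, hour) pair, the same raw count can be labeled "high" in a quiet zone and "low" in a busy one, which is the contextual behavior the procedure aims for.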

#### 2.6. Composition of the Classifier

- the target region r,
- the current hour of the day h,
- the current day of the week ${d}_{week}\in \langle 0,\dots ,6\rangle $,
- the z-scores ${z}_{FS}^{rh},{z}_{FL}^{rh},{z}_{BK}^{rh}$ of the three OSNs for region r at hour h,
- the current temperature t,
- the current rain level $rl$.
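
The inputs listed above could be packed into a single feature vector along these lines. This is only a sketch under stated assumptions: the class name `FeatureVector`, the field names, the field order, and the range checks are hypothetical and are not taken from the paper.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class FeatureVector:
    """One classifier input for a given region and hour (hypothetical layout)."""

    region: int         # target region r
    hour: int           # current hour of the day h, 0..23
    day_of_week: int    # d_week, 0..6
    z_fs: float         # z-score of the first OSN (FS) for region r at hour h
    z_fl: float         # z-score of the second OSN (FL)
    z_bk: float         # z-score of the third OSN (BK)
    temperature: float  # current temperature t
    rain_level: float   # current rain level rl

    def __post_init__(self):
        # Basic sanity checks on the calendar features.
        if not 0 <= self.hour <= 23:
            raise ValueError("hour must be in 0..23")
        if not 0 <= self.day_of_week <= 6:
            raise ValueError("day_of_week must be in 0..6")

    def as_row(self):
        """Return the features as a flat list, in declaration order."""
        return list(astuple(self))
```

A row built this way can be passed to any of the evaluated classifiers after whatever encoding or scaling the chosen model requires.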

## 3. Evaluation of the Proposal

#### 3.1. Evaluated Models

#### 3.1.1. Conditional Random Fields

#### 3.1.2. Random Forest

#### 3.1.3. Support Vector Machine

#### 3.1.4. Long Short Term Memory Neural Network

#### 3.1.5. Fully Connected Neural Network

#### 3.2. Implementation Details

#### 3.3. Evaluation Settings

#### 3.4. Evaluation Metrics

#### 3.5. Results’ Discussion

#### 3.6. OSN Data Sources Comparison

## 4. Related Work

#### 4.1. Input Data Sources

#### 4.2. Data Mining Methods

#### 4.3. Prediction Outcome

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Di, Q.; Wang, Y.; Zanobetti, A.; Wang, Y.; Koutrakis, P.; Choirat, C.; Dominici, F.; Schwartz, J.D. Air pollution and mortality in the Medicare population. N. Engl. J. Med. **2017**, 376, 2513–2522.
2. Li, B.; Cai, Z.; Jiang, L.; Su, S.; Huang, X. Exploring urban taxi ridership and local associated factors using GPS data and geographically weighted regression. Cities **2019**, 87, 68–86.
3. De Brébisson, A.; Simon, E.; Auvolat, A.; Vincent, P.; Bengio, Y. Artificial Neural Networks Applied to Taxi Destination Prediction. In Proceedings of the 2015th International Conference on ECML PKDD Discovery Challenge (ECMLPKDDDC’15), Porto, Portugal, 7–11 September 2015; Volume 1526, pp. 40–51.
4. Yang, Y.; Yuan, Z.; Fu, X.; Wang, Y.; Sun, D. Optimization Model of Taxi Fleet Size Based on GPS Tracking Data. Sustainability **2019**, 11, 731.
5. Peng, X.; Pan, Y.; Luo, J. Predicting high taxi demand regions using social media check-ins. In Proceedings of the 2017 IEEE International Conference on Big Data, Boston, MA, USA, 11–14 December 2017; pp. 2066–2075.
6. Khezerlou, A.V.; Tong, L.; Street, W.N.; Li, Y. Predicting Urban Dispersal Events: A Two-Stage Framework through Deep Survival Analysis on Mobility Data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 5199–5206.
7. Ishiguro, S.; Kawasaki, S.; Fukazawa, Y. Taxi Demand Forecast Using Real-Time Population Generated from Cellular Networks. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, Singapore, 8–12 October 2018; pp. 1024–1032.
8. Smith, A.W.; Kun, A.L.; Krumm, J. Predicting Taxi Pickups in Cities: Which Data Sources Should We Use? In Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, UbiComp ’17, Maui, HI, USA, 11–15 September 2017; pp. 380–387.
9. Liu, L.; Qiu, Z.; Li, G.; Wang, Q.; Ouyang, W.; Lin, L. Contextualized Spatial-Temporal Network for Taxi Origin-Destination Demand Prediction. IEEE Trans. Intell. Transp. Syst. **2019**, 20, 3875–3887.
10. Hawelka, B.; Sitko, I.; Beinat, E.; Sobolevsky, S.; Kazakopoulos, P.; Ratti, C. Geo-located Twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. **2014**, 41, 260–271.
11. James, N.A.; Kejariwal, A.; Matteson, D.S. Leveraging cloud data to mitigate user experience from ‘breaking bad’. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 3499–3508.
12. Kuang, L.; Yan, X.; Tan, X.; Li, S.; Yang, X. Predicting Taxi Demand Based on 3D Convolutional Neural Network and Multi-Task Learning. Remote Sens. **2019**, 11, 1265.
13. Yao, H.; Wu, F.; Ke, J.; Tang, X.; Jia, Y.; Lu, S.; Gong, P.; Ye, J.; Li, Z. Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2588–2595.
14. Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.J. YFCC100M: The New Data in Multimedia Research. Commun. ACM **2016**, 59, 64–73.
15. Cho, E.; Myers, S.A.; Leskovec, J. Friendship and Mobility: User Movement in Location-Based Social Networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, San Diego, CA, USA, 21–24 August 2011; pp. 1082–1090.
16. Estevez, P.A.; Tesmer, M.; Perez, C.A.; Zurada, J.M. Normalized Mutual Information Feature Selection. IEEE Trans. Neural Netw. **2009**, 20, 189–201.
17. McPherson, G. Statistics in Scientific Investigation: Its Basis, Application, and Interpretation; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
18. Zheng, X.; Han, J.; Sun, A. A Survey of Location Prediction on Twitter. IEEE Trans. Knowl. Data Eng. **2018**, 30, 1652–1671.
19. Lafferty, J.D.; McCallum, A.; Pereira, F.C. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, Williams College, Williamstown, MA, USA, 28 June–1 July 2001; pp. 282–289.
20. Assam, R.; Seidl, T. Context-Based Location Clustering and Prediction Using Conditional Random Fields. In Proceedings of the 13th International Conference on Mobile and Ubiquitous Multimedia (MUM ’14), Melbourne, Victoria, Australia, 25–28 November 2014; pp. 1–10.
21. Genuer, R.; Poggi, J.M.; Tuleau-Malot, C.; Villa-Vialaneix, N. Random Forests for Big Data. Big Data Res. **2017**, 9, 28–46.
22. Cuenca-Jara, J.; Terroso-Saenz, F.; Sanchez-Iborra, R.; Skarmeta-Gomez, A.F. Classification of Spatio-Temporal Trajectories Based on Support Vector Machines. In Advances in Practical Applications of Agents, Multi-Agent Systems, and Complexity: The PAAMS Collection; Demazeau, Y., An, B., Bajo, J., Fernández-Caballero, A., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 140–151.
23. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
24. Tong, Y.; Chen, Y.; Zhou, Z.; Chen, L.; Wang, J.; Yang, Q.; Ye, J.; Lv, W. The Simpler The Better: A Unified Approach to Predicting Original Taxi Demands Based on Large-Scale Online Platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, Halifax, NS, Canada, 13–17 August 2017; pp. 1653–1662.
25. Yan, A.; Howe, B. FairST: Equitable Spatial and Temporal Demand Prediction for New Mobility Systems. arXiv **2019**, arXiv:1907.03827.
26. Markou, I.; Rodrigues, F.; Pereira, F.C. Real-Time Taxi Demand Prediction using data from the web. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 1664–1671.
27. Zhou, Y.; Wu, Y.; Wu, J.; Chen, L.; Li, J. Refined Taxi Demand Prediction with ST-Vec. In Proceedings of the 26th International Conference on Geoinformatics, Kunming, China, 28–30 June 2018; pp. 1–6.
28. Moreira-Matias, L.; Gama, J.; Ferreira, M.; Mendes-Moreira, J.; Damas, L. Predicting Taxi–Passenger Demand Using Streaming Data. IEEE Trans. Intell. Transp. Syst. **2013**, 14, 1393–1402.
29. Jiang, S.; Chen, W.; Li, Z.; Yu, H. Short-Term Demand Prediction Method for Online Car-Hailing Services Based on a Least Squares Support Vector Machine. IEEE Access **2019**, 7, 11882–11891.

**Figure 1.** QUADRIVEN (QUalitative tAxi Demand pRediction based on tIme-Variant onlinE social Network data analysis) overview. In Step 1, the number of active OSN users is extracted from two Online Social Network (OSN) platforms (OSN1 and OSN2) in Areas 1, 2, and 3. Similarly, the actual taxi demand in the same areas and time period is extracted in Step 2. In Step 3, a classifier is developed based on the association between the number of users in an area and the expected short-term taxi demand in that area. In Step 4, the classifier is fed with the number of currently active users in the target regions during a time interval. Finally, in Step 5, the predicted taxi demand is generated as categorical data. Note that the information shown here is simplified for illustration purposes.

**Figure 2.** Evolution of the number of active users per day on Foursquare over a two-year period in Manhattan (New York City). Vertical red lines represent breakout points in the time series calculated with the E-Divisive with Median (EDM) algorithm [11].

**Figure 3.** Architecture of QUADRIVEN. In the training stage, historical data from n different OSNs are used. From these data, the z-scores associated with the number of active users per taxi zone are calculated in a batch process. Those scores, along with historical weather data, compose the independent variables of the training dataset. The dependent variable (label) is generated by extracting the quantile ranges of the taxi demand per taxi zone. Once the model has been trained, it takes the z-scores of the active users over the last d days along with the current weather conditions to compose the qualitative taxi demand prediction per hour and taxi zone.

**Figure 5.** Distribution of the Flickr dataset. (**a**) Spatial distribution; (**b**) Temporal evolution. The uppermost plot shows the raw time series, whereas the lower ones depict its decomposition into trend, seasonal, and noise components.

**Figure 6.** Distribution of the Foursquare dataset. (**a**) Spatial distribution; (**b**) Temporal evolution. The uppermost plot shows the raw time series, whereas the lower ones depict its decomposition into trend, seasonal, and noise components.

**Figure 7.** Distribution of the Brightkite dataset. (**a**) Spatial distribution; (**b**) Temporal evolution. The uppermost plot shows the raw time series, whereas the lower ones depict its decomposition into trend, seasonal, and noise components.

**Figure 8.** Distribution of the taxi demand dataset. (**a**) Spatial distribution; (**b**) Temporal evolution. The uppermost plot shows the raw time series, whereas the lower ones depict its decomposition into trend, seasonal, and noise components.

**Figure 9.** Rate of home users per OSN and zone along with the averaged rates comprising all the OSNs.

**Figure 10.** Spatial distribution of home user locations per taxi zone in the three OSNs. (**a**) Foursquare home user rates; (**b**) Flickr home user rates; (**c**) Brightkite home user rates.

**Figure 11.** Boxplots of the pick-ups per taxi zone at three particular hours of the day. The top side of each box represents the upper quantile ${\mathcal{Q}}_{3}^{rh}$ of a region, whereas the bottom side stands for the lower one ${\mathcal{Q}}_{1}^{rh}$. (**a**) 8:00; (**b**) 15:00; (**c**) 21:00.

**Figure 12.** Average F1 score per zone and hour of each of the evaluated models considering both the OSN data and the taxi demand as input. (**a**) F1 score of QUADRIVEN models; (**b**) F1 score using taxi demand data.

**Figure 13.** F1 score per taxi zone for each of the evaluated models. (**a**) FCNN F1 score per taxi zone; (**b**) LSTM F1 score per taxi zone; (**c**) RF F1 score per taxi zone; (**d**) SVM F1 score per taxi zone; (**e**) CRF F1 score per taxi zone.

**Figure 14.** Average F1 score per hour of the day of each of the evaluated models considering both OSN and taxi demand data as input. (**a**) QUADRIVEN models; (**b**) Taxi-demand models.

**Figure 15.** F1 score of the original LSTM model integrating the three OSN platforms and the alternative version considering only FS as OSN data.

| | Flickr | Foursquare | Brightkite |
|---|---|---|---|
| Number of users | 5576 | 4531 | 2630 |
| Number of documents | 244,464 | 628,941 | 70,642 |

| OSN | Taxi Demand |
|---|---|
| Flickr | 0.9895 |
| Foursquare | 0.9871 |
| Brightkite | 0.9749 |

| Model | Parameter | Value |
|---|---|---|
| CRF | Training algorithm | Gradient descent |
| CRF | L1 regularization coeff. | 0.1 |
| CRF | L2 regularization coeff. | 0.1 |
| CRF | Max. iterations | 1000 |
| RF | Number of estimators | 12,000 |
| RF | Max. depth | 1100 |
| SVM | Kernel | Radial Basis Function (RBF) |
| SVM | Gamma | 0.001 |
| SVM | C | 1000 |
| FCNN | Number of layers | 8 |
| FCNN | Number of neurons per layer | 128 |
| FCNN | Activation function | ReLU |
| LSTM | Number of layers | 3 |
| LSTM | Number of neurons per layer | 50 |
| LSTM | Activation function | ReLU |

**Table 4.** Confusion matrix of the five models under study. The best rates per model are marked in bold. The grayed cells contain the highest rates per true label.

| Model | Predicted Taxi Range | True $t{r}_{high}$ | True $t{r}_{middle}$ | True $t{r}_{low}$ |
|---|---|---|---|---|
| RF | $t{r}_{high}$ | 0.763 | 0.232 | 0.005 |
| RF | $t{r}_{middle}$ | 0.114 | 0.768 | 0.117 |
| RF | $t{r}_{low}$ | 0.010 | 0.207 | 0.783 |
| SVM | $t{r}_{high}$ | 0.717 | 0.270 | 0.013 |
| SVM | $t{r}_{middle}$ | 0.059 | 0.792 | 0.149 |
| SVM | $t{r}_{low}$ | 0.100 | 0.161 | 0.829 |
| FCNN | $t{r}_{high}$ | 0.701 | 0.293 | 0.006 |
| FCNN | $t{r}_{middle}$ | 0.088 | 0.797 | 0.115 |
| FCNN | $t{r}_{low}$ | 0.011 | 0.248 | 0.741 |
| LSTM | $t{r}_{high}$ | 0.770 | 0.228 | 0.002 |
| LSTM | $t{r}_{middle}$ | 0.059 | 0.840 | 0.100 |
| LSTM | $t{r}_{low}$ | 0.000 | 0.152 | 0.846 |
| CRF | $t{r}_{high}$ | 0.561 | 0.435 | 0.004 |
| CRF | $t{r}_{middle}$ | 0.627 | 0.344 | 0.028 |
| CRF | $t{r}_{low}$ | 0.554 | 0.281 | 0.165 |

**Table 5.** Key features of existing prediction models for mobility services. The acronyms are as follows: LR, Logistic Regression; MLP, Multi-Layer Perceptron; CNN, Convolutional Neural Network; LSTM, Long Short Term Memory Network; AE, Auto Encoder; DT, Decision Trees; GP, Gaussian Process; SVR, Support Vector Regression; EL, Ensemble Learning; ConvLSTM, Convolutional Long Short Term Memory Network; LS-SVM, Least Squares Support Vector Machine.

Reference | Data Sources | Data Mining Method | Prediction Target | |||
---|---|---|---|---|---|---|

Temporal | Spatial | Meteorological | Primary Input Data | |||

[24] | ✓ | ✓ | ✓ | taxi demand | regression/LR | quantitative taxi demand |

[3] | ✓ | taxi GPS traces | regression/MLP | taxi destination | ||

[13] | ✓ | ✓ | ✓ | taxi demand | regression/CNN, LSTM | quantitative taxi demand |

[7] | ✓ | ✓ | taxi demand and CDRs | regression/AE | quantitative taxi demand | |

[6] | ✓ | ✓ | taxi demand | regression/CNN | taxi demand peaks | |

[25] | ✓ | ✓ | ✓ | bike demand | regression/CNN | quantitative bike demand |

[2] | ✓ | ✓ | taxi demand and social media | regression/DT | quantitative taxi demand | |

[26] | taxi demand and event data | regression/LR, GP | quantitative taxi demand | |||

[12] | ✓ | ✓ | ✓ | taxi demand | regression/CNN, LSTM | quantitative taxi demand |

[27] | ✓ | taxi demand | regression/SVR | quantitative taxi demand | ||

[8] | ✓ | taxi demand and social media | regression/DT | quantitative taxi demand | ||

[28] | ✓ | ✓ | taxi GPS traces | regression/EL | quantitative taxi demand | |

[9] | ✓ | ✓ | ✓ | taxi demand | regression/ConvLSTM | quantitative taxi demand |

[29] | ✓ | taxi demand | regression/LS-SVM | quantitative taxi demand | ||

QUADRIVEN | ✓ | ✓ | social media | classification | qualitative taxi demand |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Terroso-Saenz, F.; Muñoz, A.; Cecilia, J.M.
QUADRIVEN: A Framework for Qualitative Taxi Demand Prediction Based on Time-Variant Online Social Network Data Analysis. *Sensors* **2019**, *19*, 4882.
https://doi.org/10.3390/s19224882
