A Machine Learning Framework for Harvesting and Harmonizing Cultural and Touristic Data
Abstract
1. Introduction
- (i) We propose an integrated data lake framework that meets several requirements of the CI, CH and tourism sectors: (a) acquiring cultural data from diverse online endpoints, (b) storing and managing information in properly configured NoSQL data collections, (c) homogenizing data by combining multiple string similarity measures with geolocation data, (d) identifying additional PoIs with NER language models, and (e) mining opinions from user reviews with sentiment analysis pipelines. The framework exploits the augmented information to derive touristic and cultural routes based on the personal experiences of people who have already traveled there.
- (ii) We introduce the architectural approach behind the framework, discuss the technologies used to produce the aforementioned services, and illustrate how the encapsulated modules are coordinated. In addition, we provide detailed descriptions of the developed services, including the machine-driven data harvesting, the streamlined data homogenization, and the ML-powered augmentation of the data.
- (iii) We conduct several experiments to assess the performance of the data homogenization stage and to evaluate the improved NER language models against baselines, demonstrating their robustness. Sentiment-analyzed data are used to create visualizations, in order to assess their potential to generate useful insights.
2. Related Work
2.1. Data Collection Employing Crawling Methods
2.2. Homogenization Approaches
2.3. Feature and Entity Extraction Using NLP Techniques
3. Data Harvesting
- The Social Media Harvesting unit traverses public social media pages, identifies the targeted information, extracts it, and stores it in a suitably structured database. It encompasses task-specific web scraping modules termed spiders, and a coordination unit which initiates scraping tasks, delegates each task to the appropriate spider, and receives the harvested data for analysis, structuring, and storage. If additional sources of information are identified in the retrieved content, the coordination unit launches additional scraping tasks. The Social Media Harvesting unit is described in detail in Section 3.1.
- The Thematic Harvesting unit performs web scraping on the clear web, discovering additional assets related to a specified topic of interest (CH and tourism in our case), by implementing focused crawling. The Thematic Harvesting unit is discussed in detail in Section 3.2.
3.1. Social Media Harvesting Unit
| Algorithm 1 Social Media Web Scraping Procedure |
| 1: Input: Set of spiders $S$, initial sets of URLs $U_s$ for each $s \in S$ |
| 2: Output: Set of PoIs $P$, set of reviews $R$ |
| 3: for each spider $s \in S$ do |
| 4: for each url $u \in U_s$ relevant to $s$ do |
| 5: Scheduler.enqueue($u$) |
| 6: end for |
| 7: while Scheduler not empty do |
| 8: $u \leftarrow$ Scheduler.dequeue() |
| 9: Forward $u$ to $s$ |
| 10: $page \leftarrow$ download($u$) |
| 11: $poi$, $u_{next} \leftarrow s$.parse($page$) |
| 12: Store processed $poi$ in MySQL DB |
| 13: $P \leftarrow P \cup \{poi\}$ ▹ Push the retrieved object in the set of PoIs |
| 14: $U_{rev} \leftarrow$ reviewURLs($poi$) ▹ Generate PoI review URL using unique venue name |
| 15: for each $u_r \in U_{rev}$ do |
| 16: $page_r \leftarrow$ download($u_r$) |
| 17: $rev$, $rel \leftarrow s$.parse($page_r$) ▹ Also, find the relationships between PoI and PoI review |
| 18: Store $rev$, $rel$ in MySQL DB |
| 19: $R \leftarrow R \cup \{rev\}$ ▹ Push the retrieved review in the set of reviews |
| 20: end for |
| 21: Scheduler.enqueue($u_{next}$) |
| 22: end while |
| 23: Transform $P$, $R$: columnar → JSON |
| 24: Store transformed data in MongoDB |
| 25: end for |
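The control flow of Algorithm 1 can be illustrated with a small, self-contained Python sketch. It is not the framework's implementation: the in-memory `FAKE_SITE` dictionary, the example URLs, the venue names, and the `scrape` function are hypothetical stand-ins for the download and parse steps, and the two lists stand in for the database tables.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (PoI fields, review texts, next URL).
# All names and values are made up for illustration.
FAKE_SITE = {
    "https://example.org/venues?page=1": (
        {"name": "Old Castle"},
        ["Great view", "Too crowded"],
        "https://example.org/venues?page=2",
    ),
    "https://example.org/venues?page=2": (
        {"name": "City Museum"},
        ["Worth a visit"],
        None,
    ),
}


def scrape(seed_urls, site=FAKE_SITE):
    """Drain a FIFO scheduler, collecting PoIs and their reviews."""
    scheduler = deque(seed_urls)
    seen = set(seed_urls)
    pois, reviews = [], []
    while scheduler:
        url = scheduler.popleft()
        # Stands in for "download the page and let the spider parse it".
        poi, poi_reviews, next_url = site[url]
        pois.append(poi)
        for text in poi_reviews:
            # Keep the PoI-review relationship through the venue name.
            reviews.append({"poi": poi["name"], "text": text})
        # Newly discovered URLs go back to the scheduler exactly once.
        if next_url and next_url not in seen:
            seen.add(next_url)
            scheduler.append(next_url)
    return pois, reviews


pois, reviews = scrape(["https://example.org/venues?page=1"])
print(len(pois), "PoIs and", len(reviews), "reviews collected")
```

The `seen` set is a design choice of this sketch: it prevents the same URL from being scheduled twice when pages link to each other.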
1. The Scrapy Engine starts by receiving the initial scraping requests from one of the available spiders. Each spider defines a customized crawling procedure, parses the website structure, and extracts the relevant data.
2. The Engine sends these requests to the Request Scheduler, which queues them and supplies them back as needed, ensuring a continuous flow of tasks.
3. The scheduled requests are then forwarded back to the Scrapy Engine.
4. The Engine passes the requests to the Downloader, which retrieves the web pages via the Downloader Middleware. This middleware manages all outgoing requests and incoming responses between the Engine and Downloader.
5. Once a page is retrieved, the Downloader sends the response back to the Engine through the Downloader Middleware.
6. The Engine routes the response through the Spider Middleware to the appropriate Spider. The Spider Middleware facilitates handling of responses, requests, and scraped items between the Engine and Spiders.
7. The Spider renders the page, simulates user interactions if necessary (e.g., scrolling down to fetch additional content), navigates through the content, identifies the target data fields, and extracts the information. The operation of spiders is described in more detail in Section 3.1.1.
8. The extracted data are processed and converted to structured item objects.
9. The Spider sends the item objects and any new requests back to the Engine via the Spider Middleware.
10. The Engine forwards the items to the Item Pipeline, a sequence of modules that validate and process each item before storage.
11. Processed items are initially stored in a MySQL relational database.
12. The Engine continues requesting the next items from the Scheduler, repeating the process until all scheduled requests are completed.
13. Finally, all items pass through the Data Management Middleware, which performs data transformation, converting the data from columnar form to JSON format. Once transformed, the data are stored in the MongoDB NoSQL database.
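The final transformation step, from columnar rows to JSON documents, might look roughly like the sketch below. The column names, the sample rows, and the decision to nest reviews inside their PoI document are assumptions made for illustration; the actual schema used by the Data Management Middleware is not specified here.

```python
import json


def rows_to_documents(columns, rows):
    """Turn relational rows (tuples) into JSON-ready dictionaries."""
    return [dict(zip(columns, row)) for row in rows]


def attach_reviews(pois, reviews, key="poi_id"):
    """Nest each PoI's reviews inside its document (an assumed layout)."""
    by_poi = {}
    for review in reviews:
        # Drop the foreign key; the nesting itself carries the relationship.
        by_poi.setdefault(review[key], []).append(
            {k: v for k, v in review.items() if k != key}
        )
    return [{**poi, "reviews": by_poi.get(poi["id"], [])} for poi in pois]


# Hypothetical rows, as they might be read from the relational tables.
poi_docs = rows_to_documents(
    ("id", "name", "rating"),
    [(1, "Old Castle", 4.5), (2, "City Museum", 4.0)],
)
review_docs = rows_to_documents(
    ("poi_id", "text"),
    [(1, "Great view"), (1, "Too crowded")],
)
documents = attach_reviews(poi_docs, review_docs)
print(json.dumps(documents[0], ensure_ascii=False))
```

Each resulting dictionary can be serialized with `json.dumps` or handed to a document store client as-is.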


3.1.1. Spiders
Facebook Spiders
TripAdvisor Spiders
GMaps Venue URLs Spider
3.1.2. Web Scraping Issues
3.1.3. Ethical and Legal Concerns
3.2. Thematic Harvesting Unit
- the title_regex examines the HTML title tag of a page to check for matches with a given pattern;
- the url_regex tests the page URL against a list of predefined regular expressions; and
- the body_regex attempts to match patterns within the HTML content of the page.
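The three rules above can be sketched with Python's standard `re` module. The `PageFilter` class, its `matches` method, and the example patterns are hypothetical; only the rule names `title_regex`, `url_regex`, and `body_regex` come from the text. How the crawler combines the three outcomes is not stated, so the sketch simply reports which rules fired.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PageFilter:
    title_regex: str
    url_regex: list = field(default_factory=list)  # list of URL patterns
    body_regex: str = ""

    def matches(self, url, html):
        """Return the names of the rules that fire for this page."""
        fired = []
        # title_regex: inspect the content of the HTML <title> tag.
        title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
        if title and re.search(self.title_regex, title.group(1), re.I):
            fired.append("title_regex")
        # url_regex: test the page URL against each predefined pattern.
        if any(re.search(p, url, re.I) for p in self.url_regex):
            fired.append("url_regex")
        # body_regex: look for the pattern anywhere in the HTML content.
        if self.body_regex and re.search(self.body_regex, html, re.I):
            fired.append("body_regex")
        return fired


page_filter = PageFilter(
    title_regex=r"museum|heritage",
    url_regex=[r"/culture/", r"/tourism/"],
    body_regex=r"archaeolog\w+",
)
html = (
    "<html><head><title>City Museum</title></head>"
    "<body>An archaeological collection.</body></html>"
)
fired = page_filter.matches("https://example.org/culture/museum", html)
print(fired)
```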
4. Data Homogenization
- Harvested PoIs are grouped based on their website URL address.
- For venues without an associated website, we apply a two-step process: (1) Initially, we identify the k nearest neighbors of a given PoI based on geographical coordinates. (2) Subsequently, for each candidate within the identified radius, we compute string similarity measures between venue names (using the method in [33]), selecting the most likely match(es) that correspond to the same PoI.
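The two-step process for venues without a website can be illustrated as follows. This is a minimal sketch under stated assumptions: the haversine distance, `difflib.SequenceMatcher` as the single similarity measure, the threshold of 0.8, and all venue data are stand-ins chosen for illustration, not the combination of measures from [33] that the framework uses.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def k_nearest(poi, candidates, k=3):
    """Step 1: keep the k geographically closest candidates."""
    return sorted(
        candidates,
        key=lambda c: haversine_km(poi["lat"], poi["lon"], c["lat"], c["lon"]),
    )[:k]


def name_similarity(a, b):
    """Stand-in string similarity in [0, 1] on lower-cased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(poi, candidates, k=3, threshold=0.8):
    """Step 2: pick the most similar name among the neighbours, if any."""
    scored = [(name_similarity(poi["name"], c["name"]), c) for c in k_nearest(poi, candidates, k)]
    if not scored:
        return None
    score, match = max(scored, key=lambda pair: pair[0])
    return match if score >= threshold else None


# Made-up venues and coordinates.
main_pois = [
    {"name": "Old Castle Museum", "lat": 40.0010, "lon": 20.0010},
    {"name": "Harbour Cafe", "lat": 40.0012, "lon": 20.0011},
    {"name": "Old Castle Parking", "lat": 40.0020, "lon": 20.0020},
    {"name": "Old Castle Museum", "lat": 41.0, "lon": 21.0},  # distant namesake
]
new_poi = {"name": "The Old Castle Museum", "lat": 40.0011, "lon": 20.0010}
match = best_match(new_poi, main_pois)
print(match)
```

Restricting the comparison to nearby candidates first keeps the distant namesake out of the name comparison altogether.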
4.1. Website-Based Grouping
- For numeric values representing counts, such as the number of check-ins or the number of ratings, the sum of individual values is computed.
- For numeric averaged values, such as the overall rating score, a new value is computed as an average over the grouped records, where $v_i$ is the value of record $i$.
- For values not falling in the above categories, the following methods can be applied:
- majority voting, where the most frequently occurring value is maintained,
- value timestamp, where the value associated with the most recently updated source is selected,
- manual review, where a human operator reviews values and selects the correct value.
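A few of these merging rules can be sketched in Python, covering the count rule and two of the automatic rules for other values. The field names (`check_ins`, `category`, `phone`, `updated_at`) and the sample records are hypothetical, and ISO-formatted date strings are assumed so that they sort chronologically.

```python
from collections import Counter


def merge_counts(values):
    """Counts (check-ins, number of ratings): sum the individual values."""
    return sum(values)


def majority_vote(values):
    """Keep the most frequently occurring value."""
    return Counter(values).most_common(1)[0][0]


def most_recent(records, field_name):
    """Keep the value from the most recently updated source."""
    return max(records, key=lambda r: r["updated_at"])[field_name]


# Hypothetical records of one PoI, harvested from three sources.
records = [
    {"category": "museum", "phone": "000-1111", "check_ins": 120, "updated_at": "2024-01-10"},
    {"category": "gallery", "phone": "000-2222", "check_ins": 30, "updated_at": "2024-06-02"},
    {"category": "museum", "phone": "000-1111", "check_ins": 5, "updated_at": "2023-11-20"},
]
merged = {
    "check_ins": merge_counts(r["check_ins"] for r in records),
    "category": majority_vote(r["category"] for r in records),
    "phone": most_recent(records, "phone"),
}
print(merged)
```

Which rule applies to which field is a configuration decision; the assignment shown here is only an example.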
4.2. Closest Neighbors Candidate Matching
4.2.1. Blocking
4.2.2. Approximate Similarity Measures
4.2.3. Homogenization Outcome
1. cand_id, an integer greater than zero that refers to the primary key of the candidate PoI in DB1 (the “main” venues table)
2. cand_name, a string that denotes the name of the candidate PoI in DB1
3. cand_lon, a float in the range $[-180, 180]$ that corresponds to the longitude of the candidate PoI in DB1
4. cand_lat, a float in the range $[-90, 90]$ that corresponds to the latitude of the candidate PoI in DB1
5. id, an integer that refers to the primary key of the matching PoI in DB2, linking the specific PoI from the main database to its corresponding potential match in the list of newly harvested PoIs
6. name, a string that denotes the name of the PoI in DB2
7. lon, the longitude of the PoI in DB2
8. lat, the latitude of the PoI in DB2
9. <similarity measures>, a list of real-valued scores in the interval $[0, 1]$, where each element corresponds to the similarity value produced by a specific string similarity measure applied to the candidate pair of entries
10. <similarity measures>_status, a list of Boolean values, where each element indicates whether the corresponding similarity score exceeds its threshold (Table 4)
11. on_translit, a Boolean flag indicating whether the similarity scores were computed on the transliterated PoI names or on their original forms
12. status, a Boolean value representing the final matching decision for the candidate pair, obtained by combining the outcomes of all applied similarity measures
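To make the record layout concrete, the sketch below assembles one such record for a candidate pair. Several parts are assumptions: the two similarity measures (`ratio`, `jaccard`), their thresholds, and the rule that the final `status` is true only when every measure exceeds its threshold are illustrative choices, since the actual measures, the values of Table 4, and the combination rule are not reproduced here.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_tokens(a, b):
    """Token-set overlap in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical measures and thresholds (not the values of Table 4).
MEASURES = {"ratio": (ratio, 0.8), "jaccard": (jaccard_tokens, 0.5)}


def build_record(cand, poi, on_translit=False):
    """Assemble one homogenization record for a candidate pair."""
    record = {
        "cand_id": cand["id"], "cand_name": cand["name"],
        "cand_lon": cand["lon"], "cand_lat": cand["lat"],
        "id": poi["id"], "name": poi["name"],
        "lon": poi["lon"], "lat": poi["lat"],
    }
    flags = []
    for label, (measure, threshold) in MEASURES.items():
        score = measure(cand["name"].lower(), poi["name"].lower())
        record[label] = score
        record[f"{label}_status"] = score > threshold
        flags.append(record[f"{label}_status"])
    record["on_translit"] = on_translit
    # Assumed combination rule: every measure must exceed its threshold.
    record["status"] = all(flags)
    return record


cand = {"id": 1, "name": "Old Castle Museum", "lon": 20.001, "lat": 40.001}
same = build_record(cand, {"id": 17, "name": "old castle museum", "lon": 20.001, "lat": 40.001})
other = build_record(cand, {"id": 18, "name": "Harbour Cafe", "lon": 20.002, "lat": 40.002})
print(same["status"], other["status"])
```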
4.2.4. PoI Record Completeness and Correctness
5. Data Augmentation
5.1. Automatic Translation Component
5.2. Entity Extraction Component
5.2.1. NER Model Evaluation and Selection
- The Type parameter of the model indicates the NLP tasks that can be supported by the model. More specifically, core-type models implement a versatile pipeline that offers tagging, parsing, lemmatization and NER, whereas dep-type models can be used only for tagging, parsing and lemmatization;
- The Genre parameter of the model indicates the corpus category on which the model is trained, which in turn affects the suitability of the model for application domains. More specifically, the corpora categories are web (the corpus comprises generic web content) and news (the corpus is limited to material harvested from news agencies, sites etc.);
- The Size parameter of the model indicates either the volume of data of the model, with available versions being small (sm), medium (md) and large (lg), or that the model is a transformer-based one (trf).
- The small-size model (encoded as en_core_web_sm),
- The medium-size model (encoded as en_core_web_md),
- The large-size model (encoded as en_core_web_lg), and
- The transformer-based model (encoded as en_core_web_trf).
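The naming scheme described above can be unpacked programmatically. The helper below is a hypothetical convenience, not part of spaCy: it splits a model name into its language, Type, Genre, and Size components and applies the rule stated earlier that only core-type pipelines include NER.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str
    kind: str   # the "Type" parameter: core or dep
    genre: str  # web or news
    size: str   # sm, md, lg or trf


def parse_model_name(name):
    """Split a name such as 'en_core_web_sm' into its four components."""
    lang, kind, genre, size = name.split("_")
    return ModelName(lang, kind, genre, size)


def supports_ner(name):
    """Per the description above, only core-type pipelines offer NER."""
    return parse_model_name(name).kind == "core"


for model in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg", "en_core_web_trf"):
    print(model, parse_model_name(model), supports_ner(model))
```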
5.2.2. Improving Model Accuracy
5.2.3. Evaluation of the NER Models
- Precision is defined as the fraction of correct named entities identified by the NER model, i.e., those matching the Golden Annotation Standard (GAS), over the total number of entities predicted by the model. The denominator consists of the sum of correct entities, incorrect entities (those not matching the GAS), and spurious entities (those falsely identified as matches). Formally, this can be expressed as: $\text{Precision} = \frac{TP}{TP + FP}$. Note that $TP$ stands for True Positives, while $FP$ stands for False Positives.
- Recall is defined as the fraction of correctly identified named entities relative to the total number of entities that should have been identified according to the GAS. The denominator includes the sum of correct entities, incorrect entities, and missing entities (i.e., entities not detected by the model). Formally, this can be expressed as: $\text{Recall} = \frac{TP}{TP + FN}$, where $TP$ stands for True Positives. Note that $FN$ stands for False Negatives.
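- F-Measure is the harmonic mean of Precision and Recall, and thus takes into account both False Positives ($FP$) and False Negatives ($FN$). Formally, it can be expressed as: $\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$.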
- Correctly identified is defined as the fraction of correct named entities identified by the NER model (i.e., those matching the GAS), over the total number of entities that appear in the document and are either correctly detected, incorrectly detected, or not detected at all. The denominator consists of the sum of correct entities, incorrect entities (those not matching the GAS), and missing entities (those not detected by the model). Formally, this can be expressed as: $\text{Correctly identified} = \frac{\text{correct}}{\text{correct} + \text{incorrect} + \text{missing}}$.
- Falsely identified is defined as the fraction of named entities falsely recognized by the NER model (i.e., those not matching the GAS), over the total number of entities that appear in the document and are either correctly detected, incorrectly detected, or not detected at all. Formally, this can be expressed as: $\text{Falsely identified} = \frac{\text{incorrect}}{\text{correct} + \text{incorrect} + \text{missing}}$.
- Not identified is defined as the fraction of named entities that were not detected by the NER model, over the total number of entities that appear in the document and are either correctly detected, incorrectly detected, or not detected at all. Formally, this can be expressed as: $\text{Not identified} = \frac{\text{missing}}{\text{correct} + \text{incorrect} + \text{missing}}$.
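As a worked illustration of the six measures, the sketch below computes them from four entity counts, following the denominators given in the prose definitions. The function name `ner_metrics`, the example counts, and the use of the balanced F-Measure are assumptions of this sketch.

```python
def ner_metrics(correct, incorrect, spurious, missing):
    """Compute the evaluation measures from entity counts."""

    def safe_div(num, den):
        # Avoid division by zero on empty inputs.
        return num / den if den else 0.0

    precision = safe_div(correct, correct + incorrect + spurious)
    recall = safe_div(correct, correct + incorrect + missing)
    f_measure = safe_div(2 * precision * recall, precision + recall)
    total = correct + incorrect + missing
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "correctly_identified": safe_div(correct, total),
        "falsely_identified": safe_div(incorrect, total),
        "not_identified": safe_div(missing, total),
    }


# Made-up counts: 8 correct, 1 incorrect, 1 spurious, 1 missing entity.
metrics = ner_metrics(correct=8, incorrect=1, spurious=1, missing=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these definitions, the last three measures share one denominator and therefore sum to one.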
5.3. Sentiment Analysis of User Reviews
5.4. Recommendation of Sightseeing Trajectories
5.4.1. Phase 1: Automatic Translation of Multilingual Review Content
5.4.2. Phase 2: Location Entity Extraction
5.4.3. Phase 3: Sentiment Analysis of User Reviews Containing Routes
5.4.4. Phase 4: Route Generation and Recommendation


6. Discussion
6.1. Data Workflow
6.2. Methodology Validation
6.3. Limitations
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ardito, L.; Cerchione, R.; Del Vecchio, P.; Raguseo, E. Big data in smart tourism: Challenges, issues and opportunities. Curr. Issues Tour. 2019, 22, 1805–1809. [Google Scholar] [CrossRef]
- Xiang, Z.; Fesenmaier, D.R. Big data analytics, tourism design and smart tourism. In Analytics in Smart Tourism Design: Concepts and Methods; Springer: Cham, Switzerland, 2017; pp. 299–307. [Google Scholar]
- Özemre, M.; Kabadurmus, O. A big data analytics based methodology for strategic decision making. J. Enterp. Inf. Manag. 2020, 33, 1467–1490. [Google Scholar] [CrossRef]
- Li, J.; Xu, L.; Tang, L.; Wang, S.; Li, L. Big data in tourism research: A literature review. Tour. Manag. 2018, 68, 301–323. [Google Scholar] [CrossRef]
- Vajjhala, N.R.; Strang, K.D. Measuring organizational-fit through socio-cultural big data. New Math. Nat. Comput. 2017, 13, 145–158. [Google Scholar] [CrossRef]
- Lomborg, S.; Bechmann, A. Using APIs for Data Collection on Social Media. Inf. Soc. 2014, 30, 256–265. [Google Scholar] [CrossRef]
- Murray-Rust, P. Open data in science. Nat. Preced. 2008, 34, 52–64. [Google Scholar] [CrossRef]
- Couper, M.P. New developments in survey data collection. Annu. Rev. Sociol. 2017, 43, 121–145. [Google Scholar] [CrossRef]
- Fuhr, N.; Tsakonas, G.; Aalberg, T.; Agosti, M.; Hansen, P.; Kapidakis, S.; Klas, C.; Kovács, L.; Landoni, M.; Micsik, A.; et al. Evaluation of digital libraries. Int. J. Digit. Libr. 2007, 8, 21–38. [Google Scholar] [CrossRef]
- Khder, M.A. Web Scraping or Web Crawling: State of Art, Techniques, Approaches and Application. Int. J. Adv. Soft Comput. Its Appl. 2021, 13, 145–168. [Google Scholar] [CrossRef]
- Deligiannis, K.; Tryfonopoulos, C.; Raftopoulou, P.; Platis, N.; Vassilakis, C. EnQuest: A Cloud Datalake Infrastructure for Heterogeneous Analytics in Maritime and Tourism Domains. In Proceedings of the DATA 2023 (demo), Rome, Italy, 11–13 July 2023. [Google Scholar]
- Barbera, G.; Araujo, L.; Fernandes, S.C. The Value of Web Data Scraping: An Application to TripAdvisor. Big Data Cogn. Comput. 2023, 7, 121. [Google Scholar] [CrossRef]
- Deligiannis, K.; Raftopoulou, P.; Tryfonopoulos, C.; Platis, N.; Vassilakis, C. Hydria: An Online Data Lake for Multi-Faceted Analytics in the Cultural Heritage Domain. Big Data Cogn. Comput. 2020, 4, 7. [Google Scholar] [CrossRef]
- Vonitsanos, G.; Kanavos, A.; Mohasseb, A.; Tsolis, D. A NoSQL Approach for Aspect Mining of Cultural Heritage Streaming Data. In Proceedings of the 10th International Conference on Information, Intelligence, Systems and Applications, IISA 2019, Patras, Greece, 15–17 July 2019; Bourbakis, N.G., Tsihrintzis, G.A., Virvou, M., Eds.; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar] [CrossRef]
- Freire, N.; Silva, M.J. Domain-Focused Linked Data Crawling Driven by a Semantically Defined Frontier—A Cultural Heritage Case Study in Europeana. In Proceedings of the Digital Libraries at Times of Massive Societal Transition—22nd International Conference on Asia-Pacific Digital Libraries, ICADL 2020, Kyoto, Japan, 30 November–1 December 2020; Ishita, E., Pang, N.L., Zhou, L., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12504, pp. 340–348. [Google Scholar] [CrossRef]
- Wang, W.; Yu, L. UCrawler: A learning-based web crawler using a URL knowledge base. J. Comput. Methods Sci. Eng. 2021, 21, 461–474. [Google Scholar] [CrossRef]
- Wang, K.; Jalal, M.; Jefferson, S.; Zheng, Y.; Nsoesie, E.O.; Betke, M. Scraping Social Media Photos Posted in Kenya and Elsewhere to Detect and Analyze Food Types. arXiv 2019, arXiv:1909.00134. Available online: http://arxiv.org/abs/1909.00134 (accessed on 13 November 2025). [CrossRef]
- Autelitano, A.; Pernici, B.; Scalia, G. Spatio-temporal mining of keywords for social media cross-social crawling of emergency events. GeoInformatica 2019, 23, 425–447. [Google Scholar] [CrossRef]
- AlZu’bi, S.; Aqel, D.; Mughaid, A.; Jararweh, Y. A Multi-Levels Geo-Location Based Crawling Method for Social Media Platforms. In Proceedings of the Sixth International Conference on Social Networks Analysis, Management and Security, SNAMS 2019, Granada, Spain, 22–25 October 2019; Alsmirat, M.A., Jararweh, Y., Eds.; IEEE: Piscataway, NJ, USA, 2019; pp. 494–498. [Google Scholar] [CrossRef]
- Xu, K.; Gao, K.Y.; Callan, J. A Structure-Oriented Unsupervised Crawling Strategy for Social Media Sites. arXiv 2018, arXiv:1804.02734. Available online: http://arxiv.org/abs/1804.02734 (accessed on 13 November 2025). [CrossRef]
- Erlandsson, F.; Bródka, P.; Boldt, M.; Johnson, H. Do We Really Need to Catch Them All? A New User-Guided Social Media Crawling Method. Entropy 2017, 19, 686. [Google Scholar] [CrossRef]
- Wani, M.A.; Agarwal, N.; Jabin, S.; Hussai, S.Z. Design of iMacros-Based Data Crawler and the Behavioral Analysis of Facebook Users. arXiv 2018, arXiv:1802.09566. [Google Scholar]
- Seo, M.; Kim, J.; Yang, H. Frequent Interaction and Fast Feedback Predict Perceived Social Support: Using Crawled and Self-Reported Data of Facebook Users. J. Comput. Mediat. Commun. 2016, 21, 282–297. [Google Scholar] [CrossRef]
- Hernandez-Suarez, A.; Sanchez-Perez, G.; Toscano-Medina, K.; Martinez-Hernandez, V.; Sanchez, V.; Pérez-Meana, H. A Web Scraping Methodology for Bypassing Twitter API Restrictions. arXiv 2018, arXiv:1803.09875. Available online: http://arxiv.org/abs/1803.09875 (accessed on 13 November 2025). [CrossRef]
- El Akbar, R.R.; Shofa, R.N.; Paripurna, M.I.; Supratman. The implementation of Naïve Bayes algorithm for classifying tweets containing hate speech with political motive. In Proceedings of the 2019 International Conference on Sustainable Engineering and Creative Computing (ICSECC), Bandung, Indonesia, 20–22 August 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 144–148. [Google Scholar]
- Xiang, Z.; Du, Q.; Ma, Y.; Fan, W. Assessing reliability of social media data: Lessons from mining TripAdvisor hotel reviews. J. Inf. Technol. Tour. 2018, 18, 43–59. [Google Scholar] [CrossRef]
- Li, J.; Yang, L. A Rule-Based Chinese Sentiment Mining System with Self-Expanding Dictionary—Taking TripAdvisor as an Example. In Proceedings of the 14th IEEE International Conference on e-Business Engineering, ICEBE 2017, Shanghai, China, 4–6 November 2017; Hussain, O., Jiang, L., Fei, X., Lan, C., Chao, K., Eds.; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 238–242. [Google Scholar] [CrossRef]
- Xhumari, E.; Xhumari, I. A review of web crawling approaches. In Proceedings of the 4th International Conference on Recent Trends and Applications in Computer Science and Information Technology, Tirana, Albania, 21–22 May 2021; Volume 2872, pp. 158–163. [Google Scholar]
- Gupta, A.; Anand, P. Focused web crawlers and its approaches. In Proceedings of the 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), Greater Noida, India, 25–27 February 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 619–622. [Google Scholar]
- Saini, C.; Arora, V. Information retrieval in web crawling: A survey. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2016, Jaipur, India, 21–24 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2635–2643. [Google Scholar] [CrossRef]
- Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 2007, 19, 1–16. [Google Scholar] [CrossRef]
- Martins, B. A Supervised Machine Learning Approach for Duplicate Detection over Gazetteer Records. In Proceedings of the GeoSpatial Semantics–4th International Conference, GeoS 2011, Brest, France, 12–13 May 2011; Claramunt, C., Levashkin, S., Bertolotto, M., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6631, pp. 34–51. [Google Scholar] [CrossRef]
- Santos, R.; Murrieta-Flores, P.; Martins, B. Learning to combine multiple string similarity metrics for effective toponym matching. Int. J. Digit. Earth 2018, 11, 913–938. [Google Scholar] [CrossRef]
- Long, Y.; Li, H.; Wan, Z.; Tian, P. Data Redundancy Detection Algorithm based on Multidimensional Similarity. In Proceedings of the 2023 International Conference on Frontiers of Robotics and Software Engineering (FRSE), Changsha, China, 16–18 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 180–187. [Google Scholar]
- Omar, Z.A.; Abu Bakar, M.A.; Zamzuri, Z.H.; Ariff, N.M. Duplicate Detection Using Unsupervised Random Forests: A Preliminary Analysis. In Proceedings of the 2022 3rd International Conference on Artificial Intelligence and Data Sciences (AiDAS), Ipoh, Malaysia, 7–8 September 2022; pp. 66–71. [Google Scholar] [CrossRef]
- Lin, Y.; Wang, H.; Li, J.; Gao, H. Efficient Entity Resolution on Heterogeneous Records. IEEE Trans. Knowl. Data Eng. 2020, 32, 912–926. [Google Scholar] [CrossRef]
- Zhang, D.; Li, D.; Guo, L.; Tan, K.L. Unsupervised Entity Resolution with Blocking and Graph Algorithms. IEEE Trans. Knowl. Data Eng. 2022, 34, 1501–1515. [Google Scholar] [CrossRef]
- Cao, K.; Liu, H. Entity Resolution Algorithm for Heterogeneous Data Sources. In Proceedings of the 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI), Kunming, China, 17–19 September 2021; pp. 553–557. [Google Scholar] [CrossRef]
- Karapiperis, D.; Gkoulalas-Divanis, A.; Verykios, V.S. MultiBlock: A Scalable Iterative Approach for Progressive Entity Resolution. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 219–228. [Google Scholar] [CrossRef]
- Park, S.; Lee, S.; Woo, S.S. BertLoc: Duplicate location record detection in a large-scale location dataset. In Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC ’21, New York, NY, USA, 22–26 March 2021; pp. 942–951. [Google Scholar] [CrossRef]
- Gu, Q.; Dong, Y.; Hu, Y.; Liu, Y. A Method for Duplicate Record Detection Using Deep Learning. In Proceedings of the Web Information Systems and Applications—16th International Conference, WISA 2019, Qingdao, China, 20–22 September 2019; Ni, W., Wang, X., Song, W., Li, Y., Eds.; Springer: Cham, Switzerland, 2019; pp. 85–91. [Google Scholar] [CrossRef]
- Lattar, H.; Ben Salem, A.; Ben Ghezala, H.H. Duplicate record detection approach based on sentence embeddings. In Proceedings of the 2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Virtual, 10–13 September 2020; pp. 269–274. [Google Scholar] [CrossRef]
- Ziv, R.; Gronau, I.; Fire, M. CompanyName2Vec: Company Entity Matching based on Job Ads. In Proceedings of the 9th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2022, Shenzhen, China, 13–16 October 2022; Huang, J.Z., Pan, Y., Hammer, B., Khan, M.K., Xie, X., Cui, L., He, Y., Eds.; IEEE: Piscataway, NJ, USA, 2022; pp. 1–10. [Google Scholar] [CrossRef]
- Zhang, P. Similar Duplicate Record Detection of Big Data Based on Entropy Grouping Clustering. In Proceedings of the 2022 5th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Wuhan, China, 22–24 April 2022; pp. 646–650. [Google Scholar] [CrossRef]
- Abbes, H.; Gargouri, F. MongoDB-Based Modular Ontology Building for Big Data Integration. J. Data Semant. 2018, 7, 1–27. [Google Scholar] [CrossRef]
- Song, R.; Yu, T.; Chen, Y.; Chen, Y.; Xia, B. An Approximately Duplicate Records Detection Method for Electric Power Big Data Based on Spark and IPOP-Simhash. J. Inf. Hiding Multim. Signal Process. 2018, 9, 410–422. [Google Scholar]
- Qi, Y.; Ren, W.; Shi, M.; Liu, Q. A Combinatorial Method based on Machine Learning Algorithms for Enhancing Cultural Economic Value. Int. J. Perform. Eng. 2020, 16, 1105–1117. [Google Scholar] [CrossRef]
- Penagos-Londoño, G.I.; Rodriguez-Sanchez, C.; Ruiz-Moreno, F.; Torres, E. A machine learning approach to segmentation of tourists based on perceived destination sustainability and trustworthiness. J. Destin. Mark. Manag. 2021, 19, 100532. [Google Scholar] [CrossRef]
- Kar, A.K.; Choudhary, S.K.; Ilavarasan, P.V. How can we improve tourism service experiences: Insights from multi-stakeholders’ interaction. Decision 2023, 50, 73–89. [Google Scholar] [CrossRef]
- Tae, K.H.; Roh, Y.; Oh, Y.H.; Kim, H.; Whang, S.E. Data cleaning for accurate, fair, and robust models: A big data-AI integration approach. In Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, Amsterdam, The Netherlands, 30 June 2019; pp. 1–4. [Google Scholar]
- Fu, Y.; Shi, K.; Xi, L. Artificial intelligence and machine learning in the preservation and innovation of intangible cultural heritage: Ethical considerations and design frameworks. Digit. Scholarsh. Humanit. 2025, 40, 487–508. [Google Scholar] [CrossRef]
- de la Rosa, J. Machine learning at the National Library of Norway. In Navigating Artificial Intelligence for Cultural Heritage Organisations; Jaillant, L., Warwick, C., Gooding, P., Aske, K., Layne-Worthey, G., Downie, J.S., Eds.; UCL Press: London, UK, 2025; pp. 61–90. [Google Scholar] [CrossRef]
- Sousa, J.J.; Lin, J.; Wang, Q.; Liu, G.; Fan, J.; Bai, S.; Zhao, H.; Pan, H.; Wei, W.; Rittlinger, V.; et al. Using machine learning and satellite data from multiple sources to analyze mining, water management, and preservation of cultural heritage. Geo Spat. Inf. Sci. 2024, 27, 552–571. [Google Scholar] [CrossRef]
- Nadkarni, P.M.; Ohno-Machado, L.; Chapman, W.W. Natural language processing: An introduction. J. Am. Med. Inform. Assoc. 2011, 18, 544–551. [Google Scholar] [CrossRef]
- Schmitt, X.; Kubler, S.; Robert, J.; Papadakis, M.; LeTraon, Y. A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate. In Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 338–343. [Google Scholar]
- Miao, Y.; Jin, Z.; Zhang, Y.; Chen, Y.; Lai, J. Compare Machine Learning Models in Text Classification Using Steam User Reviews. In Proceedings of the 2021 3rd International Conference on Software Engineering and Development (ICSED), Xiamen, China, 19–21 November 2021; pp. 40–45. [Google Scholar]
- Kadam, K.; Godbole, S.; Joijode, D.; Karoshi, S.; Jadhav, P.; Shilaskar, S. Multilingual Information Retrieval Chatbot. In Modern Approaches in Machine Learning & Cognitive Science: A Walkthrough; Studies in Computational Intelligence; Springer: Cham, Switzerland, 2022; Volume 1027, pp. 107–121. [Google Scholar] [CrossRef]
- Ferreira, J.; Gonçalo Oliveira, H.; Rodrigues, R. Improving NLTK for processing Portuguese. In Proceedings of the 8th Symposium on Languages, Applications and Technologies (SLATE 2019), Coimbra, Portugal, 27–28 June 2019; Schloss Dagstuhl-Leibniz-Zentrum für Informatik: Wadern, Germany, 2019. [Google Scholar] [CrossRef]
- Kapan, A.; Kirmizialtin, S.; Kukreja, R.; Wrisley, D.J. Fine Tuning NER with spaCy for Transliterated Entities Found in Digital Collections from the Multilingual Arabian/Persian Gulf. In Proceedings of the 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, 15–18 March 2022. [Google Scholar]
- Rouhou, A.C.; Dhiaf, M.; Kessentini, Y.; Salem, S.B. Transformer-based approach for joint handwriting and named entity recognition in historical document. Pattern Recognit. Lett. 2022, 155, 128–134. [Google Scholar] [CrossRef]
- Hanks, C.; Maiden, M.; Ranade, P.; Finin, T.; Joshi, A. CyberEnt: Extracting Domain Specific Entities from Cybersecurity Text. In Proceedings of the Mid-Atlantic Student Colloquium on Speech, Language and Learning, Philadelphia, PA, USA, 30 April 2022. [Google Scholar]
- Alam, T.; Bhusal, D.; Park, Y.; Rastogi, N. Cyner: A python library for cybersecurity named entity recognition. arXiv 2022, arXiv:2204.05754. [Google Scholar] [CrossRef]
- Koloveas, P.; Chantzios, T.; Alevizopoulou, S.; Skiadopoulos, S.; Tryfonopoulos, C. Intime: A machine learning-based framework for gathering and leveraging web data to cyber-threat intelligence. Electronics 2021, 10, 818. [Google Scholar] [CrossRef]
- De Magistris, G.; Russo, S.; Roma, P.; Starczewski, J.T.; Napoli, C. An explainable fake news detector based on named entity recognition and stance classification applied to COVID-19. Information 2022, 13, 137. [Google Scholar] [CrossRef]
- ElDin, H.G.; AbdulRazek, M.; Abdelshafi, M.; Sahlol, A.T. Med-Flair: Medical named entity recognition for diseases and medications based on Flair embedding. Procedia Comput. Sci. 2021, 189, 67–75. [Google Scholar] [CrossRef]
- Patel, H. Bionerflair: Biomedical named entity recognition using flair embedding and sequence tagger. arXiv 2020, arXiv:2011.01504. [Google Scholar]
- Kim, H.; Kang, J. How Do Your Biomedical Named Entity Recognition Models Generalize to Novel Entities? IEEE Access 2022, 10, 31513–31523. [Google Scholar] [CrossRef] [PubMed]
- Chai, Z.; Jin, H.; Shi, S.; Zhan, S.; Zhuo, L.; Yang, Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinform. 2022, 23, 8. [Google Scholar] [CrossRef] [PubMed]
- Uronen, L.; Salanterä, S.; Hakala, K.; Hartiala, J.; Moen, H. Combining supervised and unsupervised named entity recognition to detect psychosocial risk factors in occupational health checks. Int. J. Med Inform. 2022, 160, 104695. [Google Scholar] [CrossRef]
- Tambuscio, M.; Andrews, T.L. Geolocation and Named Entity Recognition in Ancient Texts: A Case Study about Ghewond’s Armenian History. In Proceedings of the CHR, Online, 17–19 November 2021; pp. 136–148. [Google Scholar]
- Milanova, I.; Silc, J.; Serucnik, M.; Eftimov, T.; Gjoreski, H. LOCALE: A Rule-based Location Named-entity Recognition Method for Latin Text. In Proceedings of the 5th International Workshop on Computational History, HistoInformatics@TPDL 2019, Oslo, Norway, 12 September 2019; Volume 2461, pp. 13–20. [Google Scholar]
- Molina-Villegas, A.; Muñiz-Sanchez, V.; Arreola-Trapala, J.; Alcántara, F. Geographic Named Entity Recognition and Disambiguation in Mexican News using word embeddings. Expert Syst. Appl. 2021, 176, 114855. [Google Scholar] [CrossRef]
- Koirala, S. Event Discovery from Social Media Feeds. 2021. Available online: https://urn.fi/URN:NBN:fi:aalto-2021121910935 (accessed on 13 November 2025).
- Finin, T.; Murnane, W.; Karandikar, A.; Keller, N.; Martineau, J.; Dredze, M. Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Text Language Data with Amazon’s Mechanical Turk, Los Angeles, CA, USA, 6 June 2010. [Google Scholar]
- Liu, L.; Wang, M.; Zhang, M.; Qing, L.; He, X. UAMNer: Uncertainty-aware multimodal named entity recognition in social media posts. Appl. Intell. 2022, 52, 4109–4125. [Google Scholar] [CrossRef]
- Asgari-Chenaghlu, M.; Feizi-Derakhshi, M.R.; Farzinvash, L.; Balafar, M.; Motamed, C. CWI: A multimodal deep learning approach for named entity recognition from social media using character, word and image features. Neural Comput. Appl. 2022, 34, 1905–1922. [Google Scholar] [CrossRef]
- Eligüzel, N.; Çetinkaya, C.; Dereli, T. Application of named entity recognition on tweets during earthquake disaster: A deep learning-based approach. Soft Comput. 2022, 26, 395–421. [Google Scholar] [CrossRef]
- Egger, R.; Gokce, E. Natural Language Processing (NLP): An Introduction: Making Sense of Textual Data. In Applied Data Science in Tourism: Interdisciplinary Approaches, Methodologies, and Applications; Springer International Publishing: Cham, Switzerland, 2022; pp. 307–334. [Google Scholar]
- Bouabdallaoui, I.; Guerouate, F.; Bouhaddour, S.; Saadi, C.; Sbihi, M. Named Entity Recognition applied on Moroccan tourism corpus. Procedia Comput. Sci. 2022, 198, 373–378. [Google Scholar] [CrossRef]
- Chantrapornchai, C.; Tunsakul, A. Information extraction on tourism domain using SpaCy and BERT. ECTI Trans. Comput. Inform. Technol. 2021, 15, 108–122. [Google Scholar]
- Montoyo, A.; Martínez-Barco, P.; Balahur, A. Subjectivity and sentiment analysis: An overview of the current state of the area and envisaged developments. Decis. Support Syst. 2012, 53, 675–679. [Google Scholar] [CrossRef]
- Kauffmann, E.; Peral, J.; Gil, D.; Ferrández, A.; Sellers, R.; Mora, H. Managing marketing decision-making with sentiment analysis: An evaluation of the main product features using text data mining. Sustainability 2019, 11, 4235. [Google Scholar] [CrossRef]
- Ahuja, S.; Dubey, G. Clustering and sentiment analysis on Twitter data. In Proceedings of the 2017 2nd International Conference on Telecommunication and Networks (TEL-NET), Noida, India, 10–11 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5. [Google Scholar]
- Bravo-Marquez, F.; Mendoza, M.; Poblete, B. Combining strengths, emotions and polarities for boosting Twitter sentiment analysis. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM 2013, Chicago, IL, USA, 11 August 2013; Cambria, E., Liu, B., Zhang, Y., Xia, Y., Eds.; ACM: New York, NY, USA, 2013; pp. 1–9. [Google Scholar] [CrossRef]
- Subbaiah, B.; Murugesan, K.; Saravanan, P.; Marudhamuthu, K. An efficient multimodal sentiment analysis in social media using hybrid optimal multi-scale residual attention network. Artif. Intell. Rev. 2024, 57, 34. [Google Scholar] [CrossRef]
- Thareja, R. Multimodal Sentiment Analysis of Social Media Content and Its Impact on Mental Wellbeing: An Investigation of Extreme Sentiments. In Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), Bangalore, India, 4–7 January 2024; Natarajan, S., Bhattacharya, I., Singh, R., Kumar, A., Ranu, S., Bali, K., K, A., Eds.; ACM: New York, NY, USA, 2024; pp. 469–473. [Google Scholar] [CrossRef]
- Rodríguez-Ibáñez, M.; Casañez-Ventura, A.; Castejón-Mateos, F.; Cuenca-Jiménez, P.M. A review on sentiment analysis from social media platforms. Expert Syst. Appl. 2023, 223, 119862. [Google Scholar] [CrossRef]
- Cuzzocrea, A. Big data lakes: Models, frameworks, and techniques. In Proceedings of the 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju Island, Republic of Korea, 17–20 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–4. [Google Scholar]
- Facebook. Available online: https://www.facebook.com/ (accessed on 18 October 2025).
- TripAdvisor. Available online: https://www.tripadvisor.com/ (accessed on 18 October 2025).
- Google Maps. Available online: https://www.google.com/maps (accessed on 18 October 2025).
- Odysseus Portal—Ministry of Culture and Sports. Available online: http://odysseus.culture.gr/index_en.html (accessed on 18 October 2025).
- Search Culture—Culture in the Digital Public Space. Available online: https://www.searchculture.gr/aggregator/portal/?language=en (accessed on 18 October 2025).
- Portal of Cultural Stakeholders. Available online: https://portal.culture.gov.gr/ (accessed on 18 October 2025).
- Athens Culture Net. Available online: https://www.cityofathens.gr/who/athens-culture-net/ (accessed on 13 October 2025).
- X Microblogging Platform. Available online: https://x.com/ (accessed on 18 October 2025).
- Hridoy, M.T.A.; Saha, S.R.; Islam, M.M.; Uddin, M.A.; Mahmud, M.Z. Leveraging web scraping and stacking ensemble machine learning techniques to enhance detection of major depressive disorder from social media posts. Soc. Netw. Anal. Min. 2024, 14, 239. [Google Scholar] [CrossRef]
- Reddit Platform. Available online: https://www.reddit.com/ (accessed on 18 October 2025).
- Bouabdelli, L.F.; Abdelhedi, F.; Hammoudi, S.; Hadjali, A. An Advanced Entity Resolution in Data Lakes: First Steps. In Proceedings of the 14th International Conference on Data Science, Technology and Applications, DATA 2025, Bilbao, Spain, 10–12 June 2025; pp. 661–668. [Google Scholar] [CrossRef]
- Liu, C.; Rong, X. Automated Graph Attention Network for Heterogeneous Entity Resolution. In Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, 6–11 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar] [CrossRef]
- Levchenko, M. Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM. arXiv 2025, arXiv:2506.02589. [Google Scholar] [CrossRef]
- Li, Y.; Yan, H.; Yang, Y.; Wang, X. A Method for Cultural Relics Named Entity Recognition Based on Enhanced Lexical Features. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2024, Yokohama, Japan, 30 June–5 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–8. [Google Scholar] [CrossRef]
- SpaCy, Industrial-Strength Natural Language Processing. Available online: https://spacy.io/ (accessed on 18 October 2025).
- Kumar, D.; Pandey, S.; Patel, P.; Choudhari, K.; Hajare, A.; Jante, S. Generalized Named Entity Recognition Framework. In Proceedings of the 2021 Asian Conference on Innovation in Technology (ASIANCON), Pune, India, 27–29 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–4. [Google Scholar]
- Ferro, S.; Giovanelli, R.; Leeson, M.; Bernardin, M.D.; Traviglia, A. A novel NLP-driven approach for enriching artefact descriptions, provenance, and entities in cultural heritage. Neural Comput. Appl. 2025, 37, 21275–21296. [Google Scholar] [CrossRef]
- Zheng, Y.; Li, F.; Li, C.; Zhang, Z.; Cao, R.; Noman, S.M. A Natural Language Processing Model for Automated Organization and Analysis of Intangible Cultural Heritage. J. Organ. End User Comput. 2024, 36, 1–27. [Google Scholar] [CrossRef]
- Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028. [Google Scholar] [CrossRef]
- Akbik, A.; Bergmann, T.; Blythe, D.; Rasul, K.; Schweter, S.; Vollgraf, R. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, MN, USA, 2–7 June 2019; pp. 54–59. [Google Scholar]
- Čeović, H.; Kurdija, A.S.; Delač, G.; Šilić, M. Named Entity Recognition for Addresses: An Empirical Study. IEEE Access 2022, 10, 42108–42120. [Google Scholar] [CrossRef]
- Yu, J.; Ji, B.; Li, S.; Ma, J.; Liu, H.; Xu, H. S-NER: A Concise and Efficient Span-Based Model for Named Entity Recognition. Sensors 2022, 22, 2852. [Google Scholar] [CrossRef]
- Zhao, X.; Greenberg, J.; An, Y.; Hu, X.T. Fine-Tuning BERT Model for Materials Named Entity Recognition. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3717–3720. [Google Scholar]
- Krovetz, R.; Deane, P.; Madnani, N. The Web is not a PERSON, Berners-Lee is not an ORGANIZATION, and African-Americans are not LOCATIONS: An Analysis of the Performance of Named-Entity Recognition. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World, MWE@ACL 2011, Portland, OR, USA, 23 June 2011; Kordoni, V., Ramisch, C., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 57–64. [Google Scholar]
- Mehrabi, N.; Gowda, T.; Morstatter, F.; Peng, N.; Galstyan, A. Man is to Person as Woman is to Location: Measuring Gender Bias in Named Entity Recognition. In Proceedings of the HT ’20: 31st ACM Conference on Hypertext and Social Media, Virtual Event, 13–15 July 2020; Gadiraju, U., Ed.; ACM: New York, NY, USA, 2020; pp. 231–232. [Google Scholar] [CrossRef]
- Myers, D.; McGuffee, J.W. Choosing scrapy. J. Comput. Sci. Coll. 2015, 31, 83–89. [Google Scholar]
- Hoffman, P.; Grana, D.; Olveyra, M.; Garcia, G.; Cetrulo, M.; Bogomyagkov, A.; Canabal, D.; Moreira, A.; Carnales, I.; Aguirre, M.; et al. Scrapy at a Glance. Available online: https://docs.scrapy.org/en/latest/intro/overview.html (accessed on 18 October 2025).
- Vassilakis, C.; Poulopoulos, V.; Wallace, M.; Antoniou, A.; Lepouras, G. TripMentor Project: Scope and Challenges. In Proceedings of the Workshop on Cultural Informatics Co-Located with the 14th International Workshop on Semantic and Social Media Adaptation and Personalization, CI@SMAP 2019, Larnaca, Cyprus, 9 June 2019; Volume 2412. [Google Scholar]
- Mehta, H.; Kanani, P.; Lande, P. Google maps. Int. J. Comput. Appl. 2019, 178, 41–46. [Google Scholar] [CrossRef]
- Selenium Web Driver. Available online: https://www.selenium.dev/documentation/webdriver/ (accessed on 18 October 2025).
- Peng, J.; Ma, Y.; Zhou, F.-r.; Wang, S.-l.; Zheng, Z.-z.; Li, J. Web Crawler of Power Grid Based on Selenium. In Proceedings of the 2019 16th International Computer Conference on Wavelet Active Media Technology and Information Processing, Chengdu, China, 13–15 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 114–118. [Google Scholar]
- USA: Web Scraping Held to Be Legal in Lawsuit Brought by LinkedIn over Privacy Concerns. Available online: https://www.business-humanrights.org/en/latest-news/usa-web-scraping-held-to-be-legal-in-lawsuit-brought-by-linkedin-over-privacy-concerns/ (accessed on 18 October 2025).
- Google Catches Bing Copying; Microsoft Says ‘So What?’. Available online: https://www.wired.com/2011/02/bing-copies-google/ (accessed on 18 October 2025).
- European Parliament; Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679 (accessed on 18 October 2025).
- Yelp Open Dataset. Available online: https://business.yelp.com/data/resources/open-dataset/ (accessed on 18 October 2025).
- European Data—Open Data and Tourism. Available online: https://data.europa.eu/en/publications/datastories/open-data-and-tourism (accessed on 18 October 2025).
- ACHE Crawler Documentation. Available online: https://ache.readthedocs.io/en/latest/ (accessed on 18 October 2025).
- Vieira, K.; Barbosa, L.; da Silva, A.S.; Freire, J.; de Moura, E.S. Finding seeds to bootstrap focused crawlers. World Wide Web 2016, 19, 449–474. [Google Scholar] [CrossRef]
- Deligiannis, K.; Raftopoulou, P.; Tryfonopoulos, C.; Vassilakis, C. A System for Collecting, Managing, Analyzing and Sharing Diverse, Multi-Faceted Cultural Heritage and Tourism Data. In Proceedings of the 16th International Workshop on Semantic and Social Media Adaptation & Personalization, SMAP 2021, Corfu, Greece, 4–5 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Hurwit, J.M. The Athenian Acropolis: History, Mythology, and Archaeology from the Neolithic Era to the Present; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar]
- Androutsopoulos, J. ‘Greeklish’: Transliteration practice and discourse in the context of computer-mediated digraphia. In Standard Languages and Language Standards–Greek, Past and Present; Routledge: London, UK, 2016; pp. 249–278. [Google Scholar]
- Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. A Survey of Blocking and Filtering Techniques for Entity Resolution. arXiv 2020, arXiv:1905.06167. [Google Scholar] [CrossRef]
- Greenspan, M.; Yurick, M. Approximate kd tree search for efficient ICP. In Proceedings of the Fourth International Conference on 3-D Digital Imaging and Modeling, 2003, 3DIM 2003, Proceedings, Banff, AB, Canada, 6–10 October 2003; IEEE: Piscataway, NJ, USA, 2003; pp. 442–448. [Google Scholar]
- Foley, T.; Sugerman, J. KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Los Angeles, CA, USA, 30–31 July 2005; pp. 15–22. [Google Scholar]
- SpaCyTextBlob, A TextBlob Sentiment Analysis Pipeline Component for spaCy. Available online: https://spacy.io/universe/project/spacy-textblob (accessed on 18 October 2025).
- Deep-Translator, A Flexible Free and Unlimited Python Tool to Translate Between Different Languages in a Simple Way Using Multiple Translators. Available online: https://deep-translator.readthedocs.io/en/latest/index.html (accessed on 18 October 2025).
- Paga, J.; Miles, M.M. The archaic temple of Poseidon at Sounion. Hesperia J. Am. Sch. Class. Stud. Athens 2016, 85, 657–710. [Google Scholar]
- Hugging Face Transformers, State-of-the-Art Machine Learning for PyTorch, TensorFlow, and JAX. Available online: https://huggingface.co/docs/transformers/index (accessed on 18 October 2025).
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
- DBpedia Ontology. Available online: https://dbpedia.org/ontology/ (accessed on 18 October 2025).
- Mendes, P.N.; Jakob, M.; García-Silva, A.; Bizer, C. DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, I-Semantics ’11, Graz, Austria, 7–9 September 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 1–8. [Google Scholar] [CrossRef]
- SpaCy Entity Ruler, Pipeline Component for Rule-Based Named Entity Recognition. Available online: https://spacy.io/api/entityruler (accessed on 18 October 2025).
- Language Class, a Text-Processing Pipeline for SpaCy. Available online: https://spacy.io/api/language (accessed on 18 October 2025).
- Doc Class, a Container for Accessing Linguistic Annotations. Available online: https://spacy.io/api/doc (accessed on 18 October 2025).
- Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
- Grinberg, M. Flask Web Development: Developing Web Applications with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018. [Google Scholar]
- Google Maps Directions. Available online: https://developers.google.com/maps/documentation/urls/get-started (accessed on 18 October 2025).
- Nominatim, Open-Source Geocoding with OpenStreetMap Data. Available online: https://nominatim.org/ (accessed on 18 October 2025).
- OpenStreetMap API v0.6. Available online: https://wiki.openstreetmap.org/wiki/API_v0.6 (accessed on 18 October 2025).
- MongoDB—Sharding. Available online: https://www.mongodb.com/docs/manual/sharding/ (accessed on 18 October 2025).

| Id | Name | Category |
|---|---|---|
| 7408 | Acropolis, Athens Greece | Arts and Entertainment |
| 7642 | Atina Akrapolis | Arts and Entertainment |
| 7753 | אקרופוליס אתונה | Arts and Entertainment |
| 13618 | Athena, Acropolis | Hotels |
| 14029 | Acropole d’Athènes | Landmarks |
| 14037 | Vraxakia Acropole | Landmarks |
| 14071 | Akropoli | Landmarks |
| 14077 | Acropolis (Athens) | Landmarks |
| 14078 | The Acropolis, Athens | Landmarks |
| 14307 | Grecja, Ateny, Akropol | Landmarks |
| 14308 | 雅典 城 | Landmarks |
| 14328 | 城 雅典 | Landmarks |
| 14379 | The Acropolis | Landmarks |
| 14403 | The Acropolis | Landmarks |
| 14465 | North and east slope of Acropolis | Landmarks |
| 14754 | Aκρόπολη Aθήνα | Museums |
| 14765 | Acropolis, Athens Greece | Museums |
| 14817 | Athenean Acropolis | Museums |
| 15197 | Akropolis Athena | Parks and Outdoors |

| # | Measure | Threshold |
|---|---|---|
| 1 | Damerau-Levenshtein | 0.55 |
| 2 | Jaro | 0.75 |
| 3 | Jaro-Winkler | 0.70 |
| 4 | Jaro-Winkler Reversed | 0.75 |
| 5 | Sorted Jaro-Winkler | 0.70 |
| 6 | Cosine N-Grams | 0.40 |
| 7 | Jaccard N-Grams | 0.25 |
| 8 | Jaccard Skip-grams | 0.45 |
| 9 | Dice Bi-Grams | 0.50 |

| Attribute | Entry 1 | Entry 2 | Entry 3 | Entry 4 |
|---|---|---|---|---|
| id | 17 | 2507 | 2959 | 5355 |
| name | Minnie the Moocher | Mia zwi tinexoume | Pizzoteca Nel Pireo | Tholos Cafe |
| lon | 23.74 | 23.71 | 23.64 | 23.71 |
| lat | 37.97 | 37.98 | 37.94 | 37.97 |
| cand_id | 2878 | 3831 | 5371 | 4371 |
| cand_name | Minnie The Moocher-Bar | Μια ζωή την έχουμε | Gilly Nel Pireo | Μέντωρ Cafe |
| cand_lon | 23.74 | 23.71 | 23.64 | 23.71 |
| cand_lat | 37.97 | 37.98 | 37.94 | 37.97 |
| Damerau_Levenstein | 0.82 | 0.68 | 0.55 | 0.54 |
| Jaro | 0.94 | 0.76 | 0.67 | 0.75 |
| Jaro_Winkler | 0.96 | 0.85 | 0.67 | 0.75 |
| Jaro_Winkler_Rev | 0.83 | 0.84 | 0.60 | 0.85 |
| Sorted_Jaro_Winkler | 0.85 | 0.76 | 0.60 | 0.85 |
| Cosine_N_Grams | 0.92 | 0.50 | 0.57 | 0.39 |
| Jac_N_Grams | 0.83 | 0.33 | 0.38 | 0.24 |
| Dice_Bi_Grams | 0.88 | 0.51 | 0.56 | 0.4 |
| Jac_Skip_grams | 0.87 | 0.50 | 0.54 | 0.43 |
| Damerau_Levenstein_status | 1 | 1 | 1 | 0 |
| Jaro_status | 1 | 1 | 0 | 1 |
| Jaro_Winkler_Rev_status | 1 | 1 | 0 | 1 |
| reverse_winkler_status | 1 | 1 | 0 | 1 |
| Sorted_Jaro_Winkler_status | 1 | 1 | 0 | 1 |
| Cosine_N_Grams_status | 1 | 1 | 1 | 0 |
| Jac_N_Grams_status | 1 | 1 | 1 | 0 |
| Dice_Bi_Grams_status | 1 | 1 | 1 | 0 |
| Jac_Skip_grams_status | 1 | 1 | 1 | 0 |
| on_translit | 0 | 1 | 0 | 1 |
| status | 1 | 1 | 1 | 0 |
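
The threshold table and the worked-example table above can be read together. The following is a minimal sketch, not the authors' implementation. It assumes that a measure votes for a match when its score is greater than or equal to its threshold, and that the final `status` is a strict majority over the nine votes. Both rules are inferred from the tabulated values rather than stated in the tables, and the names `THRESHOLDS`, `EXAMPLE_SCORES`, `measure_votes`, and `pair_status` are hypothetical.

```python
# Thresholds copied from the threshold table (measure -> minimum score).
# Keys reuse the column names of the worked-example table.
THRESHOLDS = {
    "Damerau_Levenstein": 0.55,
    "Jaro": 0.75,
    "Jaro_Winkler": 0.70,
    "Jaro_Winkler_Rev": 0.75,
    "Sorted_Jaro_Winkler": 0.70,
    "Cosine_N_Grams": 0.40,
    "Jac_N_Grams": 0.25,
    "Jac_Skip_grams": 0.45,
    "Dice_Bi_Grams": 0.50,
}

# Scores of the four example pairs, keyed by entry id and listed in the
# same order as THRESHOLDS (values copied from the worked-example table).
EXAMPLE_SCORES = {
    17:   (0.82, 0.94, 0.96, 0.83, 0.85, 0.92, 0.83, 0.87, 0.88),
    2507: (0.68, 0.76, 0.85, 0.84, 0.76, 0.50, 0.33, 0.50, 0.51),
    2959: (0.55, 0.67, 0.67, 0.60, 0.60, 0.57, 0.38, 0.54, 0.56),
    5355: (0.54, 0.75, 0.75, 0.85, 0.85, 0.39, 0.24, 0.43, 0.40),
}


def measure_votes(scores):
    """Return one 0/1 vote per measure (assumed rule: score >= threshold)."""
    return [int(s >= t) for s, t in zip(scores, THRESHOLDS.values())]


def pair_status(scores):
    """Assumed decision rule: match when more than half of the measures vote 1."""
    votes = measure_votes(scores)
    return int(sum(votes) > len(votes) / 2)


statuses = {entry_id: pair_status(s) for entry_id, s in EXAMPLE_SCORES.items()}
print(statuses)  # {17: 1, 2507: 1, 2959: 1, 5355: 0}
```

Under these assumptions the sketch reproduces the `status` row for all four entries: Entry 3 passes with five of nine votes, while Entry 4 fails with four. The `on_translit` flag and the coordinate columns are not modeled here.
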

| Source | # of Venues |
|---|---|
| Facebook | 10,405 |
| TripAdvisor | 6869 |
| Odysseus | 147 |

| Before/After | # of Venues |
|---|---|
| Total PoIs before homogenization | 17,421 |
| Retained PoIs after homogenization | 7445 |

| Source(s) | Overlap |
|---|---|
| All three sources | 0.04% |
| Facebook + TripAdvisor | 72.23% |
| Facebook + Odysseus | 0% |
| TripAdvisor + Odysseus | 0.09% |
| Facebook (exclusively) | 10.64% |
| TripAdvisor (exclusively) | 15.96% |
| Odysseus (exclusively) | 1.04% |

| Source | Name_gr | Name_en | Categories | Address | Phone | Site | Coordinates | Days_Hours_Open | |
|---|---|---|---|---|---|---|---|---|---|
| Facebook | 51.85% | 48.15% | 100% | 94.11% a | 67.54% | 39.96% | 92.53% | 36.04% | 18.31% |
| TripAdvisor | 50.59% | 49.41% | 93.13% | 99.52% | 91.63% | - | - | - | - |
| Odysseus | 100% | 100% | 100% | Sporadic appearances, not structured | 32.78% | - | 100% | 67.26% | 28.92% |

| NER Model | Precision | Recall | F-Measure |
|---|---|---|---|
| SpaCy_base_NER_(SM) | 0.36 | 0.32 | 0.33 |
| SpaCy_base_NER_(MD) | 0.50 | 0.38 | 0.42 |
| SpaCy_base_NER_(LG) | 0.61 | 0.54 | 0.56 |
| SpaCy_base_NER_(TRF) | 0.64 | 0.58 | 0.60 |
| Enhanced_NER_(SM) | 0.86 | 0.84 | 0.84 |
| Enhanced_NER_(MD) | 0.89 | 0.88 | 0.87 |
| Enhanced_NER_(LG) | 0.86 | 0.86 | 0.85 |
| Enhanced_NER_(TRF) | 0.90 | 0.88 | 0.89 |

| NER Model | Correctly Identified | Falsely Identified | Not Identified |
|---|---|---|---|
| SpaCy_base_NER_(SM) | 38.82% | 47.65% | 20.59% |
| SpaCy_base_NER_(MD) | 45.63% | 25.63% | 33.13% |
| SpaCy_base_NER_(LG) | 59.17% | 28.99% | 16.57% |
| SpaCy_base_NER_(TRF) | 60.12% | 20.83% | 21.43% |
| Enhanced_NER_(SM) | 84.12% | 16.47% | 10.00% |
| Enhanced_NER_(MD) | 87.5% | 12.5% | 7.71% |
| Enhanced_NER_(LG) | 88.02% | 14.97% | 5.39% |
| Enhanced_NER_(TRF) | 87.43% | 7.78% | 10.78% |

| Geocoding Radius Threshold | # of Routes Extracted |
|---|---|
| 0.05 | 263 |
| 0.1 | 287 |
| 0.15 | 297 |
| 0.2 | 304 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Deligiannis, K.; Tryfonopoulos, C.; Raftopoulou, P.; Vassilakis, C.; Kaffes, V.; Skiadopoulos, S. A Machine Learning Framework for Harvesting and Harmonizing Cultural and Touristic Data. Information 2025, 16, 1038. https://doi.org/10.3390/info16121038

