A Methodology for Knowledge Discovery in Labeled and Heterogeneous Graphs
Abstract
1. Introduction
- A new, graph-specific methodology named KDG (Knowledge Discovery in Graphs) that guides users in finding insights from data represented as graphs.
- Three use cases that apply the proposed methodology.
2. Related Work
2.1. Frameworks and Methodologies
2.2. Tools and Algorithms for Graph Analysis
3. Related Concepts
3.1. Graphs
3.2. Graph Structure
3.3. Graph Visualization
4. Methodology Proposal
4.1. Stage 1: Understanding the Analytical Process
4.2. Stage 2: Graph Building
4.3. Stage 3: Graph Mining
4.4. Stage 4: Graph Transformation
4.5. Stage 5: Graph Visualization
4.6. Stage 6: Evaluation
5. Case Studies
5.1. Product Recommendation
5.2. Airports
5.3. Path Recommendation
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| SQL | Structured Query Language |
| NoSQL | Not Only Structured Query Language |
| CRISP-DM | Cross-Industry Standard Process for Data Mining |
| BIM | Business Intelligence Model |
| MLOps | Machine Learning Operations |
| OSM | OpenStreetMap |
| CSV | Comma-Separated Values |
| GDB | Graph Database |
| RDB | Relational Database |
| SMART | Specific, Measurable, Achievable, Relevant, and Time-bound |
| KDD | Knowledge Discovery in Databases |
| SEMMA | Sample, Explore, Modify, Model, and Assess |
| DST | Decision Support Tool |
| EDA | Exploratory Data Analysis |
| PDA | Predictive Data Analysis |
| ASUM | Analytics Solutions Unified Method |
| CASP-DM | Context-Aware Standard Process for Data Mining |
| FMDS | Foundational Methodology for Data Science |
| TDSP | Team Data Science Process |
| DST | Data Science Trajectory |
| KDG | Knowledge Discovery in Graphs |
References
1. Fernandes, D.; Bernardino, J. Graph Databases Comparison: AllegroGraph, ArangoDB, InfiniteGraph, Neo4J, and OrientDB. In Proceedings of the 7th International Conference on Data Science, Technology and Applications (DATA 2018), Volterra, Italy, 13–16 September 2018; pp. 373–380.
2. Lysenko, A.; Roznovăţ, I.A.; Saqi, M.; Mazein, A.; Rawlings, C.J.; Auffray, C. Representing and querying disease networks using graph databases. BioData Min. 2016, 9, 1–19.
3. Doğan, B. The Importance of Graph Databases in Detection of Organized Financial Crimes. In The Impact of Artificial Intelligence on Governance, Economics and Finance; Springer: Berlin/Heidelberg, Germany, 2022; Volume 2, pp. 147–155.
4. Czerepicki, A. Application of graph databases for transport purposes. Bull. Pol. Acad. Sci. Tech. Sci. 2016, 64, 457–466.
5. Sayeb, Y.; Jebri, M.; Ghezala, H.B. A graph based recommender system for managing COVID-19 Crisis. Procedia Comput. Sci. 2022, 196, 348–355.
6. Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 82–88.
7. Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1, pp. 29–39.
8. Sarma, K.S. Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications; SAS Institute: Cary, NC, USA, 2017.
9. Chakrabarti, D. Graph Mining. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; pp. 469–471.
10. IBM Analytics Solutions Unified Method (ASUM). Available online: http://gforge.icesi.edu.co/ASUM-DM_External/index.htm#cognos.external.asum-DM_Teaser/deliveryprocesses/ASUM-DM_8A5C87D5.html_desc.html?proc=_0eKIHlt6EeW_y7k3h2HTng&path=_0eKIHlt6EeW_y7k3h2HTng (accessed on 14 September 2023).
11. Martínez-Plumed, F.; Ochando, L.C.; Ferri, C.; Flach, P.A.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramírez-Quintana, M.J. CASP-DM: Context Aware Standard Process for Data Mining. CoRR 2017, arXiv:1709.09003.
12. Foundational Methodology for Data Science. Available online: https://www.ibm.com/downloads/cas/WKK9DX51 (accessed on 14 September 2023).
13. Team Data Science Process. Available online: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process (accessed on 14 September 2023).
14. Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramirez-Quintana, M.J.; Flach, P. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Trans. Knowl. Data Eng. 2019, 33, 3048–3061.
15. Studer, S.; Bui, T.B.; Drescher, C.; Hanuschkin, A.; Winkler, L.; Peters, S.; Müller, K.R. Towards CRISP-ML (Q): A machine learning process model with quality assurance methodology. Mach. Learn. Knowl. Extr. 2021, 3, 392–413.
16. Horkoff, J.; Barone, D.; Jiang, L.; Yu, E.; Amyot, D.; Borgida, A.; Mylopoulos, J. Strategic business modeling: Representation and reasoning. Softw. Syst. Model. 2014, 13, 1015–1041.
17. Kumar, D.; Showrov, M.I.H. A data mining framework for social graph generation and analysis. In Proceedings of the 2nd International Conference on Innovation in Engineering and Technology (ICIET), Harbin, China, 20–22 January 2019; pp. 1–6.
18. Pienta, R.; Hohman, F.; Endert, A.; Tamersoy, A.; Roundy, K.; Gates, C.; Navathe, S.; Chau, D.H. VIGOR: Interactive visual exploration of graph query results. IEEE Trans. Vis. Comput. Graph. 2017, 24, 215–225.
19. Bok, K.; Yoo, S.; Choi, D.; Lim, J.; Yoo, J. In-Memory Caching for Enhancing Subgraph Accessibility. Appl. Sci. 2020, 10, 5507.
20. Chen, C.; Yan, X.; Zhu, F.; Han, J.; Philip, S.Y. Graph OLAP: A multi-dimensional framework for graph data analysis. Knowl. Inf. Syst. 2009, 21, 41–63.
21. Mcgee, F.; Ghoniem, M.; Melançon, G.; Otjacques, B.; Pinaud, B. The state of the art in multilayer network visualization. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2019; Volume 38, pp. 125–149.
22. Nararatwong, R.; Kertkeidkachorn, N.; Ichise, R. Knowledge graph visualization: Challenges, framework, and implementation. In Proceedings of the IEEE 3rd International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Laguna Hills, CA, USA, 9–11 December 2020; pp. 174–178.
23. Shchur, O.; Mumme, M.; Bojchevski, A.; Günnemann, S. Pitfalls of graph neural network evaluation. arXiv 2018, arXiv:1811.05868.
24. Alshahrani, M.; Thafar, M.A.; Essack, M. Application and evaluation of knowledge graph embeddings in biomedical data. PeerJ Comput. Sci. 2021, 7, e341.
25. Shrivastava, S.; Pal, S.N. Graph mining framework for finding and visualizing substructures using graph database. In Proceedings of the International Conference on Advances in Social Network Analysis and Mining, Athens, Greece, 20–22 July 2009; pp. 379–380.
26. Nasiri, A.; Nalchigar, S.; Yu, E.; Ahmed, W.; Wrembel, R.; Zimanyi, E. From indicators to predictive analytics: A conceptual modelling framework. In Proceedings of the IFIP Working Conference on The Practice of Enterprise Modeling, Leuven, Belgium, 22–24 November 2017; pp. 171–186.
27. Yu, E. Modeling Strategic Relationships for Process Reengineering. Soc. Model. Requir. Eng. 2011, 11, 66–87.
28. Schroeder, D.T.; Pogorelov, K.; Langguth, J. Fact: A framework for analysis and capture of twitter graphs. In Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; pp. 134–141.
29. Qiao, F.; Zhang, X.; Li, P.; Ding, Z.; Jia, S.; Wang, H. A parallel approach for frequent subgraph mining in a single large graph using spark. Appl. Sci. 2018, 8, 230.
30. Zhang, J.; Li, T.; Jiang, Z.; Hu, X.; Jazayeri, A. A Noval Weighted Meta Graph Method for Classification in Heterogeneous Information Networks. Appl. Sci. 2020, 10, 1603.
31. Lee, K.; Jung, H.; Hong, J.S.; Kim, W. Learning Knowledge Using Frequent Subgraph Mining from Ontology Graph Data. Appl. Sci. 2021, 11, 932.
32. Dunne, C.; Shneiderman, B. Motif simplification: Improving network visualization readability with fan, connector, and clique glyphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Paris, France, 27 April–2 May 2013; pp. 3247–3256.
33. West, D.B. Introduction to Graph Theory; Prentice Hall: Upper Saddle River, NJ, USA, 2001; Volume 2.
34. Robinson, I.; Webber, J.; Eifrem, E. Graph Databases: New Opportunities for Connected Data; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
35. MacLeod, L. Making SMART goals smarter. Physician Exec. 2012, 38, 68.
36. ISO/IEC/IEEE 29148:2018(E); ISO/IEC/IEEE International Standard-Systems and Software Engineering–Life Cycle Processes–Requirements Engineering. IEEE: Piscataway, NJ, USA, 2018; pp. 1–104.
37. Lovett, J. Social Media Metrics Secrets; John Wiley & Sons: Hoboken, NJ, USA, 2011.
38. Pendleton, M.; Garcia-Lebron, R.; Cho, J.H.; Xu, S. A survey on systems security metrics. ACM Comput. Surv. (CSUR) 2016, 49, 1–35.
39. Reich, B.H.; Wee, S.Y. Searching for Knowledge in the PMBOK® Guide. Proj. Manag. J. 2006, 37, 11–26.
40. Hammond, J.S.; Keeney, R.L.; Raiffa, H. Smart Choices: A Practical Guide to Making Better Decisions; Harvard Business Review Press: Brighton, MA, USA, 2015.
41. Bowell, T.; Kemp, G. Critical Thinking: A Concise Guide; Routledge: London, UK, 2014.
42. Kojima, R.; Legaspi, R.; Wada, S. Trip Destination Prediction by Cross-City Exploratory Data Analysis Approach in People Flow Data. In Proceedings of the IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 6547–6552.
43. Fuentes, A. Become a Python Data Analyst: Perform Exploratory Data Analysis and Gain Insight into Scientific Computing Using Python; Packt Publishing Ltd.: Birmingham, UK, 2018.
44. Uzhga-Rebrov, O.; Grabusts, P. Comparative Evaluation of Four Methods for Exploratory Data Analysis. In Proceedings of the 2021 62nd International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), Riga, Latvia, 14–15 October 2021; pp. 1–5.
45. Mostajabi, F.; Safaei, A.A.; Sahafi, A. A Systematic Review of Data Models for the Big Data Problem. IEEE Access 2021, 9, 128889–128904.
46. Lal, M. Neo4j Graph Data Modeling; Packt Publishing Ltd.: Birmingham, UK, 2015.
47. Ortega, V.; Ruiz, L.; Gutierrez, L.; Cervantes, F. A selection process of graph databases based on business requirements. In Proceedings of the International Conference on Software Process Improvement, Leon, Mexico, 23–25 October 2019; pp. 80–90.
48. Bansal, S.K.; Kagemann, S. Integrating Big Data: A Semantic Extract-Transform-Load Framework. Computer 2015, 48, 42–50.
49. Maria Carina, R. Learning Pentaho Data Integration 8 CE-Third Edition: Get Up and Running with the Pentaho Data Integration Tool Using This Hands-On, Easy-to-Read Guide; Packt Publishing: Birmingham, UK, 2017.
50. Tirthajyoti, S.; Shubhadeep, R. Data Wrangling with Python: Creating Actionable Data From Raw Sources; Packt Publishing: Birmingham, UK, 2019.
51. Koutra, D.; Faloutsos, C. Individual and Collective Graph Mining: Principles, Algorithms, and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022.
52. Needham, M.; Hodler, A.E. Graph Algorithms: Practical Examples in Apache Spark and Neo4j; O’Reilly Media: Sebastopol, CA, USA, 2019.
53. Chintalapudi, S.R.; Prasad, M.H.M.K. A survey on community detection algorithms in large scale real world networks. In Proceedings of the 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015; pp. 1323–1327.
54. Buttler, D. A Short Survey of Document Structure Similarity Algorithms; Technical Report; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2004.
55. Lawande, S.R.; Jasmine, G.; Anbarasi, J.; Izhar, L.I. A Systematic Review and Analysis of Intelligence-Based Pathfinding Algorithms in the Field of Video Games. Appl. Sci. 2022, 12, 5499.
56. Barabási, A.L. Network science. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2013, 371, 20120375.
57. Liu, Y.; Safavi, T.; Dighe, A.; Koutra, D. Graph summarization methods and applications: A survey. ACM Comput. Surv. (CSUR) 2018, 51, 1–34.
58. Erciyes, K. Complex Networks: An Algorithmic Perspective; CRC Press: Boca Raton, FL, USA, 2014.
59. Cherven, K. Mastering Gephi Network Visualization; Packt Publishing Ltd.: Birmingham, UK, 2015.
60. Ward, M.O.; Grinstein, G.; Keim, D. Interactive Data Visualization: Foundations, Techniques, and Applications; CRC Press: Boca Raton, FL, USA, 2010.
61. Dileep, S.; Manoj, R.; Adarsh, M.; Harikumar, S. Comparing the Effectiveness of Data Visualization Techniques for Discovering Disease Relationships in a Complex Network Dataset. In Proceedings of the 2023 7th International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 11–13 April 2023; pp. 1486–1492.
62. Wajahat, A.; Nazir, A.; Akhtar, F.; Qureshi, S.; Ullah, F.; Razaque, F.; Shakeel, A. Interactively Visualize and Analyze Social Network Gephi. In Proceedings of the 3rd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), Sukkur, Pakistan, 29–30 January 2020; pp. 1–9.
63. Chaudhary, A.; Jain, N.; Kumar, A. Tools for Social Network Analysis and Mining. In Proceedings of the 11th International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 16–17 December 2022; pp. 1063–1067.
64. Islam, M.; Jin, S. An Overview of Data Visualization. In Proceedings of the International Conference on Information Science and Communications Technologies (ICISCT), Karachi, Pakistan, 9–10 March 2019; pp. 1–7.
65. OpenStreetMap. 2023. Available online: https://www.openstreetmap.org (accessed on 14 September 2023).

| KDG Stages | KDG Tasks | CRISP-DM [7] | DST [14] | KDD [6] |
|---|---|---|---|---|
| 1. Understanding the Analytical Process | 1.1 Identify business goals | Business Understanding | Goal Exploration | Developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the KDD process from the customer’s viewpoint. |
| | 1.2 Identify information requirements | Business Understanding | Business Understanding | - |
| | 1.3 Identify successful metrics | Business Understanding | Business Understanding | - |
| | 1.4 Identify the project plan | Business Understanding | Business Understanding | - |
| | 1.5 Define questions for decision making | Business Understanding | Data value exploration | - |
| 2. Graph Building | 2.1 Exploratory data analysis | Data Understanding | Data source exploration | Creating a target dataset |
| | 2.2 Design the graph model | - | - | - |
| | 2.3 Select the data storage format | Data Preparation | Data Preparation | Creating a target dataset |
| | 2.4 Extract, transform, and load data | Data Preparation | Data Preparation | Data cleaning and preprocessing. Data reduction and projection |
| 3. Graph Mining | 3.1 Select graph mining algorithms | Modeling * | Modeling * | Matching the goals of the KDD process to a particular data mining method. Choosing the data mining algorithms. Data mining * |
| | 3.2 Apply centrality algorithms | - | - | - |
| | 3.3 Apply community algorithms | - | - | - |
| | 3.4 Apply similarity algorithms | - | - | - |
| | 3.5 Apply pathfinding algorithms | - | - | - |
| 4. Graph Transformation | 4.1 Modify topology | - | - | - |
| | 4.2 Include structural attributes | - | - | - |
| | 4.3 Apply summarization | - | - | - |
| | 4.4 Generate subgraphs | - | - | - |
| 5. Graph Visualization | 5.1 Apply a layout | - | - | - |
| | 5.2 Perform drill-down or roll-up operations | - | - | - |
| | 5.3 Define graph display rules | - | - | - |
| | 5.4 Highlight structural and attribute patterns | - | - | - |
| | 5.5 Use the timeline | - | - | - |
| | 5.6 Represent the graph results | - | Narrative exploration | Interpreting mined patterns |
| 6. Evaluation | 6.1 Evaluate results | Evaluation | Result exploration | Consolidating discovered knowledge |
| | 6.2 Documentation of results | Evaluation | Product exploration | Consolidating discovered knowledge |
| | 6.3 Determine the next steps | Evaluation | Evaluation | - |
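
To make the graph-specific tasks in the table above more concrete, the following is a minimal sketch, not the authors' implementation, of tasks 2.4 (load), 3.2 (centrality), 3.4 (similarity), and 3.5 (pathfinding) on a toy labeled, heterogeneous graph. The customer and product data, the function names, and the choice of degree centrality, Jaccard similarity, and breadth-first search are all assumptions made for illustration; it uses only the Python standard library, and community detection (task 3.3) is omitted for brevity.

```python
from collections import defaultdict, deque

# Hypothetical toy data: (source, source_label, relation, target, target_label).
EDGES = [
    ("alice", "Customer", "BOUGHT", "laptop", "Product"),
    ("alice", "Customer", "BOUGHT", "mouse", "Product"),
    ("bob", "Customer", "BOUGHT", "mouse", "Product"),
    ("bob", "Customer", "BOUGHT", "monitor", "Product"),
    ("carol", "Customer", "BOUGHT", "monitor", "Product"),
]


def build_graph(edges):
    """Task 2.4 (load): return node labels and an undirected adjacency map."""
    labels, adjacency = {}, defaultdict(set)
    for source, source_label, _relation, target, target_label in edges:
        labels[source], labels[target] = source_label, target_label
        adjacency[source].add(target)
        adjacency[target].add(source)
    return labels, dict(adjacency)


def degree_centrality(adjacency):
    """Task 3.2: degree of each node divided by the maximum possible degree."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def jaccard(adjacency, a, b):
    """Task 3.4: overlap between the neighborhoods of two nodes."""
    union = adjacency[a] | adjacency[b]
    return len(adjacency[a] & adjacency[b]) / len(union) if union else 0.0


def shortest_path(adjacency, start, goal):
    """Task 3.5: breadth-first search for an unweighted shortest path."""
    previous, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start node
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):  # sorted for determinism
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # goal is unreachable from start


labels, adjacency = build_graph(EDGES)
centrality = degree_centrality(adjacency)
print(centrality["alice"])
print(jaccard(adjacency, "alice", "bob"))
print(shortest_path(adjacency, "alice", "carol"))
```

In this sketch the node labels (`Customer`, `Product`) are kept separate from the adjacency map, so the same mining functions work regardless of how many node types the graph contains. A graph database or a dedicated graph library would normally replace these hand-written helpers.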
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ortega-Guzmán, V.H.; Gutiérrez-Preciado, L.; Cervantes, F.; Alcaraz-Mejia, M. A Methodology for Knowledge Discovery in Labeled and Heterogeneous Graphs. Appl. Sci. 2024, 14, 838. https://doi.org/10.3390/app14020838