An FDA-Based Approach for Clustering Elicited Expert Knowledge
Abstract
1. Introduction
2. Definitions
2.1. Elicitation of Probability Distributions
2.2. Functional Data Analysis
Functional Clustering
2.3. Proposed Method
1. For each elicited curve, discretize the curve on a grid of m equidistant points. These points are the heights of the curve at m equally spaced locations across the support of the function. In this paper, we use 200 and 300 points; the number chosen depends on computational capacity. To apply this methodology, we recommend spreading the points across the whole support of the distributions of interest so that they capture the shape of the curves. We have found that increasing the number of points has no noticeable effect on the outcome of the algorithm.
2. Compute the Hellinger distance between every pair of these functions.
3. Build a matrix of distances between all curves using the proposed metric.
4. Use the hierarchical clustering function hclust in R with this matrix as the distance matrix.
5. Obtain the corresponding dendrogram.
6. Specify the number of clusters and identify the members of each cluster.
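The six steps can be sketched in Python as follows. This is a minimal illustration and not the authors' R code: SciPy's `linkage` stands in for `hclust`, the complete linkage is assumed because it is the default of `hclust` (the text does not say which linkage is used), and the six normal curves are hypothetical. The distance uses the common normalization H(f, g) = sqrt(0.5 * integral of (sqrt(f) - sqrt(g))^2), which is also an assumption, since the formula itself did not survive in this copy of the text.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.stats import norm


def hellinger_matrix(curves, dx):
    """Pairwise Hellinger distances between densities sampled on a common grid."""
    roots = np.sqrt(curves)                        # shape (n_curves, m)
    diff = roots[:, None, :] - roots[None, :, :]   # all pairwise differences
    # Riemann-sum approximation of 0.5 * integral (sqrt f - sqrt g)^2 dx
    squared = 0.5 * np.sum(diff ** 2, axis=2) * dx
    return np.sqrt(squared)


# Step 1: discretize each curve on m equidistant points of a shared support.
m = 200
grid = np.linspace(-10.0, 20.0, m)
dx = grid[1] - grid[0]
means = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]            # hypothetical expert curves
curves = np.array([norm.pdf(grid, loc=mu, scale=1.0) for mu in means])

# Steps 2-3: Hellinger distance for every pair, collected in a matrix.
D = hellinger_matrix(curves, dx)

# Steps 4-5: agglomerative clustering on the precomputed distances.
# "complete" is assumed here because it is the default linkage of R's hclust.
Z = linkage(squareform(D, checks=False), method="complete")

# Step 6: cut the tree at a chosen number of clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree of step 5.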
3. Simulation Study
- a: the number of pairs of elements in S that are in the same cluster in X and in the same cluster in Y.
- b: the number of pairs of elements in S that are in different clusters in X and in different clusters in Y.
- c: the number of pairs of elements in S that are in the same cluster in X but in different clusters in Y.
- d: the number of pairs of elements in S that are in different clusters in X but in the same cluster in Y.
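As an illustration of how these four counts are used, the sketch below computes them for two label vectors and derives the Rand, Jaccard, and Fowlkes–Mallows indices from them. The formulas are the standard textbook definitions and are assumed to match the ones used in the paper; the function names and the example partitions are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(x, y):
    """Count the pairs a, b, c, d for two label vectors over the same elements."""
    a = b = c = d = 0
    for i, j in combinations(range(len(x)), 2):
        same_x = x[i] == x[j]
        same_y = y[i] == y[j]
        if same_x and same_y:
            a += 1          # together in X and in Y
        elif not same_x and not same_y:
            b += 1          # apart in X and in Y
        elif same_x:
            c += 1          # together in X, apart in Y
        else:
            d += 1          # apart in X, together in Y
    return a, b, c, d


def agreement_indices(x, y):
    """Rand, Jaccard, and Fowlkes-Mallows indices (standard definitions).

    Assumes at least one pair is co-clustered in each partition, so that
    no denominator is zero.
    """
    a, b, c, d = pair_counts(x, y)
    rand = (a + b) / (a + b + c + d)
    jaccard = a / (a + c + d)
    fowlkes_mallows = a / sqrt((a + c) * (a + d))
    return rand, jaccard, fowlkes_mallows


x = [1, 1, 1, 2, 2, 2]   # hypothetical partition X
y = [1, 1, 2, 2, 2, 2]   # hypothetical partition Y
print(pair_counts(x, y))        # (4, 6, 2, 3)
print(agreement_indices(x, y))
```

All three indices equal 1 when the two partitions coincide, which makes them convenient for comparing a clustering result against the known grouping in a simulation.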
Results
4. Illustration
Computer’s Lifetime Elicitation
- The functional mean is obtained in each cluster.
- 1000 observations are generated from the distribution that is proportional to the functional mean in each cluster.
- A gamma distribution is fitted to the generated samples using the fitdistr function.
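The three steps above can be sketched for a single cluster as follows. This is an illustrative sketch under stated assumptions: the three gamma-shaped curves are hypothetical, sampling is done by inverting the discretized CDF (the text does not say how the observations are generated), and `scipy.stats.gamma.fit` with the location fixed at zero stands in for R's `fitdistr`.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)

# Hypothetical cluster: three gamma-shaped curves on a common grid.
grid = np.linspace(0.01, 40.0, 2000)
curves = np.array([gamma.pdf(grid, a=4.0, scale=s) for s in (1.8, 2.0, 2.2)])

# Functional mean: pointwise average of the curves in the cluster.
mean_curve = curves.mean(axis=0)

# Normalize so the mean curve integrates to one on the grid.
dx = grid[1] - grid[0]
density = mean_curve / (mean_curve.sum() * dx)

# Draw 1000 observations by inverting the discretized CDF.
cdf = np.cumsum(density) * dx
cdf /= cdf[-1]
samples = np.interp(rng.uniform(size=1000), cdf, grid)

# Maximum-likelihood gamma fit with the location fixed at zero.
shape, loc, scale = gamma.fit(samples, floc=0)
print(shape, scale)
```

Fixing the location at zero keeps the fit to the two-parameter gamma family, which is the parameterization that `fitdistr` estimates for a gamma distribution.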
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CCR | Correct classification rate |
FC | Functional clustering |
FDA | Functional data analysis |
F.M | Fowlkes–Mallows |
ICT | Information and communications technology |
J | Jaccard coefficient |
R | Rand index |
Appendix A
Appendix A.1. Weather Station Density Classification
Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | |
---|---|---|---|---|---|
Locations | St. Johns | Fredericton | Scheffervll | Churchill | Vancouver |
Halifax | Quebec | Arvida | Uranium Cty | Victoria | |
Sydney | Sherbrooke | Bagottville | Dawson | Pr. Rupert | |
Yarmouth | Montreal | Thunderbay | Yellowknife | ||
Charlottvl | Ottawa | Winnipeg | Iqaluit | ||
Toronto | Calgary | The Pas | Inuvik | ||
London | Pr. George | Regina | Resolute | ||
Kamloops | Pr. Albert | ||||
Edmonton | |||||
Whitehorse |
Appendix Students Data Set
References
1. Brown, B. Delphi Process: A Methodology Used for the Elicitation of Opinions of Experts; Document No: P-3925; RAND: Santa Monica, CA, USA, 1968; 15p.
2. Barrera-Causil, C.J.; Correa, J.C.; Marmolejo-Ramos, F. Experimental Investigation on the Elicitation of Subjective Distributions. Front. Psychol. 2019, 10, 862.
3. Nielsen, F.; Nock, R.; Amari, S.I. On Clustering Histograms with k-Means by Using Mixed α-Divergences. Entropy 2014, 16, 3273–3301.
4. Henderson, K.; Gallagher, B.; Eliassi-Rad, T. EP-MEANS: An Efficient Nonparametric Clustering of Empirical Probability Distributions. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain, 13–17 April 2015; ACM: New York, NY, USA, 2015.
5. Wang, J.L.; Chiou, J.M.; Müller, H.G. Review of functional data analysis. Annu. Rev. Stat. Appl. 2016, 3, 257–295.
6. Ferreira, L.; Hitchcock, D. A comparison of hierarchical methods for clustering functional data. Commun. Stat. Simul. Comput. 2009, 38, 1925–1949.
7. Abraham, C.; Cornillon, P.A.; Matzner-Lober, E.; Molinari, N. Unsupervised Curve Clustering Using B-Splines. Scand. J. Stat. 2003, 30, 581–595.
8. Gareth, M.; Sugar, C. Clustering for Sparsely Sampled Functional Data. J. Am. Stat. Assoc. 2003, 98, 397–408.
9. Serban, N.; Wasserman, L. CATS: Clustering after Transformation and Smoothing. J. Am. Stat. Assoc. 2005, 100, 990–999.
10. Shubhankar, R.; Bani, M. Functional Clustering by Bayesian Wavelet Methods. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2006, 68, 305–332.
11. Song, J.; Lee, H.; Jeffrey, S.M.; Kang, S. Clustering of time-course gene expression data using functional data analysis. Comput. Biol. Chem. 2007, 31, 265–274.
12. Chiou, J.M.; Li, P.L. Functional Clustering and Identifying Substructures of Longitudinal Data. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2007, 69, 679–699.
13. Tarpey, T. Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves. Am. Stat. 2007, 61, 34–40.
14. Goia, A.; May, C.; Fusai, G. Functional clustering and linear regression for peak load forecasting. Int. J. Forecast. 2010, 26, 700–711.
15. Hébrail, G.; Hugueney, B.; Lechevallier, Y.; Rossi, F. Exploratory analysis of functional data via clustering and optimal segmentation. Neurocomputing 2010, 73, 1125–1141.
16. Boullé, M. Functional data clustering via piecewise constant nonparametric density estimation. Pattern Recognit. 2012, 45, 4389–4401.
17. Secchi, P.; Vantini, S.; Vitelli, V. Bagging Voronoi classifiers for clustering spatial functional data. Int. J. Appl. Earth Obs. Geoinf. 2012, 22, 53–64.
18. Jacques, J.; Preda, C. Model-based clustering for multivariate functional data. Comput. Stat. Data Anal. 2013, 71, 92–106.
19. Jacques, J.; Preda, C. Functional data clustering: A survey. Adv. Data Anal. Classif. 2014, 8, 231–255.
20. Stefan, A.; Katsimpokis, D.; Gronau, Q.F.; Wagenmakers, E. Expert agreement in prior elicitation and its effects on Bayesian inference. PsyArXiv 2021.
21. Simpson, D.G. Minimum Hellinger Distance Estimation for the Analysis of Count Data. J. Am. Stat. Assoc. 1987, 82, 802–807.
22. Kahraman, C.; Onar, S.C.; Oztaysi, B. Fuzzy Multicriteria Decision-Making: A Literature Review. Int. J. Comput. Intell. Syst. 2015, 8, 637–666.
23. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
24. Febrero-Bande, M.; Oviedo, M. Statistical Computing in Functional Data Analysis: The R Package fda.usc. J. Stat. Softw. 2012, 51, 1–28.
25. Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M.; Hornik, K. Cluster: Cluster Analysis Basics and Extensions; R Package Version 2.0.3; R Foundation: Vienna, Austria, 2015.
26. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218.
27. Morlini, I.; Zani, S. Dissimilarity and similarity measures for comparing dendrograms and their applications. Adv. Data Anal. Classif. 2012, 6, 85–105.
28. Steegen, S.; Tuerlinckx, F.; Gelman, A.; Vanpaemel, W. Increasing Transparency Through a Multiverse Analysis. Perspect. Psychol. Sci. 2016, 5, 702–712.
29. Ramsay, J.; Hooker, G.; Graves, S. Functional Data Analysis with R and MATLAB; Springer: Berlin/Heidelberg, Germany, 2009.
30. Kaufman, L.; Rousseeuw, P. Finding Groups in Data: An Introduction to Cluster Analysis, 1st ed.; Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 1990.
31. Nieweglowski, L. clv: Cluster Validation Techniques; R Package Version 0.3-2.1; R Foundation: Vienna, Austria, 2013.
32. Barrera, C.; Correa, J. Distribución Predictiva Bayesiana Para Modelos de Pruebas de Vida vía MCMC. Ph.D. Thesis, Universidad Nacional de Colombia Sede Medellín, Medellín, Colombia, 2008.
33. Shanker, R.; Shukla, K.; Shanker, R.; Leonida, T. On modeling of lifetime data using two-parameter Gamma and Weibull distributions. Biom. Biostat. Int. J. 2016, 4, 201–206.
34. Rigby, R.A.; Stasinopoulos, M.D. Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2005, 54, 507–554.
35. VerMilyea, M.; Hall, J.; Diakiw, S.; Johnston, A.; Nguyen, T.; Perugini, D.; Miller, A.; Picou, A.; Murphy, A.; Perugini, M. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Hum. Reprod. 2020, 4, 770–784.
36. Arpaci, I.; Alshehabi, S.; Al-Emran, M.; Khasawneh, M.; Mahariq, I.; Abdeljawad, T.; Hassanien, A. Analysis of Twitter data using evolutionary clustering during the COVID-19 Pandemic. Comput. Mater. Contin. 2020, 1, 193–204.
37. Sinha, A.; Zhao, H. Incorporating domain knowledge into data mining classifiers: An application in indirect lending. Decis. Support Syst. 2008, 46, 287–299.
38. Micallef, L.; Sundin, I.; Marttinen, P.; Ammad-ud-din, M.; Peltola, T.; Soare, M.; Jacucci, G.; Kaski, S. Interactive elicitation of knowledge on feature relevance improves predictions in small data sets. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, 13–16 March 2017; ACM Press: New York, NY, USA, 2017.
39. Daee, P.; Peltola, T.; Soare, M.; Kaski, S. Knowledge elicitation via sequential probabilistic inference for high-dimensional prediction. Mach. Learn. 2017, 1599–1620.
40. Groznik, V.; Guid, M.; Sadikov, A.; Možina, M.; Georgiev, D.; Kragelj, V.; Ribarič, S.; Pirtošek, Z.; Bratko, I. Elicitation of neurological knowledge with argument-based machine learning. Artif. Intell. Med. 2013, 106, 133–144.
41. Ramsay, J.; Silverman, B. Functional Data Analysis, 2nd ed.; Springer: New York, NY, USA, 2005.
42. Giraldo, R. Geostatistical Analysis of Functional Data. Ph.D. Thesis, Universitat Politécnica de Catalunya, Catalunya, Spain, 2009.
43. Giraldo, R.; Delicado, P.; Mateu, J. Hierarchical clustering of spatially correlated functional data. Stat. Neerl. 2012, 66, 403–421.
44. Cohen-Addad, V.; Kanade, V.; Mallmann-Trenn, F.; Mathieu, C. Hierarchical Clustering: Objective Functions and Algorithms. J. ACM 2019, 66, 1–42.
45. Stone, C. Online learning in Australian higher education: Opportunities, challenges and transformations. Stud. Success 2019, 10, 1–11.
46. Devlin, M. The Typical University Student is no Longer 18, Middle-Class and on Campus. We Need to Change Thinking on “Drop-Outs”. 2017. Available online: http://theconversation.com/the-typical-university-student-is-no-longer-18-middle-class-and-on-campus-we-need-to-change-thinking-on-drop-outs-73509 (accessed on 22 November 2019).
Method | Description |
---|---|
Functional k-means (kmeans.fd) | It is an extension of the traditional k-means clustering algorithm for functional data analysis. This method uses a special metric for functional data. |
Agglomerative hierarchical clustering, Agnes (average method) | It computes agglomerative hierarchical clustering of the data set using the average method, where the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. A complete description of agglomerative nesting (agnes) can be found in chapter 5 of Kaufman and Rousseeuw (1990) [30]. |
Agglomerative hierarchical clustering, Agnes (Ward’s method) | It computes agglomerative hierarchical clustering of the data set using Ward’s method, where the agglomerative criterion is based on the optimal value of an objective function, which is usually the sum of squared errors. |
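The two agglomerative criteria in the table can be compared on discretized curves with a short sketch. This is only an illustration: SciPy's `linkage` is used as a stand-in for `agnes` from the R package cluster, the Euclidean distance between discretized curves is an assumption, the six normal curves are hypothetical, and functional k-means is not covered.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import norm

# Hypothetical curves: two well-separated groups of normal densities.
grid = np.linspace(-5.0, 15.0, 200)
means = [0.0, 0.3, 0.6, 9.0, 9.3, 9.6]
curves = np.array([norm.pdf(grid, loc=mu, scale=1.0) for mu in means])

results = {}
for method in ("average", "ward"):
    # Each row of `curves` is treated as one observation vector.
    Z = linkage(curves, method=method, metric="euclidean")
    results[method] = fcluster(Z, t=2, criterion="maxclust")
    print(method, results[method])
```

With groups this far apart both criteria recover the same two clusters; the criteria are more likely to disagree when the curves overlap.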
Scenarios | Overlapping Average, 2 Clusters | Overlapping Average, 3 Clusters |
---|---|---|
Overlapped Normal curves | 0.953 | 0.961 |
Overlapped Gamma curves | 0.974 | 0.969 |
Overlapped Beta curves | 0.955 | 0.959 |
Separated Normal curves | 0.052 | 0.066 |
Separated Gamma curves | 0.078 | 0.071 |
Separated Beta curves | 0.064 | 0.067 |
Expert | E1 | E2 | E3 | E4 | E5 | E6 |
---|---|---|---|---|---|---|
Expertise | 2.5 | 5 | 17 | 17 | 4 | 7 |
Current Job | 0.5 | 1.3 | 4 | 2 | 1 | 2.5 |
14.07 | 17.80 | 19.43 | 21.33 | 24.60 | 28.97 | 29.63 | 33.73 | 37.60 | 37.67 | 40.87 | 52.40 |
53.97 | 60.57 | 64.27 | 65.43 | 65.43 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Barrera-Causil, C.; Correa, J.C.; Zamecnik, A.; Torres-Avilés, F.; Marmolejo-Ramos, F. An FDA-Based Approach for Clustering Elicited Expert Knowledge. Stats 2021, 4, 184-204. https://doi.org/10.3390/stats4010014