EVCA Classifier: A MCMC-Based Classifier for Analyzing High-Dimensional Big Data
Abstract
1. Introduction
2. Preliminaries
2.1. Bayesian Machine Learning
2.2. Markov Chain Monte Carlo Sampling
2.2.1. Markov Chain
2.2.2. Metropolis–Hastings
- $\pi(x)$ is the target distribution (also known as the posterior distribution in Bayesian machine learning) from which we want to sample.
- $q(x' \mid x)$ is the proposal distribution, which represents the probability of proposing a candidate $x'$ given the current state $x$.
- If $\alpha(x, x') \geq 1$, the proposed value is always accepted.
- Otherwise, if $\alpha(x, x') < 1$, the proposed value is accepted with probability $\alpha(x, x')$ and rejected with probability $1 - \alpha(x, x')$ (see the sketch after this list).
- Bayesian Inference of Posterior Distribution: $\pi(\theta \mid D) = \dfrac{p(D \mid \theta)\, p(\theta)}{p(D)}$
- Detailed Balance Equation: $\pi(x)\, P(x' \mid x) = \pi(x')\, P(x \mid x')$
- Acceptance Ratio: $\alpha(x, x') = \min\!\left(1, \dfrac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$
- Transition Probability: $P(x' \mid x) = q(x' \mid x)\, \alpha(x, x')$ for $x' \neq x$
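As a concrete illustration of these quantities, the following minimal random-walk Metropolis–Hastings sketch samples a toy one-dimensional target; the target density, step size, and seed are illustrative assumptions, not the model used in this work.

```python
import numpy as np

def log_target(x):
    # Toy unnormalized target: a standard normal (an assumption for illustration).
    return -0.5 * x ** 2

def metropolis_hastings(n_samples, x0=0.0, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_prop = x + step * rng.standard_normal()  # symmetric proposal q(x' | x)
        # alpha = min(1, pi(x') / pi(x)); q cancels because the proposal is symmetric.
        log_alpha = log_target(x_prop) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop  # accept the proposed state
        samples[t] = x  # otherwise the chain stays at the current state
    return samples

samples = metropolis_hastings(5000)
```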
2.2.3. Gibbs Sampling
- The joint distribution of the variables of interest, denoted as $p(x_1, x_2, \ldots, x_n)$, represents the complete probabilistic model capturing the dependencies among the variables.
- The marginal distribution encapsulates the probability distribution of a single variable $x_i$, obtained by integrating the joint distribution over all other variables $x_{-i}$: $p(x_i) = \int p(x_1, x_2, \ldots, x_n)\, dx_{-i}$.
- To update the $i$-th variable in the Gibbs sampling process, we sample from its conditional distribution $p(x_i \mid x_{-i})$. This conditional distribution is obtained by rearranging the joint distribution equation and dividing it by the marginal distribution of the remaining variables $x_{-i}$: $p(x_i \mid x_{-i}) = \dfrac{p(x_1, x_2, \ldots, x_n)}{p(x_{-i})}$.
- The updated Gibbs sampling equation states that, at each iteration, we sample a new value for variable $x_i$ from its conditional distribution given the most recent values of all other variables: $x_i^{(t+1)} \sim p\!\left(x_i \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}\right)$. This ensures that the updated sample vector retains the dependencies between variables as defined by the joint distribution.
- Additionally, the calculation of expectations involves integrating a function $f$ with respect to the conditional distribution $p(x_i \mid x_{-i})$, providing a means to estimate various quantities of interest based on the updated sample vector: $\mathbb{E}[f(x_i)] = \int f(x_i)\, p(x_i \mid x_{-i})\, dx_i$.
- By iteratively updating the elements of the sample vector using their respective conditional distributions, Gibbs sampling enables the exploration and approximation of the target distribution, facilitating Bayesian inference and probabilistic modeling tasks: each element is updated sequentially using the corresponding conditional distribution (a minimal sketch follows this list).
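As a minimal sketch (again illustrative, not the paper's model), consider Gibbs sampling on a bivariate normal with correlation $\rho$, where both full conditionals are Gaussian: $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2 \mid x_1$.

```python
import numpy as np

def gibbs_bivariate_normal(n_samples, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1 - rho ** 2)  # standard deviation of each full conditional
    x1 = x2 = 0.0
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Update each coordinate from its full conditional, in turn.
        x1 = rng.normal(rho * x2, cond_sd)  # draw from p(x1 | x2)
        x2 = rng.normal(rho * x1, cond_sd)  # draw from p(x2 | x1)
        out[t] = (x1, x2)
    return out

samples = gibbs_bivariate_normal(5000)
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])  # approx. (0, 0) and rho
```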
2.3. Hamiltonian Monte Carlo and the No-U-Turn Sampler
3. Related Work
Air Quality Index (AQI)
4. Methodology
4.1. Data Selection and Preprocessing
4.2. Bayesian Logistic Regression
4.2.1. Model Definition
4.2.2. MCMC Sampling
4.2.3. Class Prediction on Unseen Data
Prediction for the Sensitivity Analysis
Listing 1: Predictions for Pandas Test Data.

    import numpy as np

    def predict_proba(X, trace):
        # Plug the posterior means of the coefficients and bias into a sigmoid.
        linear = np.dot(X, trace["coeffs"].mean(axis=0)) + trace["bias"].mean()
        proba = 1 / (1 + np.exp(-linear))
        return np.column_stack((1 - proba, proba))

    y_test_pred_proba = predict_proba(X_test, trace)
    y_test_pred = np.argmax(y_test_pred_proba, axis=1)
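Listing 1 plugs the posterior means into a single sigmoid, which discards posterior uncertainty. A minimal sketch of a fully Bayesian alternative, assuming trace["coeffs"] has shape (n_draws, n_features) and trace["bias"] has shape (n_draws,), averages the sigmoid over every posterior draw instead:

```python
import numpy as np

def predict_proba_posterior(X, trace):
    # One linear predictor per posterior draw: shape (n_obs, n_draws).
    linear = X @ trace["coeffs"].T + trace["bias"]
    proba = (1 / (1 + np.exp(-linear))).mean(axis=1)  # posterior predictive mean
    return np.column_stack((1 - proba, proba))
```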
Predictions on the 18-Year Data in Apache Spark
Listing 2: User-defined predict_proba() function in pyspark.

    import math
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    @udf(returnType=ArrayType(DoubleType()))
    def predict_proba_udf(coeffs_list, bias_value, *features):
        # Linear predictor: dot product of the feature row with the coefficients.
        linear = sum([features[i] * coeffs_list[i]
                      for i in range(len(coeffs_list))]) + bias_value
        proba = 1 / (1 + math.exp(-linear))
        return [1 - proba, proba]
Listing 3: Predicted probabilities and classes in pyspark.

    from pyspark.sql.functions import col, lit

    # Attach per-row class probabilities, then threshold P(class 1) into labels.
    spark_df = spark_df.withColumn("proba",
        predict_proba_udf(coeffs_array, lit(bias_value),
                          *[col(c) for c in feature_columns]))
    spark_df = spark_df.withColumn("y_pred",
        (col("proba")[1] >= threshold).cast(DoubleType()))
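For the threshold analysis in Section 5.2, the confusion-matrix counts behind the tables can be computed with a single distributed aggregation. The sketch below assumes a ground-truth column named "label" (the actual column name is not given here) alongside the "y_pred" column from Listing 3:

```python
from pyspark.sql import functions as F

# Count TP/TN/FP/FN in one pass over the DataFrame.
counts = spark_df.agg(
    F.sum(((F.col("label") == 1) & (F.col("y_pred") == 1)).cast("long")).alias("TP"),
    F.sum(((F.col("label") == 0) & (F.col("y_pred") == 0)).cast("long")).alias("TN"),
    F.sum(((F.col("label") == 0) & (F.col("y_pred") == 1)).cast("long")).alias("FP"),
    F.sum(((F.col("label") == 1) & (F.col("y_pred") == 0)).cast("long")).alias("FN"),
).first()
accuracy = (counts["TP"] + counts["TN"]) / sum(counts)  # counts is a Row (a tuple)
```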
5. Experimental Results
5.1. Sensitivity Analysis
- Tune: MCMC samplers rely on the concept of Markov chains, which converge to the stationary distribution of the defined model. To obtain unbiased samples, each chain should reach convergence. By setting the tuning parameter to a value (e.g., 1000), the chain iterates 1000 times to achieve convergence before sampling from the distribution begins. The default value is 1000.
- Draws: This parameter determines the number of samples to be taken from the model distribution after the tuning process. The default value is 1000.
- Chains: It is recommended that multiple chains (2–4) be run for reliable convergence diagnostics [49]. The number of chains defaults to two or to the number of available processors. (A minimal pm.sample() call illustrating these settings appears after this list.)
- Sampling Efficiency: A smaller number of draws, with a tuning value of around 75–100% of the number of draws, yields better sampling and successful predictions on unseen data. Increasing the number of samples beyond a certain point provides no additional information about the posterior distribution of the model.
- Model Complexity and Feature Set: Models with a smaller feature set tend to perform better than those using the full set of 11 features. This is expected, because sampling from higher-dimensional distributions implies greater model complexity and can degrade performance.
- Convergence of the Sampling Algorithm: The sampling algorithm demonstrates convergence for different tuning parameters, as seen in Figure 7, but this does not guarantee successful predictions on unknown data. Convergence refers to the algorithm’s stability and not necessarily to the accuracy of predictions.
- Draws: The number of samples required for MCMC sampling depends on the complexity of the posterior distribution being sampled rather than on the number of features directly. Increasing the number of features generally increases this complexity, requiring more samples for accurate estimation.
- Tune: Larger "tune" values can prolong the adjustment phase, slowing down sampling and potentially leading to overfitting. Higher tune values may also result in samples with high autocorrelation, hindering accurate estimation. Conversely, a lower tune value leads to a shorter fitting phase, incomplete sampling, and biased estimation of the posterior distribution. Here, we check the model bias by comparing the mean of the target and predicted labels, as seen in Table 5.
- Number of Features: More features in a model do not necessarily translate to more reliable beliefs. The reliability of beliefs depends on the data themselves and on the estimates of the model parameters. While adding features can increase the available information, it also increases the risk of overfitting. Conversely, using too few features may lead to underfitting, where the model is too simplistic to capture the data's complexity.
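As referenced above, the sketch below shows how the draws, tune, and chains settings map onto a PyMC3 pm.sample() call. The priors, variable names, and training arrays (X_train, y_train) are illustrative assumptions, not the exact model of Section 4.2.1:

```python
import pymc3 as pm

# X_train, y_train (NumPy arrays) and the Normal(0, 10) priors are assumptions here.
with pm.Model():
    coeffs = pm.Normal("coeffs", mu=0.0, sigma=10.0, shape=X_train.shape[1])
    bias = pm.Normal("bias", mu=0.0, sigma=10.0)
    p = pm.math.sigmoid(pm.math.dot(X_train, coeffs) + bias)
    pm.Bernoulli("y_obs", p=p, observed=y_train)
    # One configuration from the sensitivity analysis: draws=1000, tune=800, chains=2.
    trace = pm.sample(draws=1000, tune=800, chains=2, return_inferencedata=False)
```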
5.2. Predictions in Apache Spark for Different Decision Thresholds
5.3. Bayesian vs. Frequentist Logistic Regression in Apache Spark
6. Conclusions and Future Work
6.1. Conclusions
6.2. Future Work
6.2.1. Environmental Data Analysis Applications
6.2.2. Bayesian Machine Learning and Bayesian Inference
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
API | Application Programming Interface |
AQI | Air Quality Index |
AUC | Area Under the Curve |
EI | Environment Information |
FN | False Negative |
FP | False Positive |
HMC | Hamiltonian Monte Carlo |
IoT | Internet of Things |
IQR | Interquartile Range |
IT | Information Technology |
MCMC | Markov Chain Monte Carlo |
MH | Metropolis–Hastings |
ML | Machine Learning |
NUTS | No-U-Turn Sampler |
RDD | Resilient Distributed Dataset |
ROC | Receiver Operating Characteristic |
TN | True Negative |
TP | True Positive |
UDF | User-Defined Function |
References
- Villanueva, F.; Ródenas, M.; Ruus, A.; Saffell, J.; Gabriel, M.F. Sampling and analysis techniques for inorganic air pollutants in indoor air. Appl. Spectrosc. Rev. 2021, 57, 531–579.
- Martínez Torres, J.; Pastor Pérez, J.; Sancho Val, J.; McNabola, A.; Martínez Comesaña, M.; Gallagher, J. A Functional Data Analysis Approach for the Detection of Air Pollution Episodes and Outliers: A Case Study in Dublin, Ireland. Mathematics 2020, 8, 225.
- Karras, C.; Karras, A.; Avlonitis, M.; Giannoukou, I.; Sioutas, S. Maximum Likelihood Estimators on MCMC Sampling Algorithms for Decision Making. In Proceedings of the Artificial Intelligence Applications and Innovations, AIAI 2022 IFIP WG 12.5 International Workshops, Crete, Greece, 17–20 June 2022; Maglogiannis, I., Iliadis, L., Macintyre, J., Cortez, P., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 345–356.
- Wang, G.; Wang, T. Unbiased Multilevel Monte Carlo methods for intractable distributions: MLMC meets MCMC. arXiv 2022, arXiv:2204.04808.
- Braham, H.; Berdjoudj, L.; Boualem, M.; Rahmania, N. Analysis of a non-Markovian queueing model: Bayesian statistics and MCMC methods. Monte Carlo Methods Appl. 2019, 25, 147–154.
- Altschuler, J.M.; Talwar, K. Resolving the Mixing Time of the Langevin Algorithm to its Stationary Distribution for Log-Concave Sampling. arXiv 2022, arXiv:2210.08448.
- Paguyo, J. Mixing times of a Burnside process Markov chain on set partitions. arXiv 2022, arXiv:2207.14269.
- Dymetman, M.; Bouchard, G.; Carter, S. The OS* algorithm: A joint approach to exact optimization and sampling. arXiv 2012, arXiv:1207.0742.
- Jaini, P.; Nielsen, D.; Welling, M. Sampling in combinatorial spaces with SurVAE flow augmented MCMC. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 13–15 April 2021; pp. 3349–3357.
- Vono, M.; Paulin, D.; Doucet, A. Efficient MCMC sampling with dimension-free convergence rate using ADMM-type splitting. J. Mach. Learn. Res. 2022, 23, 1100–1168.
- Pinski, F.J. A Novel Hybrid Monte Carlo Algorithm for Sampling Path Space. Entropy 2021, 23, 499.
- Beraha, M.; Argiento, R.; Møller, J.; Guglielmi, A. MCMC Computations for Bayesian Mixture Models Using Repulsive Point Processes. J. Comput. Graph. Stat. 2022, 31, 422–435.
- Cotter, S.L.; Roberts, G.O.; Stuart, A.M.; White, D. MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Stat. Sci. 2013, 28, 424–446.
- Craiu, R.V.; Levi, E. Approximate Methods for Bayesian Computation. Annu. Rev. Stat. Its Appl. 2023, 10, 379–399.
- Van Ravenzwaaij, D.; Cassey, P.; Brown, S.D. A simple introduction to Markov Chain Monte–Carlo sampling. Psychon. Bull. Rev. 2018, 25, 143–154.
- Karras, C.; Karras, A.; Avlonitis, M.; Sioutas, S. An Overview of MCMC Methods: From Theory to Applications. In Proceedings of the Artificial Intelligence Applications and Innovations, AIAI 2022 IFIP WG 12.5 International Workshops, Crete, Greece, 17–20 June 2022; Maglogiannis, I., Iliadis, L., Macintyre, J., Cortez, P., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 319–332.
- Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. Adv. Neural Inf. Process. Syst. 2012, 25.
- Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective; Academic Press: Cambridge, MA, USA, 2015.
- Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79.
- Band, S.S.; Janizadeh, S.; Saha, S.; Mukherjee, K.; Bozchaloei, S.K.; Cerdà, A.; Shokri, M.; Mosavi, A. Evaluating the Efficiency of Different Regression, Decision Tree, and Bayesian Machine Learning Algorithms in Spatial Piping Erosion Susceptibility Using ALOS/PALSAR Data. Land 2020, 9, 346.
- Itoo, F.; Singh, S. Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection. Int. J. Inf. Technol. 2021, 13, 1503–1511.
- Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
- Wei, X.; Wang, H. Stochastic stratigraphic modeling using Bayesian machine learning. Eng. Geol. 2022, 307, 106789.
- Hitchcock, D.B. A history of the Metropolis–Hastings algorithm. Am. Stat. 2003, 57, 254–257.
- Robert, C.; Casella, G.; Robert, C.P.; Casella, G. Metropolis–Hastings algorithms. In Introducing Monte Carlo Methods with R; Springer: Berlin/Heidelberg, Germany, 2010; pp. 167–197.
- Hassibi, B.; Hansen, M.; Dimakis, A.G.; Alshamary, H.A.J.; Xu, W. Optimized Markov Chain Monte Carlo for Signal Detection in MIMO Systems: An Analysis of the Stationary Distribution and Mixing Time. IEEE Trans. Signal Process. 2014, 62, 4436–4450.
- Chib, S.; Greenberg, E. Understanding the Metropolis–Hastings algorithm. Am. Stat. 1995, 49, 327–335.
- Hoogerheide, L.F.; van Dijk, H.K.; van Oest, R.D. Simulation Based Bayesian Econometric Inference: Principles and Some Recent Computational Advances. Econom. J. 2007, 215–280.
- Johannes, M.; Polson, N. MCMC methods for continuous-time financial econometrics. In Handbook of Financial Econometrics: Applications; Elsevier: Amsterdam, The Netherlands, 2010; pp. 1–72.
- Flury, T.; Shephard, N. Bayesian inference based only on simulated likelihood: Particle filter analysis of dynamic economic models. Econom. Theory 2011, 27, 933–956.
- Zuev, K.M.; Katafygiotis, L.S. Modified Metropolis–Hastings algorithm with delayed rejection. Probabilistic Eng. Mech. 2011, 26, 405–412.
- Alotaibi, R.; Nassar, M.; Elshahhat, A. Computational Analysis of XLindley Parameters Using Adaptive Type-II Progressive Hybrid Censoring with Applications in Chemical Engineering. Mathematics 2022, 10, 3355.
- Afify, A.Z.; Gemeay, A.M.; Alfaer, N.M.; Cordeiro, G.M.; Hafez, E.H. Power-modified Kies-exponential distribution: Properties, classical and Bayesian inference with an application to engineering data. Entropy 2022, 24, 883.
- Elshahhat, A.; Elemary, B.R. Analysis for Xgamma parameters of life under Type-II adaptive progressively hybrid censoring with applications in engineering and chemistry. Symmetry 2021, 13, 2112.
- Delmas, J.F.; Jourdain, B. Does waste-recycling really improve Metropolis–Hastings Monte Carlo algorithm? arXiv 2006, arXiv:math/0611949.
- Datta, S.; Gayraud, G.; Leclerc, E.; Bois, F.Y. Graph sampler: A C language software for fully Bayesian analyses of Bayesian networks. arXiv 2015, arXiv:1505.07228.
- Gamerman, D. Markov chain Monte Carlo for dynamic generalised linear models. Biometrika 1998, 85, 215–227.
- Chua, A.J.K.; Vallisneri, M. Learning Bayes' theorem with a neural network for gravitational-wave inference. arXiv 2019, arXiv:1909.05966.
- Vuckovic, J. Nonlinear MCMC for Bayesian Machine Learning. In Proceedings of the Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022.
- Green, S.R.; Gair, J. Complete parameter inference for GW150914 using deep learning. Mach. Learn. Sci. Technol. 2021, 2, 03LT01.
- Martino, L.; Elvira, V. Metropolis sampling. arXiv 2017, arXiv:1704.04629.
- Catanach, T.A.; Vo, H.D.; Munsky, B. Bayesian inference of stochastic reaction networks using multifidelity sequential tempered Markov chain Monte Carlo. Int. J. Uncertain. Quantif. 2020, 10, 515–542.
- Burke, N. Metropolis, Metropolis–Hastings and Gibbs Sampling Algorithms; Lakehead University Thunder Bay: Thunder Bay, ON, Canada, 2018.
- Apers, S.; Gribling, S.; Szilágyi, D. Hamiltonian Monte Carlo for efficient Gaussian sampling: Long and random steps. arXiv 2022, arXiv:2209.12771.
- Hoffman, M.D.; Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014, 15, 1593–1623.
- Soluciones, D. Air Quality in Madrid (2001–2018). In Kaggle: A Platform for Data Science; Kaggle: San Francisco, CA, USA, 2018.
- Aguilar, P.M.; Carrera, L.G.; Segura, C.C.; Sánchez, M.I.T.; Peña, M.F.V.; Hernán, G.B.; Rodríguez, I.E.; Zapata, R.M.R.; Lucas, E.Z.D.; Álvarez, P.D.A.; et al. Relationship between air pollution levels in Madrid and the natural history of idiopathic pulmonary fibrosis: Severity and mortality. J. Int. Med. Res. 2021, 49, 03000605211029058.
- Salvatier, J.; Wiecki, T.V.; Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2016, 2, e55.
- Salvatier, J.; Wiecki, T.V.; Fonnesbeck, C. Sampling, PyMC3 Documentation. Online Documentation. 2021. Available online: https://www.pymc.io/projects/docs/en/v3/pymc-examples/examples/getting_started.html (accessed on 1 May 2023).
- Hossin, M.; Sulaiman, M.N. A review on evaluation metrics for data classification evaluations. Int. J. Data Min. Knowl. Manag. Process. 2015, 5, 1.
- Blair, G.S.; Henrys, P.; Leeson, A.; Watkins, J.; Eastoe, E.; Jarvis, S.; Young, P.J. Data science of the natural environment: A research roadmap. Front. Environ. Sci. 2019, 7, 121.
- Kozlova, M.; Yeomans, J.S. Sustainability Analysis and Environmental Decision-Making Using Simulation, Optimization, and Computational Analytics. Sustainability 2022, 14, 1655.
- Bhuiyan, M.A.M.; Sahi, R.K.; Islam, M.R.; Mahmud, S. Machine Learning Techniques Applied to Predict Tropospheric Ozone in a Semi-Arid Climate Region. Mathematics 2021, 9, 2901.
- Del Giudice, D.; Löwe, R.; Madsen, H.; Mikkelsen, P.S.; Rieckermann, J. Comparison of two stochastic techniques for reliable urban runoff prediction by modeling systematic errors. Water Resour. Res. 2015, 51, 5004–5022.
- Cheng, T.; Wang, J.; Li, X. A Hybrid Framework for Space–Time Modeling of Environmental Data. Geogr. Anal. 2011, 43, 188–210.
- Chen, L.; He, Q.; Wan, H.; He, S.; Deng, M. Statistical computation methods for microbiome compositional data network inference. arXiv 2021, arXiv:2109.01993.
- Li, J.B.; Qu, S.; Metze, F.; Huang, P.Y. AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification. arXiv 2022, arXiv:2203.13448.
- Jubair, S.; Domaratzki, M. Crop genomic selection with deep learning and environmental data: A survey. Front. Artif. Intell. 2022, 5, 1040295.
- Hsiao, H.C.W.; Chen, S.H.F.; Tsai, J.J.P. Deep Learning for Risk Analysis of Specific Cardiovascular Diseases Using Environmental Data and Outpatient Records. In Proceedings of the 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE), Taichung, Taiwan, 31 October–2 November 2016; pp. 369–372.
- Jin, X.B.; Zheng, W.Z.; Kong, J.L.; Wang, X.Y.; Zuo, M.; Zhang, Q.C.; Lin, S. Deep-Learning Temporal Predictor via Bidirectional Self-Attentive Encoder–Decoder Framework for IOT-Based Environmental Sensing in Intelligent Greenhouse. Agriculture 2021, 11, 802.
- Senthil, G.; Suganthi, P.; Prabha, R.; Madhumathi, M.; Prabhu, S.; Sridevi, S. An Enhanced Smart Intelligent Detecting and Alerting System for Industrial Gas Leakage using IoT in Sensor Network. In Proceedings of the 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 23–25 January 2023; pp. 397–401.
- Liu, B.; Zhou, Y.; Fu, H.; Fu, P.; Feng, L. Lightweight Self-Detection and Self-Calibration Strategy for MEMS Gas Sensor Arrays. Sensors 2022, 22, 4315.
- Fascista, A. Toward Integrated Large-Scale Environmental Monitoring Using WSN/UAV/Crowdsensing: A Review of Applications, Signal Processing, and Future Perspectives. Sensors 2022, 22, 1824.
- Karras, A.; Karras, C.; Schizas, N.; Avlonitis, M.; Sioutas, S. AutoML with Bayesian Optimizations for Big Data Management. Information 2023, 14, 223.
- Schizas, N.; Karras, A.; Karras, C.; Sioutas, S. TinyML for Ultra-Low Power AI and Large Scale IoT Deployments: A Systematic Review. Future Internet 2022, 14, 363.
- Karras, C.; Karras, A.; Giotopoulos, K.C.; Avlonitis, M.; Sioutas, S. Consensus Big Data Clustering for Bayesian Mixture Models. Algorithms 2023, 16, 245.
- Krafft, P.M.; Zheng, J.; Pan, W.; Della Penna, N.; Altshuler, Y.; Shmueli, E.; Tenenbaum, J.B.; Pentland, A. Human collective intelligence as distributed Bayesian inference. arXiv 2016, arXiv:1608.01987.
- Winter, S.; Campbell, T.; Lin, L.; Srivastava, S.; Dunson, D.B. Machine Learning and the Future of Bayesian Computation. arXiv 2023, arXiv:2304.11251.
Pollutant (µg/m³) | Good | Fair | Moderate | Poor | Very Poor | Extremely Poor |
---|---|---|---|---|---|---|
PM2.5 | 0–10 | 10–20 | 20–25 | 25–50 | 50–75 | 75–800 |
PM10 | 0–20 | 20–40 | 40–50 | 50–100 | 100–150 | 150–1200 |
NO2 | 0–40 | 40–90 | 90–120 | 120–230 | 230–340 | 340–1000 |
O3 | 0–50 | 50–100 | 100–130 | 130–240 | 240–380 | 380–800 |
SO2 | 0–100 | 100–200 | 200–350 | 350–500 | 500–750 | 750–1250 |
Draws | Tune | Chains | Time | Accuracy | Precision | AUC ROC |
---|---|---|---|---|---|---|
1000 | 800 | 2 | 16 s | 0.89628 | 0.8419 | |
1000 | 1000 | 2 | 23 s | 0.85246 | 0.85107 | |
2000 | 1600 | 4 | 35 s | 0.84532 | 0.867619 | 0.833 |
5000 | 2500 | 8 | 120 s | 0.838744 | 0.84916 | 0.829 |
5000 | 5000 | 8 | 144 s | 0.63898 | 0.64640 | 0.645 |
10,000 | 9000 | 8 | 226 s | 0.4918 | 0.4944 | 0.4944 |
Draws | Tune | Chains | Time | Accuracy | Precision | AUC ROC |
---|---|---|---|---|---|---|
1000 | 800 | 4 | 26 s | 0.185357 | 0.1871 | 0.1859 |
1000 | 1000 | 2 | 16 s | 0.4679 | 0.38946 | 0.4365 |
2000 | 2000 | 4 | 36 s | 0.3448 | 0.33836 | 0.3386 |
2000 | 2000 | 4 | 39 s | 0.70559 | 0.78998 | 0.6776 |
4000 | 3000 | 4 | 53 s | 0.7355 | 0.73815 | 0.7397 |
5000 | 5000 | 4 | 80 s | 0.74536 | 0.7701 | 0.7575 |
8000 | 8000 | 8 | 80 s | 0.66774 | 0.673757 | 0.6737 |
Threshold | TP | TN | FP | FN | Accuracy | Precision | Recall/Specificity | AUC ROC |
---|---|---|---|---|---|---|---|---|
0.49 | 1,736,750 | 1,144,988 | 926,446 | 180,076 | 0.4611 | 0.4584 | 0.5526 | 0.5047 |
0.499 | 1,630,466 | 1,626,731 | 444,703 | 106,324 | 0.8553 | 0.7857 | 0.7851 | 0.8620 |
0.4999 | 1,574,588 | 1,738,770 | 332,664 | 162,202 | 0.8700 | 0.8255 | 0.8399 | 0.8730 |
0.5 | 1,567,495 | 1,750,169 | 321,265 | 169,295 | 0.8711 | 0.8708 | 0.8449 | 0.8737 |
0.5001 | 1,560,479 | 1,761,557 | 309,877 | 176,311 | 0.8723 | 0.8343 | 0.8505 | 0.8744 |
0.501 | 1,490,945 | 1,853,274 | 218,160 | 245,845 | 0.8781 | 0.8723 | 0.8945 | 0.8765 |
0.505 | 1,285,186 | 2,062,755 | 8,679 | 451,604 | 0.8791 | 0.9932 | 0.9958 | 0.8678 |
0.506 | 1,271,575 | 2,070,642 | 792 | 465,215 | 0.8776 | 0.9993 | 0.999 | 0.8658 |
Mean of Target Column | Mean of Predicted Labels |
---|---|
0.45294 | 0.49240 |
Metrics | Bayesian Logistic Regression | Frequentist Logistic Regression |
---|---|---|
Accuracy | 0.8791 | 0.8923 |
Precision | 0.9932 | 0.9270 |
Recall/Specificity | 0.9958 | 0.9452 |
ROC AUC | 0.8678 | 0.9614 |
Time | 35.3 s | 35.3 s |
Confusion Matrix [TP, FN; FP, TN] | [1,285,186, 451,604; 8,679, 2,062,755] | [1,440,301, 296,489; 113,412, 1,958,022] |
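For reference, a frequentist baseline like the one compared above can be fit with Spark MLlib's LogisticRegression. The column names and pipeline below are assumptions for illustration, not the authors' exact setup:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the feature columns into a single vector column, then fit.
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
train_df = assembler.transform(spark_df)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=100)
lr_model = lr.fit(train_df)
predictions = lr_model.transform(train_df)  # adds "probability" and "prediction"
```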