<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:dcterms="http://purl.org/dc/terms/"
 xmlns:cc="http://web.resource.org/cc/"
 xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:admin="http://webns.net/mvcb/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel rdf:about="https://www.mdpi.com/rss/journal/analytics">
		<title>Analytics</title>
		<description>Latest open access articles published in Analytics at https://www.mdpi.com/journal/analytics</description>
		<link>https://www.mdpi.com/journal/analytics</link>
		<admin:generatorAgent rdf:resource="https://www.mdpi.com/journal/analytics"/>
		<admin:errorReportsTo rdf:resource="mailto:support@mdpi.com"/>
		<dc:publisher>MDPI</dc:publisher>
		<dc:language>en</dc:language>
		<dc:rights>Creative Commons Attribution (CC-BY)</dc:rights>
						<prism:copyright>MDPI</prism:copyright>
		<prism:rightsAgent>support@mdpi.com</prism:rightsAgent>
		<image rdf:resource="https://pub.mdpi-res.com/img/design/mdpi-pub-logo.png?13cf3b5bd783e021&amp;1773146959"/>
				<items>
			<rdf:Seq>
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/14" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/13" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/12" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/11" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/10" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/9" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/8" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/7" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/6" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/5" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/4" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/3" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/2" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/5/1/1" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/36" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/35" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/34" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/33" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/32" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/31" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/30" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/29" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/28" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/27" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/4/26" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/25" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/24" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/23" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/22" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/21" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/20" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/19" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/18" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/3/17" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/16" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/15" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/14" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/13" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/12" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/2/11" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/10" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/9" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/8" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/7" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/6" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/5" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/4" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/3" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/2" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/4/1/1" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/28" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/27" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/26" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/25" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/24" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/23" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/4/22" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/21" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/20" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/19" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/18" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/17" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/16" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/3/15" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/2/14" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/2/13" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/2/12" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/2/11" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/2/10" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/9" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/8" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/7" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/6" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/5" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/4" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/3" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/2" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/3/1/1" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/46" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/45" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/44" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/43" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/42" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/4/41" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/40" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/39" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/38" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/37" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/36" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/35" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/34" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/33" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/32" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/31" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/3/30" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/2/29" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/2/28" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/2/27" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/2/26" />
            				<rdf:li rdf:resource="https://www.mdpi.com/2813-2203/2/2/25" />
                    	</rdf:Seq>
		</items>
				<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/" />
	</channel>

        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/14">

	<title>Analytics, Vol. 5, Pages 14: A Decade of Evolution: Evaluating Student Preferences for Degree Selection in the Spanish Public University System Through Directional Community Analysis (2014&amp;ndash;2023)</title>
	<link>https://www.mdpi.com/2813-2203/5/1/14</link>
	<description>The Spanish Public University System (SUPE) assigns student placements through a multi-step application process governed by legal criteria. Analyzing how students move between different degree programs during this process is crucial for universities to optimize and plan their academic offerings. This paper analyzes a decade of student pre-registration data (2014&amp;ndash;2023) to track evolving preferences and mobility between degrees. We model this process as a directed graph, mapping student traffic and studying the formation of directional communities within the degree network. A significant challenge is the weakly connected and poorly conditioned nature of these graphs, which impedes standard community detection algorithms. Extending prior work that relied on manually set thresholds for pruning edges, we propose a novel adaptive pruning algorithm that requires no manual intervention. Applying this method to annual data improves community detection performance and reveals gradual shifts in student preferences and demand, particularly in response to new degrees. These insights provide a valuable decision-making tool for higher education institutions, helping them refine their degree offerings in response to evolving trends.</description>
	<pubDate>2026-03-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 14: A Decade of Evolution: Evaluating Student Preferences for Degree Selection in the Spanish Public University System Through Directional Community Analysis (2014&ndash;2023)</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/14">doi: 10.3390/analytics5010014</a></p>
	<p>Authors:
		José-Miguel Montañana
		Antonio Hervás
		Pedro-Pablo Soriano-Jiménez
		</p>
	<p>The Spanish Public University System (SUPE) assigns student placements through a multi-step application process governed by legal criteria. Analyzing how students move between different degree programs during this process is crucial for universities to optimize and plan their academic offerings. This paper analyzes a decade of student pre-registration data (2014&ndash;2023) to track evolving preferences and mobility between degrees. We model this process as a directed graph, mapping student traffic and studying the formation of directional communities within the degree network. A significant challenge is the weakly connected and poorly conditioned nature of these graphs, which impedes standard community detection algorithms. Extending prior work that relied on manually set thresholds for pruning edges, we propose a novel adaptive pruning algorithm that requires no manual intervention. Applying this method to annual data improves community detection performance and reveals gradual shifts in student preferences and demand, particularly in response to new degrees. These insights provide a valuable decision-making tool for higher education institutions, helping them refine their degree offerings in response to evolving trends.</p>
	]]></content:encoded>

	<dc:title>A Decade of Evolution: Evaluating Student Preferences for Degree Selection in the Spanish Public University System Through Directional Community Analysis (2014&amp;ndash;2023)</dc:title>
			<dc:creator>José-Miguel Montañana</dc:creator>
			<dc:creator>Antonio Hervás</dc:creator>
			<dc:creator>Pedro-Pablo Soriano-Jiménez</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010014</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-03-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-03-11</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>14</prism:startingPage>
		<prism:doi>10.3390/analytics5010014</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/14</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/13">

	<title>Analytics, Vol. 5, Pages 13: Distributed Orders Management in Make-to-Order Supply Chain Networks Using Game-Based Alternating Direction Method of Multipliers</title>
	<link>https://www.mdpi.com/2813-2203/5/1/13</link>
	<description>Operations scheduling of mass customized products is vital in the modern make-to-order (MTO) supply chains. In these systems, order acceptance decisions should be coordinated with available capacity in different sections of the supply chain while considering their potential correlations and interactions. One of the fundamental challenges in optimization of these systems is the computation time of solving models with multiple coupling constraints between supply chain units. This paper addresses this issue by proposing a game-based framework that decomposes the related mixed integer programming mathematical model and it is coordinated and solved using integrated game-based Alternating Direction Method of Multipliers (ADMM). The proposed Stackelberg Leader-Follower game optimizes order acceptance decisions while considering the requirements in supply, production planning, maintenance, inventory, and distribution units. To validate the efficiency of the proposed framework, the model is tested with a simulated four-layer supply chain. The results of experiments proved that decompositions of the model to smaller subsections and solving it in a distributed manner not only optimizes supply chain participating units but also coordinate their movements to achieve the global optimal solution. The proposed framework offers managers a practical decision layer that preserve local autonomy of the supply chain units and reduce their data sharing and computation burdens and concerns.</description>
	<pubDate>2026-03-09</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 13: Distributed Orders Management in Make-to-Order Supply Chain Networks Using Game-Based Alternating Direction Method of Multipliers</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/13">doi: 10.3390/analytics5010013</a></p>
	<p>Authors:
		Amirhosein Gholami
		Nasim Nezamoddini
		Mohammad T. Khasawneh
		</p>
	<p>Operations scheduling of mass customized products is vital in the modern make-to-order (MTO) supply chains. In these systems, order acceptance decisions should be coordinated with available capacity in different sections of the supply chain while considering their potential correlations and interactions. One of the fundamental challenges in optimization of these systems is the computation time of solving models with multiple coupling constraints between supply chain units. This paper addresses this issue by proposing a game-based framework that decomposes the related mixed integer programming mathematical model and it is coordinated and solved using integrated game-based Alternating Direction Method of Multipliers (ADMM). The proposed Stackelberg Leader-Follower game optimizes order acceptance decisions while considering the requirements in supply, production planning, maintenance, inventory, and distribution units. To validate the efficiency of the proposed framework, the model is tested with a simulated four-layer supply chain. The results of experiments proved that decompositions of the model to smaller subsections and solving it in a distributed manner not only optimizes supply chain participating units but also coordinate their movements to achieve the global optimal solution. The proposed framework offers managers a practical decision layer that preserve local autonomy of the supply chain units and reduce their data sharing and computation burdens and concerns.</p>
	]]></content:encoded>

	<dc:title>Distributed Orders Management in Make-to-Order Supply Chain Networks Using Game-Based Alternating Direction Method of Multipliers</dc:title>
			<dc:creator>Amirhosein Gholami</dc:creator>
			<dc:creator>Nasim Nezamoddini</dc:creator>
			<dc:creator>Mohammad T. Khasawneh</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010013</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-03-09</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-03-09</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>13</prism:startingPage>
		<prism:doi>10.3390/analytics5010013</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/13</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/12">

	<title>Analytics, Vol. 5, Pages 12: Operationalising CTT and IRT in Spreadsheets: A Methodological Demonstration for Classroom Assessment</title>
	<link>https://www.mdpi.com/2813-2203/5/1/12</link>
	<description>The evaluation of student performance often relies on basic spreadsheet outputs that provide limited insight into item functioning. This study presents a methodological demonstration showing how widely available spreadsheet software can be transformed into a practical environment for psychometric analysis. Using a simulated dataset of 40 students responding to 20 dichotomous items, spreadsheet formulas were developed to compute descriptive statistics and Classical Test Theory (CTT) indices, including item difficulty, discrimination, and corrected item&amp;ndash;total correlations. The demonstration was extended to Item Response Theory (IRT) through the implementation of 1PL, 2PL, and 3PL logistic models using forward-calculated item parameters. A smaller dataset of 10 students and 10 items was used to illustrate the interpretability of the indices and the generation of Item Characteristic Curves (ICCs). Results show that spreadsheets can support teachers in interpreting test data beyond total scores, enabling the identification of weak items, refinement of distractors, and construction of small-scale item banks aligned with competence-based curricula. The approach contributes to Sustainable Development Goal 4 (SDG 4) by promoting accessible, equitable, and high-quality assessment practices. Limitations include the instability of IRT parameter estimation in small samples and the need for teacher training. Future research should apply the approach to real classroom data, explore automation within spreadsheet environments, and examine the integration of artificial intelligence for adaptive assessment.</description>
	<pubDate>2026-02-24</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 12: Operationalising CTT and IRT in Spreadsheets: A Methodological Demonstration for Classroom Assessment</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/12">doi: 10.3390/analytics5010012</a></p>
	<p>Authors:
		António Faria
		Guilhermina Lobato Miranda
		</p>
	<p>The evaluation of student performance often relies on basic spreadsheet outputs that provide limited insight into item functioning. This study presents a methodological demonstration showing how widely available spreadsheet software can be transformed into a practical environment for psychometric analysis. Using a simulated dataset of 40 students responding to 20 dichotomous items, spreadsheet formulas were developed to compute descriptive statistics and Classical Test Theory (CTT) indices, including item difficulty, discrimination, and corrected item&ndash;total correlations. The demonstration was extended to Item Response Theory (IRT) through the implementation of 1PL, 2PL, and 3PL logistic models using forward-calculated item parameters. A smaller dataset of 10 students and 10 items was used to illustrate the interpretability of the indices and the generation of Item Characteristic Curves (ICCs). Results show that spreadsheets can support teachers in interpreting test data beyond total scores, enabling the identification of weak items, refinement of distractors, and construction of small-scale item banks aligned with competence-based curricula. The approach contributes to Sustainable Development Goal 4 (SDG 4) by promoting accessible, equitable, and high-quality assessment practices. Limitations include the instability of IRT parameter estimation in small samples and the need for teacher training. Future research should apply the approach to real classroom data, explore automation within spreadsheet environments, and examine the integration of artificial intelligence for adaptive assessment.</p>
	]]></content:encoded>

	<dc:title>Operationalising CTT and IRT in Spreadsheets: A Methodological Demonstration for Classroom Assessment</dc:title>
			<dc:creator>António Faria</dc:creator>
			<dc:creator>Guilhermina Lobato Miranda</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010012</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-02-24</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-02-24</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>12</prism:startingPage>
		<prism:doi>10.3390/analytics5010012</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/12</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/11">

	<title>Analytics, Vol. 5, Pages 11: Integrating Deep Learning Nodes into an Augmented Decision Tree for Automated Medical Coding</title>
	<link>https://www.mdpi.com/2813-2203/5/1/11</link>
	<description>Accurate assignment of International Classification of Diseases (ICD) codes is essential for healthcare analytics, billing, and clinical research. However, manual coding remains time-consuming and error-prone due to the scale and complexity of the ICD taxonomy. While hierarchical deep learning approaches have improved automated coding, their deployment across large taxonomies raises scalability and efficiency concerns. To address these limitations, we introduce the Augmented Decision Tree (ADT) framework, which integrates deep learning with symbolic rule-based logic for automated medical coding. ADT employs an automated lexical screening mechanism to dynamically select the most appropriate modeling strategy for each decision node, thereby minimizing manual configuration. Nodes with high keyword distinctiveness are handled by symbolic rules, while semantically ambiguous nodes are assigned to deep contextual models fine-tuned from PubMedBERT. This selective design eliminates the need to train a deep learning model at every node, significantly reducing computational cost. A case study demonstrates that this hybrid and adaptive ADT approach supports scalable and efficient ICD coding. Experimental results show that ADT outperforms a pure decision tree baseline and achieves accuracy comparable to that of a full deep learning-based decision tree, while requiring substantially less training time and computational resources.</description>
	<pubDate>2026-02-12</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 11: Integrating Deep Learning Nodes into an Augmented Decision Tree for Automated Medical Coding</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/11">doi: 10.3390/analytics5010011</a></p>
	<p>Authors:
		Spoorthi Bhat
		Veda Sahaja Bandi
		Haiping Xu
		Joshua Carberry
		</p>
	<p>Accurate assignment of International Classification of Diseases (ICD) codes is essential for healthcare analytics, billing, and clinical research. However, manual coding remains time-consuming and error-prone due to the scale and complexity of the ICD taxonomy. While hierarchical deep learning approaches have improved automated coding, their deployment across large taxonomies raises scalability and efficiency concerns. To address these limitations, we introduce the Augmented Decision Tree (ADT) framework, which integrates deep learning with symbolic rule-based logic for automated medical coding. ADT employs an automated lexical screening mechanism to dynamically select the most appropriate modeling strategy for each decision node, thereby minimizing manual configuration. Nodes with high keyword distinctiveness are handled by symbolic rules, while semantically ambiguous nodes are assigned to deep contextual models fine-tuned from PubMedBERT. This selective design eliminates the need to train a deep learning model at every node, significantly reducing computational cost. A case study demonstrates that this hybrid and adaptive ADT approach supports scalable and efficient ICD coding. Experimental results show that ADT outperforms a pure decision tree baseline and achieves accuracy comparable to that of a full deep learning-based decision tree, while requiring substantially less training time and computational resources.</p>
	]]></content:encoded>

	<dc:title>Integrating Deep Learning Nodes into an Augmented Decision Tree for Automated Medical Coding</dc:title>
			<dc:creator>Spoorthi Bhat</dc:creator>
			<dc:creator>Veda Sahaja Bandi</dc:creator>
			<dc:creator>Haiping Xu</dc:creator>
			<dc:creator>Joshua Carberry</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010011</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-02-12</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-02-12</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>11</prism:startingPage>
		<prism:doi>10.3390/analytics5010011</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/11</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/10">

	<title>Analytics, Vol. 5, Pages 10: Site Selection for Solar Photovoltaic Power Plant Using MCDM Method with New De-i-Fuzzification Technique</title>
	<link>https://www.mdpi.com/2813-2203/5/1/10</link>
	<description>Choosing sites for solar photovoltaic (PV) power plants in developing countries like India is a crucial task while considering multiple conflicting factors and sub-factors simultaneously. Multi-criteria decision-making (MCDM) is an optimisation method that provides a framework for handling such situations in an intuitionistic fuzzy environment. The complexity and uncertainty associated with the site selection model are dealt with professionally. The Criteria Importance Through Intercriteria Correlation (CRITIC) method is applied to determine the relative importance of the criteria, identifying airflow speed as the most influential factor, followed by humidity ratio, level of dust haze, availability of labour and resources, and ecological effects. This shows that airflow speed plays an important role in the power plant&amp;rsquo;s efficiency and performance. The Vlse Kriterijumska Optimizacija I Kompromisno Re&amp;scaron;enje (VIKOR) method is then used to prioritise the alternatives as potential locations for setting up a solar PV power plant in India. A new de-i-fuzzification method based on the relative difference between two real numbers is also proposed. Sensitivity analyses and comparative studies are conducted to assess the robustness and effectiveness of the framework. Overall, the results demonstrate that the proposed framework is useful and effective for optimising site selection for solar power plants in India.</description>
	<pubDate>2026-02-09</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 10: Site Selection for Solar Photovoltaic Power Plant Using MCDM Method with New De-i-Fuzzification Technique</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/10">doi: 10.3390/analytics5010010</a></p>
	<p>Authors:
		Kamal Hossain Gazi
		Asesh Kumar Mukherjee
		Shashi Bajaj Mukherjee
		Sankar Prasad Mondal
		Soheil Salahshour
		Arijit Ghosh
		</p>
	<p>Choosing sites for solar photovoltaic (PV) power plants in developing countries like India is a crucial task while considering multiple conflicting factors and sub-factors simultaneously. Multi-criteria decision-making (MCDM) is an optimisation method that provides a framework for handling such situations in an intuitionistic fuzzy environment. The complexity and uncertainty associated with the site selection model are dealt with professionally. The Criteria Importance Through Intercriteria Correlation (CRITIC) method is applied to determine the relative importance of the criteria, identifying airflow speed as the most influential factor, followed by humidity ratio, level of dust haze, availability of labour and resources, and ecological effects. This shows that airflow speed plays an important role in the power plant&rsquo;s efficiency and performance. The Vlse Kriterijumska Optimizacija I Kompromisno Re&scaron;enje (VIKOR) method is then used to prioritise the alternatives as potential locations for setting up a solar PV power plant in India. A new de-i-fuzzification method based on the relative difference between two real numbers is also proposed. Sensitivity analyses and comparative studies are conducted to assess the robustness and effectiveness of the framework. Overall, the results demonstrate that the proposed framework is useful and effective for optimising site selection for solar power plants in India.</p>
	]]></content:encoded>

	<dc:title>Site Selection for Solar Photovoltaic Power Plant Using MCDM Method with New De-i-Fuzzification Technique</dc:title>
			<dc:creator>Kamal Hossain Gazi</dc:creator>
			<dc:creator>Asesh Kumar Mukherjee</dc:creator>
			<dc:creator>Shashi Bajaj Mukherjee</dc:creator>
			<dc:creator>Sankar Prasad Mondal</dc:creator>
			<dc:creator>Soheil Salahshour</dc:creator>
			<dc:creator>Arijit Ghosh</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010010</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-02-09</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-02-09</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>10</prism:startingPage>
		<prism:doi>10.3390/analytics5010010</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/10</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/9">

	<title>Analytics, Vol. 5, Pages 9: Denoising Stock Price Time Series with Singular Spectrum Analysis for Enhanced Deep Learning Forecasting</title>
	<link>https://www.mdpi.com/2813-2203/5/1/9</link>
	<description>Aim: Stock price prediction remains a highly challenging task due to the complex and nonlinear nature of financial time series data. While deep learning (DL) has shown promise in capturing these nonlinear patterns, its effectiveness is often hindered by the low signal-to-noise ratio inherent in market data. This study aims to enhance the stock predictive performance and trading outcomes by integrating Singular Spectrum Analysis (SSA) with deep learning models for stock price forecasting and strategy development on the Australian Securities Exchange (ASX)50 index. Method: The proposed framework begins by applying SSA to decompose raw stock price time series into interpretable components, effectively isolating meaningful trends and eliminating noise. The denoised sequences are then used to train a suite of deep learning architectures, including Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and hybrid CNN-LSTM models. These models are evaluated based on their forecasting accuracy and the profitability of the trading strategies derived from their predictions. Results: Experimental results demonstrated that the SSA-DL framework significantly improved the prediction accuracy and trading performance compared to baseline DL models trained on raw data. The best-performing model, SSA-CNN-LSTM, achieved a Sharpe Ratio of 1.88 and a return on investment (ROI) of 67%, indicating robust risk-adjusted returns and effective exploitation of the underlying market conditions. Conclusions: The integration of Singular Spectrum Analysis with deep learning offers a powerful approach to stock price prediction in noisy financial environments. By denoising input data prior to model training, the SSA-DL framework enhanced signal clarity, improved forecast reliability, and enabled the construction of profitable trading strategies. These findings suggested a strong potential for SSA-based preprocessing in financial time series modeling.</description>
	<pubDate>2026-01-27</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 9: Denoising Stock Price Time Series with Singular Spectrum Analysis for Enhanced Deep Learning Forecasting</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/9">doi: 10.3390/analytics5010009</a></p>
	<p>Authors:
		Carol Anne Hargreaves
		Zixian Fan
		</p>
	<p>Aim: Stock price prediction remains a highly challenging task due to the complex and nonlinear nature of financial time series data. While deep learning (DL) has shown promise in capturing these nonlinear patterns, its effectiveness is often hindered by the low signal-to-noise ratio inherent in market data. This study aims to enhance the stock predictive performance and trading outcomes by integrating Singular Spectrum Analysis (SSA) with deep learning models for stock price forecasting and strategy development on the Australian Securities Exchange (ASX)50 index. Method: The proposed framework begins by applying SSA to decompose raw stock price time series into interpretable components, effectively isolating meaningful trends and eliminating noise. The denoised sequences are then used to train a suite of deep learning architectures, including Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and hybrid CNN-LSTM models. These models are evaluated based on their forecasting accuracy and the profitability of the trading strategies derived from their predictions. Results: Experimental results demonstrated that the SSA-DL framework significantly improved the prediction accuracy and trading performance compared to baseline DL models trained on raw data. The best-performing model, SSA-CNN-LSTM, achieved a Sharpe Ratio of 1.88 and a return on investment (ROI) of 67%, indicating robust risk-adjusted returns and effective exploitation of the underlying market conditions. Conclusions: The integration of Singular Spectrum Analysis with deep learning offers a powerful approach to stock price prediction in noisy financial environments. By denoising input data prior to model training, the SSA-DL framework enhanced signal clarity, improved forecast reliability, and enabled the construction of profitable trading strategies. These findings suggested a strong potential for SSA-based preprocessing in financial time series modeling.</p>
	]]></content:encoded>

	<dc:title>Denoising Stock Price Time Series with Singular Spectrum Analysis for Enhanced Deep Learning Forecasting</dc:title>
			<dc:creator>Carol Anne Hargreaves</dc:creator>
			<dc:creator>Zixian Fan</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010009</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-01-27</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-01-27</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>9</prism:startingPage>
		<prism:doi>10.3390/analytics5010009</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/9</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/8">

	<title>Analytics, Vol. 5, Pages 8: From Models to Metrics: A Governance Framework for Large Language Models in Enterprise AI and Analytics</title>
	<link>https://www.mdpi.com/2813-2203/5/1/8</link>
	<description>Large language models (LLMs) and other foundation models are rapidly being woven into enterprise analytics workflows, where they assist with data exploration, forecasting, decision support, and automation. These systems can feel like powerful new teammates: creative, scalable, and tireless. Yet they also introduce distinctive risks related to opacity, brittleness, bias, and misalignment with organizational goals. Existing work on AI ethics, alignment, and governance provides valuable principles and technical safeguards, but enterprises still lack practical frameworks that connect these ideas to the specific metrics, controls, and workflows by which analytics teams design, deploy, and monitor LLM-powered systems. This paper proposes a conceptual governance framework for enterprise AI and analytics that is explicitly centered on LLMs embedded in analytics pipelines. The framework adopts a three-layered perspective&amp;amp;mdash;model and data alignment, system and workflow alignment, and ecosystem and governance alignment&amp;amp;mdash;that links technical properties of models to enterprise analytics practices, performance indicators, and oversight mechanisms. In practical terms, the framework shows how model and workflow choices translate into concrete metrics and inform real deployment, monitoring, and scaling decisions for LLM-powered analytics. We also illustrate how this framework can guide the design of controls for metrics, monitoring, human-in-the-loop structures, and incident response in LLM-driven analytics. The paper concludes with implications for analytics leaders and governance teams seeking to operationalize responsible, scalable use of LLMs in enterprise settings.</description>
	<pubDate>2026-01-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 8: From Models to Metrics: A Governance Framework for Large Language Models in Enterprise AI and Analytics</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/8">doi: 10.3390/analytics5010008</a></p>
	<p>Authors:
		Darshan Desai
		Ashish Desai
		</p>
	<p>Large language models (LLMs) and other foundation models are rapidly being woven into enterprise analytics workflows, where they assist with data exploration, forecasting, decision support, and automation. These systems can feel like powerful new teammates: creative, scalable, and tireless. Yet they also introduce distinctive risks related to opacity, brittleness, bias, and misalignment with organizational goals. Existing work on AI ethics, alignment, and governance provides valuable principles and technical safeguards, but enterprises still lack practical frameworks that connect these ideas to the specific metrics, controls, and workflows by which analytics teams design, deploy, and monitor LLM-powered systems. This paper proposes a conceptual governance framework for enterprise AI and analytics that is explicitly centered on LLMs embedded in analytics pipelines. The framework adopts a three-layered perspective&amp;amp;mdash;model and data alignment, system and workflow alignment, and ecosystem and governance alignment&amp;amp;mdash;that links technical properties of models to enterprise analytics practices, performance indicators, and oversight mechanisms. In practical terms, the framework shows how model and workflow choices translate into concrete metrics and inform real deployment, monitoring, and scaling decisions for LLM-powered analytics. We also illustrate how this framework can guide the design of controls for metrics, monitoring, human-in-the-loop structures, and incident response in LLM-driven analytics. The paper concludes with implications for analytics leaders and governance teams seeking to operationalize responsible, scalable use of LLMs in enterprise settings.</p>
	]]></content:encoded>

	<dc:title>From Models to Metrics: A Governance Framework for Large Language Models in Enterprise AI and Analytics</dc:title>
			<dc:creator>Darshan Desai</dc:creator>
			<dc:creator>Ashish Desai</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010008</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-01-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-01-11</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>8</prism:startingPage>
		<prism:doi>10.3390/analytics5010008</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/8</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/7">

	<title>Analytics, Vol. 5, Pages 7: Predicting ESG Scores Using Machine Learning for Data-Driven Sustainable Investment</title>
	<link>https://www.mdpi.com/2813-2203/5/1/7</link>
	<description>Environmental, social and governance (ESG) metrics increasingly inform sustainable investment yet suffer from inter-rater heterogeneity and incomplete reporting, limiting their utility for forward-looking allocation. In this study, we developed and validated a two-level stacked-ensemble machine-learning framework to predict total ESG risk scores for S&amp;amp;amp;P 500 firms using a comprehensive feature set comprising pillar sub-scores, controversy measures, firm financials, categorical descriptors and geospatial environmental indicators. Data pre-processing combined median/mean imputation, one-hot encoding, normalization and rigorous feature engineering; models were trained with an 80:20 train&amp;amp;ndash;test split and hyperparameters tuned by k-fold cross-validation. The stacked ensemble substantially outperformed single-model baselines (RMSE = 1.006, MAE = 0.664, MAPE = 3.13%, R2 = 0.979, CV_RMSE_Mean = 1.383, CV_R2_Mean = 0.957), with LightGBM and gradient boosting as competitive comparators. Permutation importance and correlation analysis identified environmental and social components as primary drivers (environmental importance = 0.41; social = 0.32), with potential multicollinearity between component and aggregate scores. This study concludes that ensemble-based predictive analytics can produce reliable, actionable ESG estimates to enhance screening and prioritization in sustainable investment, while recommending human review for extreme predictions and further work to harmonize cross-provider score divergence.</description>
	<pubDate>2026-01-09</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 7: Predicting ESG Scores Using Machine Learning for Data-Driven Sustainable Investment</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/7">doi: 10.3390/analytics5010007</a></p>
	<p>Authors:
		Sanskruti Patel
		Abhay Nath
		Pranav Desai
		</p>
	<p>Environmental, social and governance (ESG) metrics increasingly inform sustainable investment yet suffer from inter-rater heterogeneity and incomplete reporting, limiting their utility for forward-looking allocation. In this study, we developed and validated a two-level stacked-ensemble machine-learning framework to predict total ESG risk scores for S&amp;amp;amp;P 500 firms using a comprehensive feature set comprising pillar sub-scores, controversy measures, firm financials, categorical descriptors and geospatial environmental indicators. Data pre-processing combined median/mean imputation, one-hot encoding, normalization and rigorous feature engineering; models were trained with an 80:20 train&amp;amp;ndash;test split and hyperparameters tuned by k-fold cross-validation. The stacked ensemble substantially outperformed single-model baselines (RMSE = 1.006, MAE = 0.664, MAPE = 3.13%, R2 = 0.979, CV_RMSE_Mean = 1.383, CV_R2_Mean = 0.957), with LightGBM and gradient boosting as competitive comparators. Permutation importance and correlation analysis identified environmental and social components as primary drivers (environmental importance = 0.41; social = 0.32), with potential multicollinearity between component and aggregate scores. This study concludes that ensemble-based predictive analytics can produce reliable, actionable ESG estimates to enhance screening and prioritization in sustainable investment, while recommending human review for extreme predictions and further work to harmonize cross-provider score divergence.</p>
	]]></content:encoded>

	<dc:title>Predicting ESG Scores Using Machine Learning for Data-Driven Sustainable Investment</dc:title>
			<dc:creator>Sanskruti Patel</dc:creator>
			<dc:creator>Abhay Nath</dc:creator>
			<dc:creator>Pranav Desai</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010007</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-01-09</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-01-09</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>7</prism:startingPage>
		<prism:doi>10.3390/analytics5010007</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/7</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/6">

	<title>Analytics, Vol. 5, Pages 6: Interference-Driven Scaling Variability in Burst-Based Loopless Invasion Percolation Models of Induced Seismicity</title>
	<link>https://www.mdpi.com/2813-2203/5/1/6</link>
	<description>Many fluid-injection sequences display burst-like seismicity with approximate power-law event-size distributions whose exponents drift between catalogs. Classical percolation models instead predict fixed, dimension-dependent exponents and do not specify which geometric mechanisms could underlie such b-value variability. We address this gap using two loopless invasion percolation variants&amp;amp;mdash;the constrained Leath invasion percolation (CLIP) and avalanche invasion percolation (AIP) models&amp;amp;mdash;to generate synthetic burst catalogs and quantify how burst geometry modifies size&amp;amp;ndash;frequency statistics. For each model we measure burst-size distributions and an interference fraction, defined as the proportion of attempted growth steps that terminate on previously activated bonds. Single-burst clusters recover the Fisher exponent of classical percolation, whereas multi-burst sequences show systematic, dimension-dependent drift of the effective exponent with a burst number that is strongly correlated with the interference fraction. CLIP and AIP are indistinguishable under these diagnostics, indicating that interference-driven exponent drift is a generic feature of burst growth rather than a model-specific artifact. Mapping the size-distribution exponent to an equivalent Gutenberg&amp;amp;ndash;Richter b-value shows that increasing interference suppresses large bursts and produces b value ranges comparable to those reported for injection-induced seismicity, supporting the interpretation of interference as a geometric proxy for mechanical inhibition that limits the growth of large events in real fracture networks.</description>
	<pubDate>2026-01-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 6: Interference-Driven Scaling Variability in Burst-Based Loopless Invasion Percolation Models of Induced Seismicity</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/6">doi: 10.3390/analytics5010006</a></p>
	<p>Authors:
		Ian Baughman
		John B. Rundle
		</p>
	<p>Many fluid-injection sequences display burst-like seismicity with approximate power-law event-size distributions whose exponents drift between catalogs. Classical percolation models instead predict fixed, dimension-dependent exponents and do not specify which geometric mechanisms could underlie such b-value variability. We address this gap using two loopless invasion percolation variants&amp;amp;mdash;the constrained Leath invasion percolation (CLIP) and avalanche invasion percolation (AIP) models&amp;amp;mdash;to generate synthetic burst catalogs and quantify how burst geometry modifies size&amp;amp;ndash;frequency statistics. For each model we measure burst-size distributions and an interference fraction, defined as the proportion of attempted growth steps that terminate on previously activated bonds. Single-burst clusters recover the Fisher exponent of classical percolation, whereas multi-burst sequences show systematic, dimension-dependent drift of the effective exponent with a burst number that is strongly correlated with the interference fraction. CLIP and AIP are indistinguishable under these diagnostics, indicating that interference-driven exponent drift is a generic feature of burst growth rather than a model-specific artifact. Mapping the size-distribution exponent to an equivalent Gutenberg&amp;amp;ndash;Richter b-value shows that increasing interference suppresses large bursts and produces b value ranges comparable to those reported for injection-induced seismicity, supporting the interpretation of interference as a geometric proxy for mechanical inhibition that limits the growth of large events in real fracture networks.</p>
	]]></content:encoded>

	<dc:title>Interference-Driven Scaling Variability in Burst-Based Loopless Invasion Percolation Models of Induced Seismicity</dc:title>
			<dc:creator>Ian Baughman</dc:creator>
			<dc:creator>John B. Rundle</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010006</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-01-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-01-06</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>6</prism:startingPage>
		<prism:doi>10.3390/analytics5010006</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/6</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/5">

	<title>Analytics, Vol. 5, Pages 5: PSYCH&amp;mdash;Psychometric Assessment of Large Language Model Characters: An Exploration of the German Language</title>
	<link>https://www.mdpi.com/2813-2203/5/1/5</link>
	<description>Background: Existing evaluations of large language models (LLMs) largely emphasize linguistic and factual performance, while their psychometric characteristics and behavioral biases remain insufficiently examined, particularly beyond English-language contexts. This study presents a systematic psychometric screening of LLMs in German using the validated Big Five Inventory-2 (BFI-2). Methods: Thirty-two contemporary commercial and open-source LLMs completed all 60 BFI-2 items 60 times each (once with and once without having to justify their answers), yielding over 330,000 responses. Models answered independently, under male and female impersonation, and with and without required justifications. Responses were compared to German human reference data using Welch&amp;amp;rsquo;s t-tests (p&amp;amp;lt;0.01) to assess deviations, response stability, justification effects, and gender differences. Results: At the domain level, LLM personality profiles broadly align with human means. Facet-level analyses, however, reveal systematic deviations, including inflated agreement&amp;amp;mdash;especially in Agreeableness and Aesthetic Sensitivity&amp;amp;mdash;and reduced Negative Emotionality. Only a few models show minimal deviations. Justification prompts significantly altered responses in 56% of models, often increasing variability. Commercial models exhibited substantially higher response stability than open-source models. Gender impersonation affected up to 25% of BFI-2 items, reflecting and occasionally amplifying human gender differences. Conclusions: This study introduces a reproducible psychometric framework for benchmarking LLM behavior against validated human norms and shows that LLMs produce stable yet systematically biased personality-like response patterns. Psychometric screening could therefore complement traditional LLM evaluation in sensitive applications.</description>
	<pubDate>2026-01-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 5: PSYCH&mdash;Psychometric Assessment of Large Language Model Characters: An Exploration of the German Language</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/5">doi: 10.3390/analytics5010005</a></p>
	<p>Authors:
		Nane Kratzke
		Niklas Beuter
		André Drews
		Monique Janneck
		</p>
	<p>Background: Existing evaluations of large language models (LLMs) largely emphasize linguistic and factual performance, while their psychometric characteristics and behavioral biases remain insufficiently examined, particularly beyond English-language contexts. This study presents a systematic psychometric screening of LLMs in German using the validated Big Five Inventory-2 (BFI-2). Methods: Thirty-two contemporary commercial and open-source LLMs completed all 60 BFI-2 items 60 times each (once with and once without having to justify their answers), yielding over 330,000 responses. Models answered independently, under male and female impersonation, and with and without required justifications. Responses were compared to German human reference data using Welch&amp;amp;rsquo;s t-tests (p&amp;amp;lt;0.01) to assess deviations, response stability, justification effects, and gender differences. Results: At the domain level, LLM personality profiles broadly align with human means. Facet-level analyses, however, reveal systematic deviations, including inflated agreement&amp;amp;mdash;especially in Agreeableness and Aesthetic Sensitivity&amp;amp;mdash;and reduced Negative Emotionality. Only a few models show minimal deviations. Justification prompts significantly altered responses in 56% of models, often increasing variability. Commercial models exhibited substantially higher response stability than open-source models. Gender impersonation affected up to 25% of BFI-2 items, reflecting and occasionally amplifying human gender differences. Conclusions: This study introduces a reproducible psychometric framework for benchmarking LLM behavior against validated human norms and shows that LLMs produce stable yet systematically biased personality-like response patterns. Psychometric screening could therefore complement traditional LLM evaluation in sensitive applications.</p>
	]]></content:encoded>

	<dc:title>PSYCH&amp;mdash;Psychometric Assessment of Large Language Model Characters: An Exploration of the German Language</dc:title>
			<dc:creator>Nane Kratzke</dc:creator>
			<dc:creator>Niklas Beuter</dc:creator>
			<dc:creator>André Drews</dc:creator>
			<dc:creator>Monique Janneck</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010005</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2026-01-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2026-01-06</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>5</prism:startingPage>
		<prism:doi>10.3390/analytics5010005</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/5</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/4">

	<title>Analytics, Vol. 5, Pages 4: GSM: An Integrated GAM&amp;ndash;SHAP&amp;ndash;MCDA Framework for Stroke Risk Assessment</title>
	<link>https://www.mdpi.com/2813-2203/5/1/4</link>
	<description>This study proposes GSM, an interpretable and operational GAM-SHAP-MCDA framework for stroke risk stratification by integrating generalized additive models (GAMs), a point-based clinical scoring system, SHAP-based explainability, and multi-criteria decision analysis (MCDA). Using a publicly available dataset of n=5110 individuals (4.87% stroke prevalence), a GAM was fitted to capture nonlinear effects of key physiological predictors, including age, average blood glucose level, and body mass index (BMI), together with linear effects for hypertension, heart disease, and categorical covariates. The estimated smooth functions revealed strong age-related risk acceleration beyond 60 years, threshold behavior for glucose levels above approximately 180mg/dL, and a non-monotonic BMI association with peak risk at moderate BMI ranges. In a comparative evaluation, the GAM achieved superior discrimination and calibration relative to classical logistic regression, with a mean AUC of 0.846 versus 0.812 and a lower Brier score (0.045 vs. 0.051). A calibration analysis yielded an intercept of &amp;amp;minus;0.04 and a slope of 1.03, indicating near-ideal agreement between the predicted and observed risks. While high-capacity ensemble models such as XGBoost achieved slightly higher AUC values (0.862), the GAM attained near-upper-bound performance while retaining full interpretability. To enhance clinical usability, the GAM smooth effects were discretized into clinically interpretable bands and converted into an additive point-based risk score ranging from 0 to 42, which was subsequently calibrated to absolute stroke probability. The calibrated probabilities were incorporated into the TOPSIS and VIKOR MCDA frameworks, producing transparent and robust patient prioritization rankings. A SHAP analysis confirmed age, glucose, and cardiometabolic factors as dominant global contributors, aligning with the learned GAM structure. 
Overall, the proposed GAM&amp;amp;ndash;SHAP&amp;amp;ndash;MCDA framework demonstrates that near-state-of-the-art predictive performance can be achieved alongside transparency, calibration, and decision-oriented interpretability, supporting ethical and practical deployment of medical artificial intelligence for stroke risk assessment.</description>
	<pubDate>2025-12-29</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 4: GSM: An Integrated GAM&ndash;SHAP&ndash;MCDA Framework for Stroke Risk Assessment</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/4">doi: 10.3390/analytics5010004</a></p>
	<p>Authors:
		Rilwan Mustapha
		Ashiribo Wusu
		Olusola Olabanjo
		Bamidele Adetunji
		</p>
	<p>This study proposes GSM, an interpretable and operational GAM-SHAP-MCDA framework for stroke risk stratification by integrating generalized additive models (GAMs), a point-based clinical scoring system, SHAP-based explainability, and multi-criteria decision analysis (MCDA). Using a publicly available dataset of n=5110 individuals (4.87% stroke prevalence), a GAM was fitted to capture nonlinear effects of key physiological predictors, including age, average blood glucose level, and body mass index (BMI), together with linear effects for hypertension, heart disease, and categorical covariates. The estimated smooth functions revealed strong age-related risk acceleration beyond 60 years, threshold behavior for glucose levels above approximately 180mg/dL, and a non-monotonic BMI association with peak risk at moderate BMI ranges. In a comparative evaluation, the GAM achieved superior discrimination and calibration relative to classical logistic regression, with a mean AUC of 0.846 versus 0.812 and a lower Brier score (0.045 vs. 0.051). A calibration analysis yielded an intercept of &amp;amp;minus;0.04 and a slope of 1.03, indicating near-ideal agreement between the predicted and observed risks. While high-capacity ensemble models such as XGBoost achieved slightly higher AUC values (0.862), the GAM attained near-upper-bound performance while retaining full interpretability. To enhance clinical usability, the GAM smooth effects were discretized into clinically interpretable bands and converted into an additive point-based risk score ranging from 0 to 42, which was subsequently calibrated to absolute stroke probability. The calibrated probabilities were incorporated into the TOPSIS and VIKOR MCDA frameworks, producing transparent and robust patient prioritization rankings. A SHAP analysis confirmed age, glucose, and cardiometabolic factors as dominant global contributors, aligning with the learned GAM structure. 
Overall, the proposed GAM&amp;amp;ndash;SHAP&amp;amp;ndash;MCDA framework demonstrates that near-state-of-the-art predictive performance can be achieved alongside transparency, calibration, and decision-oriented interpretability, supporting ethical and practical deployment of medical artificial intelligence for stroke risk assessment.</p>
	]]></content:encoded>

	<dc:title>GSM: An Integrated GAM&amp;ndash;SHAP&amp;ndash;MCDA Framework for Stroke Risk Assessment</dc:title>
			<dc:creator>Rilwan Mustapha</dc:creator>
			<dc:creator>Ashiribo Wusu</dc:creator>
			<dc:creator>Olusola Olabanjo</dc:creator>
			<dc:creator>Bamidele Adetunji</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010004</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-12-29</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-12-29</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>4</prism:startingPage>
		<prism:doi>10.3390/analytics5010004</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/4</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/3">

	<title>Analytics, Vol. 5, Pages 3: Can Length Limit for App Titles Benefit Consumers?</title>
	<link>https://www.mdpi.com/2813-2203/5/1/3</link>
	<description>The App Store introduced a title-length limit for mobile apps in 2016, and similar policies were later adopted across the industry. This issue drew considerable attention from industry practitioners in the 2010s. Using both empirical and theoretical approaches, this paper examines the effectiveness of this policy and its welfare implications. Title length became an issue because some sellers assemble meaningful keywords in the app title to convey information to consumers, while others combine irrelevant yet popular keywords in an attempt to increase their app&amp;amp;rsquo;s downloads. We hypothesize that when titles are short, title length is positively associated with an app&amp;amp;rsquo;s performance because both honest and opportunistic sellers coexist in the market. However, due to the presence of opportunistic sellers, once titles become too long, this positive relationship disappears. We examine this hypothesis using a random sample of 1998 apps from the App Store in 2015. Our results show that for apps with titles longer than 30 characters, title length remains positively associated with app performance. However, for titles exceeding 50 characters, we do not have sufficient evidence to conclude that further increases in length continue to generate additional downloads. To interpret our empirical findings, we construct communication games between an app seller and a consumer, in which the equilibrium is characterized by a threshold. Based on our model and empirical observations, the 30-character limit might hurt consumers.</description>
	<pubDate>2025-12-29</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 3: Can Length Limit for App Titles Benefit Consumers?</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/3">doi: 10.3390/analytics5010003</a></p>
	<p>Authors:
		Saori Chiba
		Yu-Hsi Liu
		Chien-Yuan Sher
		Min-Hsueh Tsai
		</p>
	<p>The App Store introduced a title-length limit for mobile apps in 2016, and similar policies were later adopted across the industry. This issue drew considerable attention from industry practitioners in the 2010s. Using both empirical and theoretical approaches, this paper examines the effectiveness of this policy and its welfare implications. Title length became an issue because some sellers assemble meaningful keywords in the app title to convey information to consumers, while others combine irrelevant yet popular keywords in an attempt to increase their app&amp;amp;rsquo;s downloads. We hypothesize that when titles are short, title length is positively associated with an app&amp;amp;rsquo;s performance because both honest and opportunistic sellers coexist in the market. However, due to the presence of opportunistic sellers, once titles become too long, this positive relationship disappears. We examine this hypothesis using a random sample of 1998 apps from the App Store in 2015. Our results show that for apps with titles longer than 30 characters, title length remains positively associated with app performance. However, for titles exceeding 50 characters, we do not have sufficient evidence to conclude that further increases in length continue to generate additional downloads. To interpret our empirical findings, we construct communication games between an app seller and a consumer, in which the equilibrium is characterized by a threshold. Based on our model and empirical observations, the 30-character limit might hurt consumers.</p>
	]]></content:encoded>

	<dc:title>Can Length Limit for App Titles Benefit Consumers?</dc:title>
			<dc:creator>Saori Chiba</dc:creator>
			<dc:creator>Yu-Hsi Liu</dc:creator>
			<dc:creator>Chien-Yuan Sher</dc:creator>
			<dc:creator>Min-Hsueh Tsai</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010003</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-12-29</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-12-29</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>3</prism:startingPage>
		<prism:doi>10.3390/analytics5010003</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/3</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/2">

	<title>Analytics, Vol. 5, Pages 2: A Threshold Selection Method in Code Plagiarism Checking Function for Code Writing Problem in Java Programming Learning Assistant System Considering AI-Generated Codes</title>
	<link>https://www.mdpi.com/2813-2203/5/1/2</link>
	<description>To support novice learners, the Java programming learning assistant system (JPLAS) has been developed with various features. Among them, code writing problem (CWP) assigns writing an answer code that passes a given test code. The correctness of an answer code is validated by running it on JUnit. In previous works, we implemented a code plagiarism checking function that calculates the similarity score for each pair of answer codes based on the Levenshtein distance. When the score is higher than a given threshold, this pair is regarded as plagiarism. However, a method for finding the proper threshold has not been studied. In addition, AI-generated codes have become threats in plagiarism, as AI has grown in popularity, which should be investigated. In this paper, we propose a threshold selection method based on Tukey&amp;amp;rsquo;s IQR fences. It uses a custom upper threshold derived from the statistical distribution of similarity scores for each assignment. To better accommodate skewed similarity distributions, the method introduces a simple percentile-based adjustment for determining the upper threshold. We also design prompts to generate answer codes using generative AI and apply them to four AI models. For evaluation, we used a total of 745 source codes of two datasets. The first dataset consists of 420 answer codes across 12 CWP instances from 35 first-year undergraduate students in the State Polytechnic of Malang, Indonesia (POLINEMA). The second dataset includes 325 answer codes across five CWP assignments from 65 third-year undergraduate students at Okayama University, Japan. 
The applications of our proposals found the following: (1) any pair of student codes whose score is higher than the selected threshold has some evidence of plagiarism, (2) some student codes have a higher similarity than the threshold with AI-generated codes, indicating the use of generative AI, and (3) multiple AI models can generate code that resembles student-written code, despite adopting different implementations. The validity of our proposal is confirmed.</description>
	<pubDate>2025-12-26</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 2: A Threshold Selection Method in Code Plagiarism Checking Function for Code Writing Problem in Java Programming Learning Assistant System Considering AI-Generated Codes</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/2">doi: 10.3390/analytics5010002</a></p>
	<p>Authors:
		Perwira Annissa Dyah Permatasari
		Mustika Mentari
		Safira Adine Kinari
		Soe Thandar Aung
		Nobuo Funabiki
		Htoo Htoo Sandi Kyaw
		Khaing Hsu Wai
		</p>
	<p>To support novice learners, the Java programming learning assistant system (JPLAS) has been developed with various features. Among them, code writing problem (CWP) assigns writing an answer code that passes a given test code. The correctness of an answer code is validated by running it on JUnit. In previous works, we implemented a code plagiarism checking function that calculates the similarity score for each pair of answer codes based on the Levenshtein distance. When the score is higher than a given threshold, this pair is regarded as plagiarism. However, a method for finding the proper threshold has not been studied. In addition, AI-generated codes have become threats in plagiarism, as AI has grown in popularity, which should be investigated. In this paper, we propose a threshold selection method based on Tukey&amp;amp;rsquo;s IQR fences. It uses a custom upper threshold derived from the statistical distribution of similarity scores for each assignment. To better accommodate skewed similarity distributions, the method introduces a simple percentile-based adjustment for determining the upper threshold. We also design prompts to generate answer codes using generative AI and apply them to four AI models. For evaluation, we used a total of 745 source codes of two datasets. The first dataset consists of 420 answer codes across 12 CWP instances from 35 first-year undergraduate students in the State Polytechnic of Malang, Indonesia (POLINEMA). The second dataset includes 325 answer codes across five CWP assignments from 65 third-year undergraduate students at Okayama University, Japan. 
The applications of our proposals found the following: (1) any pair of student codes whose score is higher than the selected threshold has some evidence of plagiarism, (2) some student codes have a higher similarity than the threshold with AI-generated codes, indicating the use of generative AI, and (3) multiple AI models can generate code that resembles student-written code, despite adopting different implementations. The validity of our proposal is confirmed.</p>
	]]></content:encoded>

	<dc:title>A Threshold Selection Method in Code Plagiarism Checking Function for Code Writing Problem in Java Programming Learning Assistant System Considering AI-Generated Codes</dc:title>
			<dc:creator>Perwira Annissa Dyah Permatasari</dc:creator>
			<dc:creator>Mustika Mentari</dc:creator>
			<dc:creator>Safira Adine Kinari</dc:creator>
			<dc:creator>Soe Thandar Aung</dc:creator>
			<dc:creator>Nobuo Funabiki</dc:creator>
			<dc:creator>Htoo Htoo Sandi Kyaw</dc:creator>
			<dc:creator>Khaing Hsu Wai</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010002</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-12-26</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-12-26</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>2</prism:startingPage>
		<prism:doi>10.3390/analytics5010002</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/2</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/5/1/1">

	<title>Analytics, Vol. 5, Pages 1: A Novel Magnificent Frigatebird Optimization Algorithm with Proposed Movement Strategies for Enhanced Global Search</title>
	<link>https://www.mdpi.com/2813-2203/5/1/1</link>
	<description>Global optimization is a fundamental tool for addressing complex and nonlinear problems across scientific and technological domains. The primary objective of this work is to enhance the efficiency, stability, and convergence speed of the Magnificent Frigatebird Optimization (MFO) algorithm by introducing new strategies that strengthen both global exploration and local exploitation. To this end, we propose an improved version of MFO that incorporates three novel movement strategies (aggressive, conservative, and mixed), a BFGS-based local search procedure for more accurate solution refinement, and a dynamic termination criterion capable of detecting stagnation and reducing unnecessary function evaluations. The algorithm is extensively evaluated on a diverse set of benchmark functions, demonstrating substantially lower computational cost and higher reliability compared to classical evolutionary and swarm-based methods. The results confirm the effectiveness of the proposed modifications and highlight the potential of the enhanced MFO for application to demanding real-world optimization problems.</description>
	<pubDate>2025-12-23</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 5, Pages 1: A Novel Magnificent Frigatebird Optimization Algorithm with Proposed Movement Strategies for Enhanced Global Search</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/5/1/1">doi: 10.3390/analytics5010001</a></p>
	<p>Authors:
		Glykeria Kyrou
		Vasileios Charilogis
		Ioannis G. Tsoulos
		</p>
	<p>Global optimization is a fundamental tool for addressing complex and nonlinear problems across scientific and technological domains. The primary objective of this work is to enhance the efficiency, stability, and convergence speed of the Magnificent Frigatebird Optimization (MFO) algorithm by introducing new strategies that strengthen both global exploration and local exploitation. To this end, we propose an improved version of MFO that incorporates three novel movement strategies (aggressive, conservative, and mixed), a BFGS-based local search procedure for more accurate solution refinement, and a dynamic termination criterion capable of detecting stagnation and reducing unnecessary function evaluations. The algorithm is extensively evaluated on a diverse set of benchmark functions, demonstrating substantially lower computational cost and higher reliability compared to classical evolutionary and swarm-based methods. The results confirm the effectiveness of the proposed modifications and highlight the potential of the enhanced MFO for application to demanding real-world optimization problems.</p>
	]]></content:encoded>

	<dc:title>A Novel Magnificent Frigatebird Optimization Algorithm with Proposed Movement Strategies for Enhanced Global Search</dc:title>
			<dc:creator>Glykeria Kyrou</dc:creator>
			<dc:creator>Vasileios Charilogis</dc:creator>
			<dc:creator>Ioannis G. Tsoulos</dc:creator>
		<dc:identifier>doi: 10.3390/analytics5010001</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-12-23</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-12-23</prism:publicationDate>
	<prism:volume>5</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>1</prism:startingPage>
		<prism:doi>10.3390/analytics5010001</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/5/1/1</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/36">

	<title>Analytics, Vol. 4, Pages 36: Assessing the Impact of Capital Expenditure on Corporate Profitability in South Korea&amp;rsquo;s Electronics Industry: A Regression Analysis Approach</title>
	<link>https://www.mdpi.com/2813-2203/4/4/36</link>
	<description>This study investigates the relationship between capital expenditure (CAPEX) and long-term corporate profitability in South Korea&amp;rsquo;s electronics industry. Using panel data from 126 listed electronics firms covering 2005&amp;ndash;2019, the research applies fixed-effects regression analysis to examine how CAPEX influences profitability, measured by EBITDA/total assets. The results confirm that CAPEX exerts a positive and statistically significant long-term effect on profitability, with stronger but not significantly different impacts for large firms compared to SMEs. The findings contribute to empirical evidence on capital investment efficiency and the implications of economies and diseconomies of scale in capital-intensive industries.</description>
	<pubDate>2025-12-10</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 36: Assessing the Impact of Capital Expenditure on Corporate Profitability in South Korea&rsquo;s Electronics Industry: A Regression Analysis Approach</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/36">doi: 10.3390/analytics4040036</a></p>
	<p>Authors:
		Bomee Park
		Tetiana Paientko
		</p>
	<p>This study investigates the relationship between capital expenditure (CAPEX) and long-term corporate profitability in South Korea&rsquo;s electronics industry. Using panel data from 126 listed electronics firms covering 2005&ndash;2019, the research applies fixed-effects regression analysis to examine how CAPEX influences profitability, measured by EBITDA/total assets. The results confirm that CAPEX exerts a positive and statistically significant long-term effect on profitability, with stronger but not significantly different impacts for large firms compared to SMEs. The findings contribute to empirical evidence on capital investment efficiency and the implications of economies and diseconomies of scale in capital-intensive industries.</p>
	]]></content:encoded>

	<dc:title>Assessing the Impact of Capital Expenditure on Corporate Profitability in South Korea&amp;rsquo;s Electronics Industry: A Regression Analysis Approach</dc:title>
			<dc:creator>Bomee Park</dc:creator>
			<dc:creator>Tetiana Paientko</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040036</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-12-10</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-12-10</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>36</prism:startingPage>
		<prism:doi>10.3390/analytics4040036</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/36</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/35">

	<title>Analytics, Vol. 4, Pages 35: Option Pricing in the Approach of Integrating Market Risk Premium: Application to OTM Options</title>
	<link>https://www.mdpi.com/2813-2203/4/4/35</link>
	<description>In this research, we summarize the results of implementing the market risk premium into the option valuation formulas of the Black&amp;ndash;Scholes&amp;ndash;Merton model for out-of-the-money (OTM) options. We show that derivative prices can partly depend on systematic market risk, which the BSM model ignores by construction. Specifically, empirical studies are conducted using 50ETF options obtained from the Shanghai Stock Exchange, covering the periods from January 2018 to September 2022 and from December 2023 to October 2025. The pricing of the OTM options shows that the adjusted BSM formulas exhibit better pricing performance compared with the market prices of the OTM options tested. Furthermore, a framework for the empirical analysis of option prices based on the Capital Asset Pricing Model (CAPM) or factor models is discussed, which may lead to option formulas using non-homogeneous heat equations. The latter proposal requires further statistical testing using real market data but offers an alternative to the existing risk-neutral valuation of options.</description>
	<pubDate>2025-11-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 35: Option Pricing in the Approach of Integrating Market Risk Premium: Application to OTM Options</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/35">doi: 10.3390/analytics4040035</a></p>
	<p>Authors:
		David Liu
		</p>
	<p>In this research, we summarize the results of implementing the market risk premium into the option valuation formulas of the Black&amp;amp;ndash;Scholes&amp;amp;ndash;Merton model for out-of-the-money (OTM) options. We show that derivative prices can partly depend on systematic market risk, which the BSM model ignores by construction. Specifically, empirical studies are conducted using 50ETF options obtained from the Shanghai Stock Exchange, covering the periods from January 2018 to September 2022 and from December 2023 to October 2025. The pricing of the OTM options shows that the adjusted BSM formulas exhibit better pricing performance compared with the market prices of the OTM options tested. Furthermore, a framework for the empirical analysis of option prices based on the Capital Asset Pricing Model (CAPM) or factor models is discussed, which may lead to option formulas using non-homogeneous heat equations. The later proposal requires further statistical testing using real market data but offers an alternative to the existing risk-neutral valuation of options.</p>
	]]></content:encoded>

	<dc:title>Option Pricing in the Approach of Integrating Market Risk Premium: Application to OTM Options</dc:title>
			<dc:creator>David Liu</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040035</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-11-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-11-21</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>35</prism:startingPage>
		<prism:doi>10.3390/analytics4040035</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/35</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/34">

	<title>Analytics, Vol. 4, Pages 34: Fan Loyalty and Price Elasticity in Sport: Insights from Major League Baseball&amp;rsquo;s Post-Pandemic Recovery</title>
	<link>https://www.mdpi.com/2813-2203/4/4/34</link>
	<description>The COVID-19 pandemic disrupted traditional patterns of sport consumption, raising questions about whether fans would return to stadiums and how sensitive they would be to ticket prices in the recovery period. This study reconceptualizes ticket price elasticity as a market-based indicator of fan loyalty and applies it to Major League Baseball (MLB) during 2021&amp;amp;ndash;2023. Using team&amp;amp;ndash;season attendance data from Baseball-Reference, primary-market ticket prices from the Team Marketing Report Fan Cost Index, and secondary-market prices from TicketIQ, we estimate log&amp;amp;ndash;log fixed-effects panel models to separate causal price responses from popularity-driven correlations. The results show a strongly negative elasticity of attendance with respect to primary-market prices (&amp;amp;beta; &amp;amp;asymp; &amp;amp;minus;7.93, p &amp;amp;lt; 0.001), indicating that higher ticket prices substantially reduce attendance, while secondary-market prices are positively associated with attendance, reflecting demand shocks rather than causal effects. Heterogeneity analyses reveal that brand strength, team performance, and game salience significantly moderate elasticity, supporting the interpretation of inelastic demand as revealed loyalty. These findings highlight the potential of elasticity as a Fan Loyalty Index, providing a replicable framework for measuring consumer resilience. The study offers practical insights for pricing strategy, fan segmentation, and engagement, while emphasizing the broader social role of sport in restoring community identity during post-pandemic recovery.</description>
	<pubDate>2025-11-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 34: Fan Loyalty and Price Elasticity in Sport: Insights from Major League Baseball&rsquo;s Post-Pandemic Recovery</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/34">doi: 10.3390/analytics4040034</a></p>
	<p>Authors:
		Soojin Choi
		Fang Zheng
		Seung-Man Lee
		</p>
	<p>The COVID-19 pandemic disrupted traditional patterns of sport consumption, raising questions about whether fans would return to stadiums and how sensitive they would be to ticket prices in the recovery period. This study reconceptualizes ticket price elasticity as a market-based indicator of fan loyalty and applies it to Major League Baseball (MLB) during 2021&amp;amp;ndash;2023. Using team&amp;amp;ndash;season attendance data from Baseball-Reference, primary-market ticket prices from the Team Marketing Report Fan Cost Index, and secondary-market prices from TicketIQ, we estimate log&amp;amp;ndash;log fixed-effects panel models to separate causal price responses from popularity-driven correlations. The results show a strongly negative elasticity of attendance with respect to primary-market prices (&amp;amp;beta; &amp;amp;asymp; &amp;amp;minus;7.93, p &amp;amp;lt; 0.001), indicating that higher ticket prices substantially reduce attendance, while secondary-market prices are positively associated with attendance, reflecting demand shocks rather than causal effects. Heterogeneity analyses reveal that brand strength, team performance, and game salience significantly moderate elasticity, supporting the interpretation of inelastic demand as revealed loyalty. These findings highlight the potential of elasticity as a Fan Loyalty Index, providing a replicable framework for measuring consumer resilience. The study offers practical insights for pricing strategy, fan segmentation, and engagement, while emphasizing the broader social role of sport in restoring community identity during post-pandemic recovery.</p>
	]]></content:encoded>

	<dc:title>Fan Loyalty and Price Elasticity in Sport: Insights from Major League Baseball&amp;rsquo;s Post-Pandemic Recovery</dc:title>
			<dc:creator>Soojin Choi</dc:creator>
			<dc:creator>Fang Zheng</dc:creator>
			<dc:creator>Seung-Man Lee</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040034</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-11-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-11-21</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>34</prism:startingPage>
		<prism:doi>10.3390/analytics4040034</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/34</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/33">

	<title>Analytics, Vol. 4, Pages 33: AI-Powered Chatbot for FDA Drug Labeling Information Retrieval: OpenAI GPT for Grounded Question Answering</title>
	<link>https://www.mdpi.com/2813-2203/4/4/33</link>
	<description>This study presents the development of an AI-powered chatbot designed to facilitate accurate and efficient retrieval of information from the FDA drug labeling documents. Leveraging OpenAI&amp;amp;rsquo;s GPT-3.5-turbo model within a controlled, document-grounded question&amp;amp;ndash;answering framework, Chatbot was created, which can provide users with answers that are strictly limited to the content of the uploaded drug label, thereby minimizing hallucinations and enhancing traceability. A user-friendly interface built with Streamlit allows users to upload FDA labeling PDFs and pose natural language queries. The chatbot extracts relevant sections using PyMuPDF and regex-based segmentation and generates responses constrained to those sections. To evaluate performance, semantic similarity scores were computed between generated answers and ground truth text using Sentence Transformers. Results across 10 breast cancer drug labels demonstrate high semantic alignment, with most scores ranging from 0.7 to 0.9, indicating reliable summarization and contextual fidelity. The chatbot achieved high semantic similarity scores (&amp;amp;ge;0.95 for concise sections) and ROUGE scores, confirming strong semantic and textual alignment. Comparative analysis with GPT-5-chat and NotebookLM demonstrated that our approach maintains accuracy and section-specific fidelity across models. The current work is limited to a small dataset, focused on breast cancer drugs. Future work will expand to diverse therapeutic areas and incorporate BERTScore and expert-based validation.</description>
	<pubDate>2025-11-17</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 33: AI-Powered Chatbot for FDA Drug Labeling Information Retrieval: OpenAI GPT for Grounded Question Answering</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/33">doi: 10.3390/analytics4040033</a></p>
	<p>Authors:
		Manasa Koppula
		Fnu Madhulika
		Navya Sreeramoju
		Praveen Kolimi
		</p>
	<p>This study presents the development of an AI-powered chatbot designed to facilitate accurate and efficient retrieval of information from the FDA drug labeling documents. Leveraging OpenAI&amp;amp;rsquo;s GPT-3.5-turbo model within a controlled, document-grounded question&amp;amp;ndash;answering framework, Chatbot was created, which can provide users with answers that are strictly limited to the content of the uploaded drug label, thereby minimizing hallucinations and enhancing traceability. A user-friendly interface built with Streamlit allows users to upload FDA labeling PDFs and pose natural language queries. The chatbot extracts relevant sections using PyMuPDF and regex-based segmentation and generates responses constrained to those sections. To evaluate performance, semantic similarity scores were computed between generated answers and ground truth text using Sentence Transformers. Results across 10 breast cancer drug labels demonstrate high semantic alignment, with most scores ranging from 0.7 to 0.9, indicating reliable summarization and contextual fidelity. The chatbot achieved high semantic similarity scores (&amp;amp;ge;0.95 for concise sections) and ROUGE scores, confirming strong semantic and textual alignment. Comparative analysis with GPT-5-chat and NotebookLM demonstrated that our approach maintains accuracy and section-specific fidelity across models. The current work is limited to a small dataset, focused on breast cancer drugs. Future work will expand to diverse therapeutic areas and incorporate BERTScore and expert-based validation.</p>
	]]></content:encoded>

	<dc:title>AI-Powered Chatbot for FDA Drug Labeling Information Retrieval: OpenAI GPT for Grounded Question Answering</dc:title>
			<dc:creator>Manasa Koppula</dc:creator>
			<dc:creator>Fnu Madhulika</dc:creator>
			<dc:creator>Navya Sreeramoju</dc:creator>
			<dc:creator>Praveen Kolimi</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040033</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-11-17</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-11-17</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>33</prism:startingPage>
		<prism:doi>10.3390/analytics4040033</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/33</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/32">

	<title>Analytics, Vol. 4, Pages 32: Scale-Invariant Correspondence Analysis of Compositional Data</title>
	<link>https://www.mdpi.com/2813-2203/4/4/32</link>
	<description>Correspondence analysis is a dimension reduction technique for visualizing a non-negative matrix N=(nij) of size I&amp;times;J, particularly contingency tables or compositional datasets, but it depends on the row and column marginals of N. Three complementary transformations of the data T(N)=(T(ainijbj)) render CA of T(N) invariant for any ai&amp;gt;0 and bj&amp;gt;0: first, Greenacre&amp;rsquo;s scale-invariant approach, valid for positive data; second, Goodman&amp;rsquo;s marginal-free correspondence analysis, valid for positive or moderately sparse data; third, correspondence analysis of the sign-transformed matrix, sign(N)=(sign(nij)), valid for sparse or extremely sparse data. We demonstrate these three methods on four real-world datasets with varying levels of sparsity to compare their exploratory performance.</description>
	<pubDate>2025-11-12</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 32: Scale-Invariant Correspondence Analysis of Compositional Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/32">doi: 10.3390/analytics4040032</a></p>
	<p>Authors:
		Vartan Choulakian
		Jacques Allard
		</p>
	<p>Correspondence analysis is a dimension reduction technique for visualizing a non-negative matrix N=(nij) of size I&amp;amp;times;J, particularly contingency tables or compositional datasets, but it depends on the row and column marginals of N. Three complementary transformations of the data T(N)=(T(ainijbj)) render CA of T(N) invariant for any ai&amp;amp;gt;0 and bj&amp;amp;gt;0: first, Greenacre&amp;amp;rsquo;s scale-invariant approach, valid for positive data; second, Goodman&amp;amp;rsquo;s marginal-free correspondence analysis, valid for positive or moderately sparse data; third, correspondence analysis of the sign-transformed matrix, sign(N)=(sign(nij)), valid for sparse or extremely sparse data. We demonstrate these three methods on four real-world datasets with varying levels of sparsity to compare their exploratory performance.</p>
	]]></content:encoded>

	<dc:title>Scale-Invariant Correspondence Analysis of Compositional Data</dc:title>
			<dc:creator>Vartan Choulakian</dc:creator>
			<dc:creator>Jacques Allard</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040032</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-11-12</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-11-12</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>32</prism:startingPage>
		<prism:doi>10.3390/analytics4040032</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/32</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/31">

	<title>Analytics, Vol. 4, Pages 31: PlayMyData: A Statistical Analysis of a Video Game Dataset on Review Scores and Gaming Platforms</title>
	<link>https://www.mdpi.com/2813-2203/4/4/31</link>
	<description>In recent years, video games have become an increasingly popular form of entertainment and enjoyment for consumers of all ages. Given their rapid rise in production, projects such as PlayMyData aim to organize the immense amounts of data that accompany these games into sets of data for public use in research, primarily games bound specifically to modern platforms that are still being actively developed or further improved. This study aims to examine the particular differences in video game review scores using this set of data across the four listed platforms&amp;amp;mdash;Nintendo, Xbox, PlayStation, and PC&amp;amp;mdash;for different gaming titles relating to each platform. Through analysis of variance (ANOVA) testing and several other statistical analyses, significant differences between the platforms were observed, with PC games receiving the highest amount of positive scores and consistently outperforming the other three platforms, Xbox and PlayStation trailing behind PC, and Nintendo receiving the lowest review scores overall. These results illustrate the influence of platforms and their differences on player ratings and provide insight for developers and market analysts seeking to develop and invest in console platform video games.</description>
	<pubDate>2025-11-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 31: PlayMyData: A Statistical Analysis of a Video Game Dataset on Review Scores and Gaming Platforms</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/31">doi: 10.3390/analytics4040031</a></p>
	<p>Authors:
		Christian Ellington
		Paramahansa Pramanik
		Haley K. Robinson
		</p>
	<p>In recent years, video games have become an increasingly popular form of entertainment and enjoyment for consumers of all ages. Given their rapid rise in production, projects such as PlayMyData aim to organize the immense amounts of data that accompany these games into sets of data for public use in research, primarily games bound specifically to modern platforms that are still being actively developed or further improved. This study aims to examine the particular differences in video game review scores using this set of data across the four listed platforms&amp;amp;mdash;Nintendo, Xbox, PlayStation, and PC&amp;amp;mdash;for different gaming titles relating to each platform. Through analysis of variance (ANOVA) testing and several other statistical analyses, significant differences between the platforms were observed, with PC games receiving the highest amount of positive scores and consistently outperforming the other three platforms, Xbox and PlayStation trailing behind PC, and Nintendo receiving the lowest review scores overall. These results illustrate the influence of platforms and their differences on player ratings and provide insight for developers and market analysts seeking to develop and invest in console platform video games.</p>
	]]></content:encoded>

	<dc:title>PlayMyData: A Statistical Analysis of a Video Game Dataset on Review Scores and Gaming Platforms</dc:title>
			<dc:creator>Christian Ellington</dc:creator>
			<dc:creator>Paramahansa Pramanik</dc:creator>
			<dc:creator>Haley K. Robinson</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040031</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-11-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-11-11</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>31</prism:startingPage>
		<prism:doi>10.3390/analytics4040031</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/31</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/30">

	<title>Analytics, Vol. 4, Pages 30: System Inertia Cost Forecasting Using Machine Learning: A Data-Driven Approach for Grid Energy Trading in Great Britain</title>
	<link>https://www.mdpi.com/2813-2203/4/4/30</link>
	<description>As modern power systems integrate more renewable and decentralised generation, maintaining grid stability has become increasingly challenging. This study proposes a data-driven machine learning framework for forecasting system inertia service costs&amp;amp;mdash;a key yet underexplored variable influencing energy trading and frequency stability in Great Britain. Using eight years (2017&amp;amp;ndash;2024) of National Energy System Operator (NESO) data, four models&amp;amp;mdash;Long Short-Term Memory (LSTM), Residual LSTM, eXtreme Gradient Boosting (XGBoost), and Light Gradient-Boosting Machine (LightGBM)&amp;amp;mdash;are comparatively analysed. LSTM-based models capture temporal dependencies, while ensemble methods effectively handle nonlinear feature relationships. Results demonstrate that LightGBM achieves the highest predictive accuracy, offering a robust method for inertia cost estimation and market intelligence. The framework contributes to strategic procurement planning and supports market design for a more resilient, cost-effective grid.</description>
	<pubDate>2025-10-23</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 30: System Inertia Cost Forecasting Using Machine Learning: A Data-Driven Approach for Grid Energy Trading in Great Britain</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/30">doi: 10.3390/analytics4040030</a></p>
	<p>Authors:
		Maitreyee Dey
		Soumya Prakash Rana
		Preeti Patel
		</p>
	<p>As modern power systems integrate more renewable and decentralised generation, maintaining grid stability has become increasingly challenging. This study proposes a data-driven machine learning framework for forecasting system inertia service costs&amp;amp;mdash;a key yet underexplored variable influencing energy trading and frequency stability in Great Britain. Using eight years (2017&amp;amp;ndash;2024) of National Energy System Operator (NESO) data, four models&amp;amp;mdash;Long Short-Term Memory (LSTM), Residual LSTM, eXtreme Gradient Boosting (XGBoost), and Light Gradient-Boosting Machine (LightGBM)&amp;amp;mdash;are comparatively analysed. LSTM-based models capture temporal dependencies, while ensemble methods effectively handle nonlinear feature relationships. Results demonstrate that LightGBM achieves the highest predictive accuracy, offering a robust method for inertia cost estimation and market intelligence. The framework contributes to strategic procurement planning and supports market design for a more resilient, cost-effective grid.</p>
	]]></content:encoded>

	<dc:title>System Inertia Cost Forecasting Using Machine Learning: A Data-Driven Approach for Grid Energy Trading in Great Britain</dc:title>
			<dc:creator>Maitreyee Dey</dc:creator>
			<dc:creator>Soumya Prakash Rana</dc:creator>
			<dc:creator>Preeti Patel</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040030</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-10-23</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-10-23</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>30</prism:startingPage>
		<prism:doi>10.3390/analytics4040030</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/30</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/29">

	<title>Analytics, Vol. 4, Pages 29: Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects</title>
	<link>https://www.mdpi.com/2813-2203/4/4/29</link>
	<description>We introduce a distributional CNN-LSTM framework for probabilistic multivariate modeling and heterogeneous treatment effect (HTE) estimation. The model jointly captures complex dependencies among multiple outcomes and enables precise estimation of individual-level conditional average treatment effects (CATEs). In simulation studies with multivariate Gaussian mixtures, the CNN-LSTM demonstrates robust density estimation and strong CATE recovery, particularly as mixture complexity increases, while classical methods such as Kernel Density Estimation (KDE) and Gaussian Copulas may achieve higher log-likelihood or coverage in simpler scenarios. On real-world datasets, including Iris and Criteo Uplift, the CNN-LSTM achieves the lowest CATE RMSE, confirming its practical utility for individualized prediction, although KDE and Gaussian Copula approaches may perform better on global likelihood or coverage metrics. These results indicate that the CNN-LSTM can be trained efficiently on moderate-sized datasets while maintaining stable predictive performance. Overall, the framework is particularly valuable in applications requiring accurate individual-level effect estimation and handling of multimodal heterogeneity&amp;amp;mdash;such as personalized medicine, economic policy evaluation, and environmental risk assessment&amp;amp;mdash;with its primary strength being superior CATE recovery under complex outcome distributions, even when likelihood-based metrics favor simpler baselines.</description>
	<pubDate>2025-10-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 29: Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/29">doi: 10.3390/analytics4040029</a></p>
	<p>Authors:
		Jong-Min Kim
		</p>
	<p>We introduce a distributional CNN-LSTM framework for probabilistic multivariate modeling and heterogeneous treatment effect (HTE) estimation. The model jointly captures complex dependencies among multiple outcomes and enables precise estimation of individual-level conditional average treatment effects (CATEs). In simulation studies with multivariate Gaussian mixtures, the CNN-LSTM demonstrates robust density estimation and strong CATE recovery, particularly as mixture complexity increases, while classical methods such as Kernel Density Estimation (KDE) and Gaussian Copulas may achieve higher log-likelihood or coverage in simpler scenarios. On real-world datasets, including Iris and Criteo Uplift, the CNN-LSTM achieves the lowest CATE RMSE, confirming its practical utility for individualized prediction, although KDE and Gaussian Copula approaches may perform better on global likelihood or coverage metrics. These results indicate that the CNN-LSTM can be trained efficiently on moderate-sized datasets while maintaining stable predictive performance. Overall, the framework is particularly valuable in applications requiring accurate individual-level effect estimation and handling of multimodal heterogeneity&amp;amp;mdash;such as personalized medicine, economic policy evaluation, and environmental risk assessment&amp;amp;mdash;with its primary strength being superior CATE recovery under complex outcome distributions, even when likelihood-based metrics favor simpler baselines.</p>
	]]></content:encoded>

	<dc:title>Distributional CNN-LSTM, KDE, and Copula Approaches for Multimodal Multivariate Data: Assessing Conditional Treatment Effects</dc:title>
			<dc:creator>Jong-Min Kim</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040029</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-10-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-10-21</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>29</prism:startingPage>
		<prism:doi>10.3390/analytics4040029</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/29</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/28">

	<title>Analytics, Vol. 4, Pages 28: Reservoir Computation with Networks of Differentiating Neuron Ring Oscillators</title>
	<link>https://www.mdpi.com/2813-2203/4/4/28</link>
	<description>Reservoir computing is an approach to machine learning that leverages the dynamics of a complex system alongside a simple, often linear, machine learning model for a designated task. While many efforts have previously focused their attention on integrating neurons, which produce an output in response to large, sustained inputs, we focus on using differentiating neurons, which produce an output in response to large changes in input. Here, we introduce a small-world graph built from rings of differentiating neurons as a Reservoir Computing substrate. We find the coupling strength and network topology that enable these small-world networks to function as an effective reservoir. The dynamics of differentiating neurons naturally give rise to oscillatory dynamics when arranged in rings, where we study their computational use in the Reservoir Computing setting. We demonstrate the efficacy of these networks in the MNIST digit recognition task, achieving comparable performance of 90.65% to existing Reservoir Computing approaches. Beyond accuracy, we conduct systematic analysis of our reservoir&amp;amp;rsquo;s internal dynamics using three complementary complexity measures that quantify neuronal activity balance, input dependence, and effective dimensionality. Our analysis reveals that optimal performance emerges when the reservoir operates with intermediate levels of neural entropy and input sensitivity, consistent with the edge-of-chaos hypothesis, where the system balances stability and responsiveness. The findings suggest that differentiating neurons can be a potential alternative to integrating neurons and can provide a sustainable future alternative for power-hungry AI applications.</description>
	<pubDate>2025-10-20</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 28: Reservoir Computation with Networks of Differentiating Neuron Ring Oscillators</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/28">doi: 10.3390/analytics4040028</a></p>
	<p>Authors:
		Alexander Yeung
		Peter DelMastro
		Arjun Karuvally
		Hava Siegelmann
		Edward Rietman
		Hananel Hazan
		</p>
	<p>Reservoir computing is an approach to machine learning that leverages the dynamics of a complex system alongside a simple, often linear, machine learning model for a designated task. While many efforts have previously focused their attention on integrating neurons, which produce an output in response to large, sustained inputs, we focus on using differentiating neurons, which produce an output in response to large changes in input. Here, we introduce a small-world graph built from rings of differentiating neurons as a Reservoir Computing substrate. We find the coupling strength and network topology that enable these small-world networks to function as an effective reservoir. The dynamics of differentiating neurons naturally give rise to oscillatory dynamics when arranged in rings, where we study their computational use in the Reservoir Computing setting. We demonstrate the efficacy of these networks in the MNIST digit recognition task, achieving comparable performance of 90.65% to existing Reservoir Computing approaches. Beyond accuracy, we conduct systematic analysis of our reservoir&amp;amp;rsquo;s internal dynamics using three complementary complexity measures that quantify neuronal activity balance, input dependence, and effective dimensionality. Our analysis reveals that optimal performance emerges when the reservoir operates with intermediate levels of neural entropy and input sensitivity, consistent with the edge-of-chaos hypothesis, where the system balances stability and responsiveness. The findings suggest that differentiating neurons can be a potential alternative to integrating neurons and can provide a sustainable future alternative for power-hungry AI applications.</p>
	]]></content:encoded>

	<dc:title>Reservoir Computation with Networks of Differentiating Neuron Ring Oscillators</dc:title>
			<dc:creator>Alexander Yeung</dc:creator>
			<dc:creator>Peter DelMastro</dc:creator>
			<dc:creator>Arjun Karuvally</dc:creator>
			<dc:creator>Hava Siegelmann</dc:creator>
			<dc:creator>Edward Rietman</dc:creator>
			<dc:creator>Hananel Hazan</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040028</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-10-20</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-10-20</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>28</prism:startingPage>
		<prism:doi>10.3390/analytics4040028</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/28</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/27">

	<title>Analytics, Vol. 4, Pages 27: Multiplicative Decomposition Model to Predict UK&amp;rsquo;s Long-Term Electricity Demand with Monthly and Hourly Resolution</title>
	<link>https://www.mdpi.com/2813-2203/4/4/27</link>
	<description>The UK electricity market is changing to adapt to Net Zero targets and respond to disruptions like the Russia&amp;amp;ndash;Ukraine war. This requires strategic planning to decide on the construction of new electricity generation plants for a resilient UK electricity grid. Such planning is based on forecasting the UK electricity demand long-term (from 1 year and beyond). In this paper, we propose a long-term predictive model by identifying the main components of the UK electricity demand, modelling each of these components, and combining them in a multiplicative manner to deliver a single long-term prediction. To the best of our knowledge, this study is the first to apply a multiplicative decomposition model for long-term predictions at both monthly and hourly resolutions, combining neural networks with Fourier analysis. This approach is extremely flexible and accurate, with a mean absolute percentage error of 4.16% and 8.62% in predicting the monthly and hourly electricity demand, respectively, from 2019 to 2021.</description>
	<pubDate>2025-10-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 27: Multiplicative Decomposition Model to Predict UK&amp;rsquo;s Long-Term Electricity Demand with Monthly and Hourly Resolution</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/27">doi: 10.3390/analytics4040027</a></p>
	<p>Authors:
		Marie Baillon
		María Carmen Romano
		Ekkehard Ullner
		</p>
	<p>The UK electricity market is changing to adapt to Net Zero targets and respond to disruptions like the Russia&amp;amp;ndash;Ukraine war. This requires strategic planning to decide on the construction of new electricity generation plants for a resilient UK electricity grid. Such planning is based on forecasting the UK electricity demand long-term (from 1 year and beyond). In this paper, we propose a long-term predictive model by identifying the main components of the UK electricity demand, modelling each of these components, and combining them in a multiplicative manner to deliver a single long-term prediction. To the best of our knowledge, this study is the first to apply a multiplicative decomposition model for long-term predictions at both monthly and hourly resolutions, combining neural networks with Fourier analysis. This approach is extremely flexible and accurate, with a mean absolute percentage error of 4.16% and 8.62% in predicting the monthly and hourly electricity demand, respectively, from 2019 to 2021.</p>
	]]></content:encoded>

	<dc:title>Multiplicative Decomposition Model to Predict UK&amp;rsquo;s Long-Term Electricity Demand with Monthly and Hourly Resolution</dc:title>
			<dc:creator>Marie Baillon</dc:creator>
			<dc:creator>María Carmen Romano</dc:creator>
			<dc:creator>Ekkehard Ullner</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040027</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-10-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-10-06</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>27</prism:startingPage>
		<prism:doi>10.3390/analytics4040027</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/27</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/4/26">

	<title>Analytics, Vol. 4, Pages 26: Fairness in Predictive Marketing: Auditing and Mitigating Demographic Bias in Machine Learning for Customer Targeting</title>
	<link>https://www.mdpi.com/2813-2203/4/4/26</link>
	<description>As organizations increasingly turn to machine learning for customer segmentation and targeted marketing, concerns about fairness and algorithmic bias have become more urgent. This study presents a comprehensive fairness audit and mitigation framework for predictive marketing models using the Bank Marketing dataset. We train logistic regression and random forest classifiers to predict customer subscription behavior and evaluate their performance across key demographic groups, including age, education, and job type. Using model explainability techniques such as SHAP and fairness metrics including disparate impact and true positive rate parity, we uncover notable disparities in model behavior that could result in discriminatory targeting. We implement three mitigation strategies&amp;mdash;reweighing, threshold adjustment, and feature exclusion&amp;mdash;and assess their effectiveness in improving fairness while preserving business-relevant performance metrics. Among these, reweighing produced the most balanced outcome, raising the Disparate Impact Ratio for older individuals from 0.65 to 0.82 and reducing the true positive rate parity gap by over 40%, with only a modest decline in precision (from 0.78 to 0.76). We propose a replicable workflow for embedding fairness auditing into enterprise BI systems and highlight the strategic importance of ethical AI practices in building accountable and inclusive marketing technologies.</description>
	<pubDate>2025-10-01</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 26: Fairness in Predictive Marketing: Auditing and Mitigating Demographic Bias in Machine Learning for Customer Targeting</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/4/26">doi: 10.3390/analytics4040026</a></p>
	<p>Authors:
		Sayee Phaneendhar Pasupuleti
		Jagadeesh Kola
		Sai Phaneendra Manikantesh Kodete
		Sree Harsha Palli
		</p>
	<p>As organizations increasingly turn to machine learning for customer segmentation and targeted marketing, concerns about fairness and algorithmic bias have become more urgent. This study presents a comprehensive fairness audit and mitigation framework for predictive marketing models using the Bank Marketing dataset. We train logistic regression and random forest classifiers to predict customer subscription behavior and evaluate their performance across key demographic groups, including age, education, and job type. Using model explainability techniques such as SHAP and fairness metrics including disparate impact and true positive rate parity, we uncover notable disparities in model behavior that could result in discriminatory targeting. We implement three mitigation strategies&mdash;reweighing, threshold adjustment, and feature exclusion&mdash;and assess their effectiveness in improving fairness while preserving business-relevant performance metrics. Among these, reweighing produced the most balanced outcome, raising the Disparate Impact Ratio for older individuals from 0.65 to 0.82 and reducing the true positive rate parity gap by over 40%, with only a modest decline in precision (from 0.78 to 0.76). We propose a replicable workflow for embedding fairness auditing into enterprise BI systems and highlight the strategic importance of ethical AI practices in building accountable and inclusive marketing technologies.</p>
	]]></content:encoded>

	<dc:title>Fairness in Predictive Marketing: Auditing and Mitigating Demographic Bias in Machine Learning for Customer Targeting</dc:title>
			<dc:creator>Sayee Phaneendhar Pasupuleti</dc:creator>
			<dc:creator>Jagadeesh Kola</dc:creator>
			<dc:creator>Sai Phaneendra Manikantesh Kodete</dc:creator>
			<dc:creator>Sree Harsha Palli</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4040026</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-10-01</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-10-01</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>26</prism:startingPage>
		<prism:doi>10.3390/analytics4040026</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/4/26</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/25">

	<title>Analytics, Vol. 4, Pages 25: Evolution Cybercrime&amp;mdash;Key Trends, Cybersecurity Threats, and Mitigation Strategies from Historical Data</title>
	<link>https://www.mdpi.com/2813-2203/4/3/25</link>
	<description>The landscape of cybercrime has undergone significant transformations over the past decade. Present-day threats include AI-generated attacks, deep fakes, 5G network vulnerabilities, cryptojacking, and supply chain attacks, among others. To remain resilient against contemporary threats, it is essential to examine historical data to gain insights that can inform cybersecurity strategies, policy decisions, and public awareness campaigns. This paper presents a comprehensive analysis of the evolution of cyber trends in state-sponsored attacks over the past 20 years, based on the council on foreign relations state-sponsored cyber operations (2005&amp;amp;ndash;present). The study explores the key trends, patterns, and demographic shifts in cybercrime victims, the evolution of complaints and losses, and the most prevalent cyber threats over the years. It also investigates the geographical distribution, the gender disparity in victimization, the temporal peaks of specific scams, and the most frequently reported internet crimes. The findings reveal a traditional cyber landscape, with cyber threats becoming more sophisticated and monetized. Finally, the article proposes areas for further exploration through a comprehensive analysis. It provides a detailed chronicle of the trajectory of cybercrimes, offering insights into its past, present, and future.</description>
	<pubDate>2025-09-18</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 25: Evolution Cybercrime&amp;mdash;Key Trends, Cybersecurity Threats, and Mitigation Strategies from Historical Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/25">doi: 10.3390/analytics4030025</a></p>
	<p>Authors:
		Muhammad Abdullah
		Muhammad Munib Nawaz
		Bilal Saleem
		Maila Zahra
		Effa binte Ashfaq
		Zia Muhammad
		</p>
	<p>The landscape of cybercrime has undergone significant transformations over the past decade. Present-day threats include AI-generated attacks, deep fakes, 5G network vulnerabilities, cryptojacking, and supply chain attacks, among others. To remain resilient against contemporary threats, it is essential to examine historical data to gain insights that can inform cybersecurity strategies, policy decisions, and public awareness campaigns. This paper presents a comprehensive analysis of the evolution of cyber trends in state-sponsored attacks over the past 20 years, based on the council on foreign relations state-sponsored cyber operations (2005&amp;amp;ndash;present). The study explores the key trends, patterns, and demographic shifts in cybercrime victims, the evolution of complaints and losses, and the most prevalent cyber threats over the years. It also investigates the geographical distribution, the gender disparity in victimization, the temporal peaks of specific scams, and the most frequently reported internet crimes. The findings reveal a traditional cyber landscape, with cyber threats becoming more sophisticated and monetized. Finally, the article proposes areas for further exploration through a comprehensive analysis. It provides a detailed chronicle of the trajectory of cybercrimes, offering insights into its past, present, and future.</p>
	]]></content:encoded>

	<dc:title>Evolution Cybercrime&amp;mdash;Key Trends, Cybersecurity Threats, and Mitigation Strategies from Historical Data</dc:title>
			<dc:creator>Muhammad Abdullah</dc:creator>
			<dc:creator>Muhammad Munib Nawaz</dc:creator>
			<dc:creator>Bilal Saleem</dc:creator>
			<dc:creator>Maila Zahra</dc:creator>
			<dc:creator>Effa binte Ashfaq</dc:creator>
			<dc:creator>Zia Muhammad</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030025</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-09-18</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-09-18</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>25</prism:startingPage>
		<prism:doi>10.3390/analytics4030025</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/25</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/24">

	<title>Analytics, Vol. 4, Pages 24: Meta-Analysis of Artificial Intelligence&amp;rsquo;s Influence on Competitive Dynamics for Small- and Medium-Sized Financial Institutions</title>
	<link>https://www.mdpi.com/2813-2203/4/3/24</link>
	<description>Artificial intelligence adoption in financial services presents uncertain implications for competitive dynamics, particularly for smaller institutions. The literature on AI in finance is growing, but there remains a notable absence regarding the impacts on small- and medium-sized financial services firms. We conduct a meta-analysis combining a systematic literature review, sentiment bibliometrics, and network analysis to examine how AI is transforming competition across different firm sizes in the financial sector. Our analysis of 160 publications reveals predominantly positive academic sentiment toward AI in finance (mean positive sentiment 0.725 versus negative 0.586, Cohen&amp;amp;rsquo;s d = 0.790, p &amp;amp;lt; 0.0001), with anticipatory sentiment increasing significantly over time (&amp;amp;beta;=2.10&amp;amp;times;10&amp;amp;minus;2,p=0.007). However, network analysis reveals substantial conceptual fragmentation in the research discourse, with a low connectivity coefficient (&amp;amp;#981;=0.125) indicating that the field lacks unified terminology. These findings expose a critical knowledge gap: while scholars increasingly view AI as competitively advantageous, research has not coalesced around coherent models for understanding differential impacts across firm sizes. The absence of size-specific research leaves practitioners and policymakers without clear guidance on how AI adoption affects competitive positioning, particularly for smaller institutions that may face resource constraints or technological barriers. The research fragmentation identified here has direct implications for strategic planning, regulatory approaches, and employment dynamics in financial services.</description>
	<pubDate>2025-09-18</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 24: Meta-Analysis of Artificial Intelligence&amp;rsquo;s Influence on Competitive Dynamics for Small- and Medium-Sized Financial Institutions</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/24">doi: 10.3390/analytics4030024</a></p>
	<p>Authors:
		Macy Cudmore
		David Mattie
		</p>
	<p>Artificial intelligence adoption in financial services presents uncertain implications for competitive dynamics, particularly for smaller institutions. The literature on AI in finance is growing, but there remains a notable absence regarding the impacts on small- and medium-sized financial services firms. We conduct a meta-analysis combining a systematic literature review, sentiment bibliometrics, and network analysis to examine how AI is transforming competition across different firm sizes in the financial sector. Our analysis of 160 publications reveals predominantly positive academic sentiment toward AI in finance (mean positive sentiment 0.725 versus negative 0.586, Cohen&amp;amp;rsquo;s d = 0.790, p &amp;amp;lt; 0.0001), with anticipatory sentiment increasing significantly over time (&amp;amp;beta;=2.10&amp;amp;times;10&amp;amp;minus;2,p=0.007). However, network analysis reveals substantial conceptual fragmentation in the research discourse, with a low connectivity coefficient (&amp;amp;#981;=0.125) indicating that the field lacks unified terminology. These findings expose a critical knowledge gap: while scholars increasingly view AI as competitively advantageous, research has not coalesced around coherent models for understanding differential impacts across firm sizes. The absence of size-specific research leaves practitioners and policymakers without clear guidance on how AI adoption affects competitive positioning, particularly for smaller institutions that may face resource constraints or technological barriers. The research fragmentation identified here has direct implications for strategic planning, regulatory approaches, and employment dynamics in financial services.</p>
	]]></content:encoded>

	<dc:title>Meta-Analysis of Artificial Intelligence&amp;amp;rsquo;s Influence on Competitive Dynamics for Small- and Medium-Sized Financial Institutions</dc:title>
			<dc:creator>Macy Cudmore</dc:creator>
			<dc:creator>David Mattie</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030024</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-09-18</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-09-18</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>24</prism:startingPage>
		<prism:doi>10.3390/analytics4030024</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/24</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/23">

	<title>Analytics, Vol. 4, Pages 23: Game-Theoretic Analysis of MEV Attacks and Mitigation Strategies in Decentralized Finance</title>
	<link>https://www.mdpi.com/2813-2203/4/3/23</link>
	<description>Maximal Extractable Value (MEV) presents a significant challenge to the fairness and efficiency of decentralized finance (DeFi). This paper provides a game-theoretic analysis of the strategic interactions within the MEV supply chain, involving searchers, builders, and validators. A three-stage game of incomplete information is developed to model these interactions. The analysis derives the Perfect Bayesian Nash Equilibria for primary MEV attack vectors, such as sandwich attacks, and formally characterizes attacker behavior. The research demonstrates that the competitive dynamics of the current MEV market are best described as Bertrand-style competition, which compels rational actors to engage in aggressive extraction that reduces overall system welfare in a prisoner&amp;amp;rsquo;s dilemma-like outcome. To address these issues, the paper proposes and evaluates mechanism design solutions, including commit&amp;amp;ndash;reveal schemes and threshold encryption. The potential of these solutions to mitigate harmful MEV is quantified. Theoretical models are validated against on-chain data from the Ethereum blockchain, showing a close alignment between theoretical predictions and empirically observed market behavior.</description>
	<pubDate>2025-09-15</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 23: Game-Theoretic Analysis of MEV Attacks and Mitigation Strategies in Decentralized Finance</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/23">doi: 10.3390/analytics4030023</a></p>
	<p>Authors:
		Benjamin Appiah
		Daniel Commey
		Winful Bagyl-Bac
		Laurene Adjei
		Ebenezer Owusu
		</p>
	<p>Maximal Extractable Value (MEV) presents a significant challenge to the fairness and efficiency of decentralized finance (DeFi). This paper provides a game-theoretic analysis of the strategic interactions within the MEV supply chain, involving searchers, builders, and validators. A three-stage game of incomplete information is developed to model these interactions. The analysis derives the Perfect Bayesian Nash Equilibria for primary MEV attack vectors, such as sandwich attacks, and formally characterizes attacker behavior. The research demonstrates that the competitive dynamics of the current MEV market are best described as Bertrand-style competition, which compels rational actors to engage in aggressive extraction that reduces overall system welfare in a prisoner&amp;amp;rsquo;s dilemma-like outcome. To address these issues, the paper proposes and evaluates mechanism design solutions, including commit&amp;amp;ndash;reveal schemes and threshold encryption. The potential of these solutions to mitigate harmful MEV is quantified. Theoretical models are validated against on-chain data from the Ethereum blockchain, showing a close alignment between theoretical predictions and empirically observed market behavior.</p>
	]]></content:encoded>

	<dc:title>Game-Theoretic Analysis of MEV Attacks and Mitigation Strategies in Decentralized Finance</dc:title>
			<dc:creator>Benjamin Appiah</dc:creator>
			<dc:creator>Daniel Commey</dc:creator>
			<dc:creator>Winful Bagyl-Bac</dc:creator>
			<dc:creator>Laurene Adjei</dc:creator>
			<dc:creator>Ebenezer Owusu</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030023</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-09-15</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-09-15</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>23</prism:startingPage>
		<prism:doi>10.3390/analytics4030023</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/23</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/22">

	<title>Analytics, Vol. 4, Pages 22: Bankruptcy Prediction Using Machine Learning and Data Preprocessing Techniques</title>
	<link>https://www.mdpi.com/2813-2203/4/3/22</link>
	<description>Bankruptcy prediction is critical for financial risk management. This study demonstrates that machine learning models, particularly Random Forest, can substantially improve prediction accuracy compared to traditional approaches. Using data from 8262 U.S. firms (1999&amp;amp;ndash;2018), we evaluate Logistic Regression, SVM, Random Forest, ANN, and RNN in combination with robust data preprocessing steps. Random Forest achieved the highest prediction accuracy (~95%), far surpassing Logistic Regression (~57%). Key preprocessing steps included feature engineering of financial ratios, feature selection, class balancing using SMOTE, and scaling. The findings highlight that ensemble and deep learning models&amp;amp;mdash;particularly Random Forest and ANN&amp;amp;mdash;offer strong predictive performance, suggesting their suitability for early-warning financial distress systems.</description>
	<pubDate>2025-09-10</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 22: Bankruptcy Prediction Using Machine Learning and Data Preprocessing Techniques</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/22">doi: 10.3390/analytics4030022</a></p>
	<p>Authors:
		Kamil Samara
		Apurva Shinde
		</p>
	<p>Bankruptcy prediction is critical for financial risk management. This study demonstrates that machine learning models, particularly Random Forest, can substantially improve prediction accuracy compared to traditional approaches. Using data from 8262 U.S. firms (1999&amp;amp;ndash;2018), we evaluate Logistic Regression, SVM, Random Forest, ANN, and RNN in combination with robust data preprocessing steps. Random Forest achieved the highest prediction accuracy (~95%), far surpassing Logistic Regression (~57%). Key preprocessing steps included feature engineering of financial ratios, feature selection, class balancing using SMOTE, and scaling. The findings highlight that ensemble and deep learning models&amp;amp;mdash;particularly Random Forest and ANN&amp;amp;mdash;offer strong predictive performance, suggesting their suitability for early-warning financial distress systems.</p>
	]]></content:encoded>

	<dc:title>Bankruptcy Prediction Using Machine Learning and Data Preprocessing Techniques</dc:title>
			<dc:creator>Kamil Samara</dc:creator>
			<dc:creator>Apurva Shinde</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030022</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-09-10</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-09-10</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>22</prism:startingPage>
		<prism:doi>10.3390/analytics4030022</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/22</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/21">

	<title>Analytics, Vol. 4, Pages 21: Accurate Analytical Forms of Heaviside and Ramp Function</title>
	<link>https://www.mdpi.com/2813-2203/4/3/21</link>
	<description>In this paper, explicit exact representations of the Unit Step Function and Ramp Function are obtained. These important functions constitute fundamental concepts of operational calculus together with digital signal processing theory and are also involved in many other areas of applied sciences and engineering practices. In particular, according to a rigorous process from the viewpoint of Mathematical Analysis, the Unit Step Function and the Ramp Function are equivalently performed as bi-parametric single-valued functions with only one constraint imposed on each parameter. The novelty of this work, when compared with other investigations concerning accurate and/or approximate forms of Unit Step Function and/or Ramp Function, is that the proposed exact formulae are not exhibited in terms of miscellaneous special functions, e.g., Gamma Function, Biexponential Function, or any other special functions, such as Error Function, Complementary Error Function, Hyperbolic Function, or Orthogonal Polynomials. In this framework, one may deduce that these formulae may be much more practical, flexible, and useful in the computational procedures that are inserted into operational calculus and digital signal processing techniques as well as other engineering practices.</description>
	<pubDate>2025-08-26</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 21: Accurate Analytical Forms of Heaviside and Ramp Function</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/21">doi: 10.3390/analytics4030021</a></p>
	<p>Authors:
		John Constantine Venetis
		</p>
	<p>In this paper, explicit exact representations of the Unit Step Function and Ramp Function are obtained. These important functions constitute fundamental concepts of operational calculus together with digital signal processing theory and are also involved in many other areas of applied sciences and engineering practices. In particular, according to a rigorous process from the viewpoint of Mathematical Analysis, the Unit Step Function and the Ramp Function are equivalently performed as bi-parametric single-valued functions with only one constraint imposed on each parameter. The novelty of this work, when compared with other investigations concerning accurate and/or approximate forms of Unit Step Function and/or Ramp Function, is that the proposed exact formulae are not exhibited in terms of miscellaneous special functions, e.g., Gamma Function, Biexponential Function, or any other special functions, such as Error Function, Complementary Error Function, Hyperbolic Function, or Orthogonal Polynomials. In this framework, one may deduce that these formulae may be much more practical, flexible, and useful in the computational procedures that are inserted into operational calculus and digital signal processing techniques as well as other engineering practices.</p>
	]]></content:encoded>

	<dc:title>Accurate Analytical Forms of Heaviside and Ramp Function</dc:title>
			<dc:creator>John Constantine Venetis</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030021</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-08-26</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-08-26</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>21</prism:startingPage>
		<prism:doi>10.3390/analytics4030021</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/21</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/20">

	<title>Analytics, Vol. 4, Pages 20: LINEX Loss-Based Estimation of Expected Arrival Time of Next Event from HPP and NHPP Processes Past Truncated Time</title>
	<link>https://www.mdpi.com/2813-2203/4/3/20</link>
	<description>This article introduces a computational tool for Bayesian estimation of the expected time until the next event occurs in both homogeneous Poisson processes (HPPs) and non-homogeneous Poisson processes (NHPPs), following a truncated time. The estimation utilizes the linear exponential (LINEX) asymmetric loss function and incorporates both gamma and non-informative priors. Furthermore, it presents a minimax-type criterion to ascertain the optimal sample size required to achieve a specified percentage reduction in posterior risk. Simulation studies indicate that estimators employing gamma priors for both HPP and NHPP demonstrate greater accuracy compared to those based on non-informative priors and maximum likelihood estimates (MLE), provided that the proposed data-driven method for selecting hyperparameters is applied.</description>
	<pubDate>2025-08-26</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 20: LINEX Loss-Based Estimation of Expected Arrival Time of Next Event from HPP and NHPP Processes Past Truncated Time</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/20">doi: 10.3390/analytics4030020</a></p>
	<p>Authors:
		M. S. Aminzadeh
		</p>
	<p>This article introduces a computational tool for Bayesian estimation of the expected time until the next event occurs in both homogeneous Poisson processes (HPPs) and non-homogeneous Poisson processes (NHPPs), following a truncated time. The estimation utilizes the linear exponential (LINEX) asymmetric loss function and incorporates both gamma and non-informative priors. Furthermore, it presents a minimax-type criterion to ascertain the optimal sample size required to achieve a specified percentage reduction in posterior risk. Simulation studies indicate that estimators employing gamma priors for both HPP and NHPP demonstrate greater accuracy compared to those based on non-informative priors and maximum likelihood estimates (MLE), provided that the proposed data-driven method for selecting hyperparameters is applied.</p>
	]]></content:encoded>

	<dc:title>LINEX Loss-Based Estimation of Expected Arrival Time of Next Event from HPP and NHPP Processes Past Truncated Time</dc:title>
			<dc:creator>M. S. Aminzadeh</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030020</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-08-26</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-08-26</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>20</prism:startingPage>
		<prism:doi>10.3390/analytics4030020</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/20</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/19">

	<title>Analytics, Vol. 4, Pages 19: A Bounded Sine Skewed Model for Hydrological Data Analysis</title>
	<link>https://www.mdpi.com/2813-2203/4/3/19</link>
	<description>Hydrological time series frequently exhibit periodic trends with variables such as rainfall, runoff, and evaporation rates often following annual cycles. Seasonal variations further contribute to the complexity of these data sets. A critical aspect of analyzing such phenomena is estimating realistic return intervals, making the precise determination of these values essential. Given this importance, selecting an appropriate probability distribution is paramount. To address this need, we introduce a flexible probability model specifically designed to capture periodicity in hydrological data. We thoroughly examine its fundamental mathematical and statistical properties, including the asymptotic behavior of the probability density function (PDF) and hazard rate function (HRF), to enhance predictive accuracy. Our analysis reveals that the PDF exhibits polynomial decay as x&amp;amp;rarr;&amp;amp;infin;, ensuring heavy-tailed behavior suitable for extreme events. The HRF demonstrates decreasing or non-monotonic trends, reflecting variable failure risks over time. Additionally, we conduct a simulation study to evaluate the performance of the estimation method. Based on these results, we refine return period estimates, providing more reliable and robust hydrological assessments. This approach ensures that the model not only fits observed data but also captures the underlying dynamics of hydrological extremes.</description>
	<pubDate>2025-08-13</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 19: A Bounded Sine Skewed Model for Hydrological Data Analysis</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/19">doi: 10.3390/analytics4030019</a></p>
	<p>Authors:
		Tassaddaq Hussain
		Mohammad Shakil
		Mohammad Ahsanullah
		Bhuiyan Mohammad Golam Kibria
		</p>
	<p>Hydrological time series frequently exhibit periodic trends with variables such as rainfall, runoff, and evaporation rates often following annual cycles. Seasonal variations further contribute to the complexity of these data sets. A critical aspect of analyzing such phenomena is estimating realistic return intervals, making the precise determination of these values essential. Given this importance, selecting an appropriate probability distribution is paramount. To address this need, we introduce a flexible probability model specifically designed to capture periodicity in hydrological data. We thoroughly examine its fundamental mathematical and statistical properties, including the asymptotic behavior of the probability density function (PDF) and hazard rate function (HRF), to enhance predictive accuracy. Our analysis reveals that the PDF exhibits polynomial decay as x&amp;amp;rarr;&amp;amp;infin;, ensuring heavy-tailed behavior suitable for extreme events. The HRF demonstrates decreasing or non-monotonic trends, reflecting variable failure risks over time. Additionally, we conduct a simulation study to evaluate the performance of the estimation method. Based on these results, we refine return period estimates, providing more reliable and robust hydrological assessments. This approach ensures that the model not only fits observed data but also captures the underlying dynamics of hydrological extremes.</p>
	]]></content:encoded>

	<dc:title>A Bounded Sine Skewed Model for Hydrological Data Analysis</dc:title>
			<dc:creator>Tassaddaq Hussain</dc:creator>
			<dc:creator>Mohammad Shakil</dc:creator>
			<dc:creator>Mohammad Ahsanullah</dc:creator>
			<dc:creator>Bhuiyan Mohammad Golam Kibria</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030019</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-08-13</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-08-13</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>19</prism:startingPage>
		<prism:doi>10.3390/analytics4030019</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/19</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/18">

	<title>Analytics, Vol. 4, Pages 18: Predictive Framework for Regional Patent Output Using Digital Economic Indicators: A Stacked Machine Learning and Geospatial Ensemble to Address R&amp;amp;D Disparities</title>
	<link>https://www.mdpi.com/2813-2203/4/3/18</link>
	<description>As digital transformation becomes an increasingly central focus of national and regional policy agendas, parallel efforts are intensifying to stimulate innovation as a critical driver of firm competitiveness and high-quality economic growth. However, regional disparities in innovation capacity persist. This study proposes an integrated framework in which regionally tracked digital economy indicators are leveraged to predict firm-level innovation performance, measured through patent activity, across China. Drawing on a comprehensive dataset covering 13 digital economic indicators from 2013 to 2022, this study spans core, broad, and narrow dimensions of digital development. Spatial dependencies among these indicators are assessed using global and local spatial autocorrelation measures, including Moran&amp;amp;rsquo;s I and Geary&amp;amp;rsquo;s C, to provide actionable insights for constructing innovation-conducive environments. To model the predictive relationship between digital metrics and innovation output, this study employs a suite of supervised machine learning techniques&amp;amp;mdash;Random Forest, Extreme Learning Machine (ELM), Support Vector Machine (SVM), XGBoost, and stacked ensemble approaches. Our findings demonstrate the potential of digital infrastructure metrics to serve as early indicators of regional innovation capacity, offering a data-driven foundation for targeted policymaking, strategic resource allocation, and the design of adaptive digital innovation ecosystems.</description>
	<pubDate>2025-07-08</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 18: Predictive Framework for Regional Patent Output Using Digital Economic Indicators: A Stacked Machine Learning and Geospatial Ensemble to Address R&amp;amp;D Disparities</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/18">doi: 10.3390/analytics4030018</a></p>
	<p>Authors:
		Amelia Zhao
		Peng Wang
		</p>
	<p>As digital transformation becomes an increasingly central focus of national and regional policy agendas, parallel efforts are intensifying to stimulate innovation as a critical driver of firm competitiveness and high-quality economic growth. However, regional disparities in innovation capacity persist. This study proposes an integrated framework in which regionally tracked digital economy indicators are leveraged to predict firm-level innovation performance, measured through patent activity, across China. Drawing on a comprehensive dataset covering 13 digital economic indicators from 2013 to 2022, this study spans core, broad, and narrow dimensions of digital development. Spatial dependencies among these indicators are assessed using global and local spatial autocorrelation measures, including Moran&amp;amp;rsquo;s I and Geary&amp;amp;rsquo;s C, to provide actionable insights for constructing innovation-conducive environments. To model the predictive relationship between digital metrics and innovation output, this study employs a suite of supervised machine learning techniques&amp;amp;mdash;Random Forest, Extreme Learning Machine (ELM), Support Vector Machine (SVM), XGBoost, and stacked ensemble approaches. Our findings demonstrate the potential of digital infrastructure metrics to serve as early indicators of regional innovation capacity, offering a data-driven foundation for targeted policymaking, strategic resource allocation, and the design of adaptive digital innovation ecosystems.</p>
	]]></content:encoded>

	<dc:title>Predictive Framework for Regional Patent Output Using Digital Economic Indicators: A Stacked Machine Learning and Geospatial Ensemble to Address R&amp;amp;D Disparities</dc:title>
			<dc:creator>Amelia Zhao</dc:creator>
			<dc:creator>Peng Wang</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030018</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-07-08</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-07-08</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>18</prism:startingPage>
		<prism:doi>10.3390/analytics4030018</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/18</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/3/17">

	<title>Analytics, Vol. 4, Pages 17: Domestication of Source Text in Literary Translation Prevails over Foreignization</title>
	<link>https://www.mdpi.com/2813-2203/4/3/17</link>
	<description>Domestication is a translation theory in which the source text (to be translated) is matched to the foreign reader by erasing its original linguistic and cultural difference. This match aims at making the target text (translated text) more fluent. On the contrary, foreignization is a translation theory in which the foreign reader is matched to the source text. This paper mathematically explores the degree of domestication/foreignization in current translation practice of texts written in alphabetical languages. A geometrical representation of texts, based on linear combinations of deep&amp;amp;ndash;language parameters, allows us (a) to calculate a domestication index which measures how much domestication is applied to the source text and (b) to distinguish language families. An expansion index measures the relative spread around mean values. This paper reports statistics and results on translations of (a) Greek New Testament books in Latin and in 35 modern languages, belonging to diverse language families; and (b) English novels in Western languages. English and French, although attributed to different language families, mathematically almost coincide. The requirement of making the target text more fluent makes domestication, with varying degrees, universally adopted, so that a blind comparison of the same linguistic parameters of a text and its translation hardly indicates that they refer to each other.</description>
	<pubDate>2025-06-20</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 17: Domestication of Source Text in Literary Translation Prevails over Foreignization</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/3/17">doi: 10.3390/analytics4030017</a></p>
	<p>Authors:
		Emilio Matricciani
		</p>
	<p>Domestication is a translation theory in which the source text (to be translated) is matched to the foreign reader by erasing its original linguistic and cultural difference. This match aims at making the target text (translated text) more fluent. On the contrary, foreignization is a translation theory in which the foreign reader is matched to the source text. This paper mathematically explores the degree of domestication/foreignization in current translation practice of texts written in alphabetical languages. A geometrical representation of texts, based on linear combinations of deep&amp;amp;ndash;language parameters, allows us (a) to calculate a domestication index which measures how much domestication is applied to the source text and (b) to distinguish language families. An expansion index measures the relative spread around mean values. This paper reports statistics and results on translations of (a) Greek New Testament books in Latin and in 35 modern languages, belonging to diverse language families; and (b) English novels in Western languages. English and French, although attributed to different language families, mathematically almost coincide. The requirement of making the target text more fluent makes domestication, with varying degrees, universally adopted, so that a blind comparison of the same linguistic parameters of a text and its translation hardly indicates that they refer to each other.</p>
	]]></content:encoded>

	<dc:title>Domestication of Source Text in Literary Translation Prevails over Foreignization</dc:title>
			<dc:creator>Emilio Matricciani</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4030017</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-06-20</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-06-20</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>17</prism:startingPage>
		<prism:doi>10.3390/analytics4030017</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/3/17</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/16">

	<title>Analytics, Vol. 4, Pages 16: The Classical Model of Type-Token Systems Compared with Items from the Standardized Project Gutenberg Corpus</title>
	<link>https://www.mdpi.com/2813-2203/4/2/16</link>
	<description>We compare the &amp;amp;ldquo;classical&amp;amp;rdquo; equations of type-token systems, namely Zipf&amp;amp;rsquo;s laws, Heaps&amp;amp;rsquo; law and the relationships between their indices, with data selected from the Standardized Project Gutenberg Corpus (SPGC). Selected items all exceed 100,000 word-tokens and are trimmed to 100,000 word-tokens each. With the most egregious anomalies removed, a dataset of 8432 items is examined in terms of the relationships between the Zipf and Heaps&amp;amp;rsquo; indices computed using the Maximum Likelihood algorithm. Zipf&amp;amp;rsquo;s second (size) law indices suggest that the types vs. frequency distribution is log&amp;amp;ndash;log convex, with the high and low frequency indices showing weak but significant negative correlation. Under certain circumstances, the classical equations work tolerably well, though the level of agreement depends heavily on the type of literature and the language (Finnish being notably anomalous). The frequency vs. rank characteristics exhibit log&amp;amp;ndash;log linearity in the &amp;amp;ldquo;middle range&amp;amp;rdquo; (ranks 100&amp;amp;ndash;1000), as characterised by the Kolmogorov&amp;amp;ndash;Smirnov significance. For most items, the Heaps&amp;amp;rsquo; index correlates strongly with the low frequency Zipf index in a manner consistent with classical theory, while the high frequency indices are largely uncorrelated. This is consistent with a simple simulation.</description>
	<pubDate>2025-06-05</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 16: The Classical Model of Type-Token Systems Compared with Items from the Standardized Project Gutenberg Corpus</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/16">doi: 10.3390/analytics4020016</a></p>
	<p>Authors:
		Martin Tunnicliffe
		Gordon Hunter
		</p>
	<p>We compare the &amp;amp;ldquo;classical&amp;amp;rdquo; equations of type-token systems, namely Zipf&amp;amp;rsquo;s laws, Heaps&amp;amp;rsquo; law and the relationships between their indices, with data selected from the Standardized Project Gutenberg Corpus (SPGC). Selected items all exceed 100,000 word-tokens and are trimmed to 100,000 word-tokens each. With the most egregious anomalies removed, a dataset of 8432 items is examined in terms of the relationships between the Zipf and Heaps&amp;amp;rsquo; indices computed using the Maximum Likelihood algorithm. Zipf&amp;amp;rsquo;s second (size) law indices suggest that the types vs. frequency distribution is log&amp;amp;ndash;log convex, with the high and low frequency indices showing weak but significant negative correlation. Under certain circumstances, the classical equations work tolerably well, though the level of agreement depends heavily on the type of literature and the language (Finnish being notably anomalous). The frequency vs. rank characteristics exhibit log&amp;amp;ndash;log linearity in the &amp;amp;ldquo;middle range&amp;amp;rdquo; (ranks 100&amp;amp;ndash;1000), as characterised by the Kolmogorov&amp;amp;ndash;Smirnov significance. For most items, the Heaps&amp;amp;rsquo; index correlates strongly with the low frequency Zipf index in a manner consistent with classical theory, while the high frequency indices are largely uncorrelated. This is consistent with a simple simulation.</p>
	]]></content:encoded>

	<dc:title>The Classical Model of Type-Token Systems Compared with Items from the Standardized Project Gutenberg Corpus</dc:title>
			<dc:creator>Martin Tunnicliffe</dc:creator>
			<dc:creator>Gordon Hunter</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020016</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-06-05</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-06-05</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>16</prism:startingPage>
		<prism:doi>10.3390/analytics4020016</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/16</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/15">

	<title>Analytics, Vol. 4, Pages 15: Multiplicity Adjustments for Differences in Proportion Parameters in Multiple-Sample Misclassified Binary Data</title>
	<link>https://www.mdpi.com/2813-2203/4/2/15</link>
	<description>Generally, following an omnibus (overall equality) test, multiple pairwise comparison (MPC) tests are typically conducted as the second step in a sequential testing procedure to identify which specific pairs (e.g., proportions) exhibit significant differences. In this manuscript, we develop maximum likelihood estimation (MLE) methods to construct three different types of confidence intervals (CIs) for multiple pairwise differences in proportions, specifically in contexts where both types of misclassifications (i.e., over-reporting and under-reporting) exist in multiple-sample binomial data. Our closed-form algorithm is straightforward to implement. Consequently, when dealing with multiple sample proportions, we can readily apply MPC adjustment procedures&amp;amp;mdash;such as Bonferroni, &amp;amp;Scaron;id&amp;amp;aacute;k, and Dunn&amp;amp;mdash;to address the issue of multiplicity. This manuscript advances the existing literature by extending from scenarios with only one type of misclassification to those involving both. Furthermore, we demonstrate our methods using a real-world data example.</description>
	<pubDate>2025-05-28</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 15: Multiplicity Adjustments for Differences in Proportion Parameters in Multiple-Sample Misclassified Binary Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/15">doi: 10.3390/analytics4020015</a></p>
	<p>Authors:
		Dewi Rahardja
		</p>
	<p>Generally, following an omnibus (overall equality) test, multiple pairwise comparison (MPC) tests are typically conducted as the second step in a sequential testing procedure to identify which specific pairs (e.g., proportions) exhibit significant differences. In this manuscript, we develop maximum likelihood estimation (MLE) methods to construct three different types of confidence intervals (CIs) for multiple pairwise differences in proportions, specifically in contexts where both types of misclassifications (i.e., over-reporting and under-reporting) exist in multiple-sample binomial data. Our closed-form algorithm is straightforward to implement. Consequently, when dealing with multiple sample proportions, we can readily apply MPC adjustment procedures&amp;amp;mdash;such as Bonferroni, &amp;amp;Scaron;id&amp;amp;aacute;k, and Dunn&amp;amp;mdash;to address the issue of multiplicity. This manuscript advances the existing literature by extending from scenarios with only one type of misclassification to those involving both. Furthermore, we demonstrate our methods using a real-world data example.</p>
	]]></content:encoded>

	<dc:title>Multiplicity Adjustments for Differences in Proportion Parameters in Multiple-Sample Misclassified Binary Data</dc:title>
			<dc:creator>Dewi Rahardja</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020015</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-05-28</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-05-28</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>15</prism:startingPage>
		<prism:doi>10.3390/analytics4020015</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/15</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/14">

	<title>Analytics, Vol. 4, Pages 14: Analytical Modeling of Ancillary Items</title>
	<link>https://www.mdpi.com/2813-2203/4/2/14</link>
	<description>Airlines' profitability increasingly depends on the sale of ancillary items such as seat selection, baggage fees, etc. The modeling of ancillary items is becoming more important in the analytics literature. Much of the modeling is stylized and not immediately applicable. This paper contains a review of the approaches and modeling assumptions made in the literature. The focus is on the assumptions made so that models may be evaluated for how effective they are for applications and to highlight gaps in the literature.</description>
	<pubDate>2025-05-19</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 14: Analytical Modeling of Ancillary Items</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/14">doi: 10.3390/analytics4020014</a></p>
	<p>Authors:
		John Wilson
		</p>
	<p>Airlines' profitability increasingly depends on the sale of ancillary items such as seat selection, baggage fees, etc. The modeling of ancillary items is becoming more important in the analytics literature. Much of the modeling is stylized and not immediately applicable. This paper contains a review of the approaches and modeling assumptions made in the literature. The focus is on the assumptions made so that models may be evaluated for how effective they are for applications and to highlight gaps in the literature.</p>
	]]></content:encoded>

	<dc:title>Analytical Modeling of Ancillary Items</dc:title>
			<dc:creator>John Wilson</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020014</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-05-19</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-05-19</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>14</prism:startingPage>
		<prism:doi>10.3390/analytics4020014</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/14</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/13">

	<title>Analytics, Vol. 4, Pages 13: Artificial Intelligence Applied to the Analysis of Biblical Scriptures: A Systematic Review</title>
	<link>https://www.mdpi.com/2813-2203/4/2/13</link>
	<description>The Holy Bible is the most read book in the world, originally written in Aramaic, Hebrew, and Greek over a time span in the order of centuries by many people, and formed by a combination of various literary styles, such as stories, prophecies, poetry, instructions, and others. As such, the Bible is a complex text to be analyzed by humans and machines. This paper provides a systematic survey of the application of Artificial Intelligence (AI) and some of its subareas to the analysis of the Biblical scriptures. Emphasis is given to what types of tasks are being solved, what are the main AI algorithms used, and their limitations. The findings deliver a general perspective on how this field is being developed, along with its limitations and gaps. This research follows a procedure based on three steps: planning (defining the review protocol), conducting (performing the survey), and reporting (formatting the report). The results obtained show there are seven main tasks solved by AI in the Bible analysis: machine translation, authorship identification, part of speech tagging (PoS tagging), semantic annotation, clustering, categorization, and Biblical interpretation. Also, the classes of AI techniques with better performance when applied to Biblical text research are machine learning, neural networks, and deep learning. The main challenges in the field involve the nature and style of the language used in the Bible, among others.</description>
	<pubDate>2025-04-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 13: Artificial Intelligence Applied to the Analysis of Biblical Scriptures: A Systematic Review</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/13">doi: 10.3390/analytics4020013</a></p>
	<p>Authors:
		Bruno Cesar Lima
		Nizam Omar
		Israel Avansi
		Leandro Nunes de Castro
		</p>
	<p>The Holy Bible is the most read book in the world, originally written in Aramaic, Hebrew, and Greek over a time span in the order of centuries by many people, and formed by a combination of various literary styles, such as stories, prophecies, poetry, instructions, and others. As such, the Bible is a complex text to be analyzed by humans and machines. This paper provides a systematic survey of the application of Artificial Intelligence (AI) and some of its subareas to the analysis of the Biblical scriptures. Emphasis is given to what types of tasks are being solved, what are the main AI algorithms used, and their limitations. The findings deliver a general perspective on how this field is being developed, along with its limitations and gaps. This research follows a procedure based on three steps: planning (defining the review protocol), conducting (performing the survey), and reporting (formatting the report). The results obtained show there are seven main tasks solved by AI in the Bible analysis: machine translation, authorship identification, part of speech tagging (PoS tagging), semantic annotation, clustering, categorization, and Biblical interpretation. Also, the classes of AI techniques with better performance when applied to Biblical text research are machine learning, neural networks, and deep learning. The main challenges in the field involve the nature and style of the language used in the Bible, among others.</p>
	]]></content:encoded>

	<dc:title>Artificial Intelligence Applied to the Analysis of Biblical Scriptures: A Systematic Review</dc:title>
			<dc:creator>Bruno Cesar Lima</dc:creator>
			<dc:creator>Nizam Omar</dc:creator>
			<dc:creator>Israel Avansi</dc:creator>
			<dc:creator>Leandro Nunes de Castro</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020013</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-04-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-04-11</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Systematic Review</prism:section>
	<prism:startingPage>13</prism:startingPage>
		<prism:doi>10.3390/analytics4020013</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/13</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/12">

	<title>Analytics, Vol. 4, Pages 12: Traffic Prediction with Data Fusion and Machine Learning</title>
	<link>https://www.mdpi.com/2813-2203/4/2/12</link>
	<description>Traffic prediction, as a core task to alleviate urban congestion and optimize the transport system, has limitations in the integration of multimodal data, making it difficult to comprehensively capture the complex spatio-temporal characteristics of the transport system. Although some studies have attempted to introduce multimodal data, they mostly rely on resource-intensive deep neural network architectures, which have difficulty meeting the demands of practical applications. To this end, we propose a traffic prediction framework based on simple machine learning techniques that effectively integrates property features, amenity features, and emotion features (PAE features). Validated with large-scale real datasets, the method demonstrates excellent prediction performance while significantly reducing computational complexity and deployment costs. This study demonstrates the great potential of simple machine learning techniques in multimodal data fusion, provides an efficient and practical solution for traffic prediction, and offers an effective alternative to resource-intensive deep learning methods, opening up new paths for building scalable traffic prediction systems.</description>
	<pubDate>2025-04-09</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 12: Traffic Prediction with Data Fusion and Machine Learning</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/12">doi: 10.3390/analytics4020012</a></p>
	<p>Authors:
		Juntao Qiu
		Yaping Zhao
		</p>
	<p>Traffic prediction, as a core task to alleviate urban congestion and optimize the transport system, has limitations in the integration of multimodal data, making it difficult to comprehensively capture the complex spatio-temporal characteristics of the transport system. Although some studies have attempted to introduce multimodal data, they mostly rely on resource-intensive deep neural network architectures, which have difficulty meeting the demands of practical applications. To this end, we propose a traffic prediction framework based on simple machine learning techniques that effectively integrates property features, amenity features, and emotion features (PAE features). Validated with large-scale real datasets, the method demonstrates excellent prediction performance while significantly reducing computational complexity and deployment costs. This study demonstrates the great potential of simple machine learning techniques in multimodal data fusion, provides an efficient and practical solution for traffic prediction, and offers an effective alternative to resource-intensive deep learning methods, opening up new paths for building scalable traffic prediction systems.</p>
	]]></content:encoded>

	<dc:title>Traffic Prediction with Data Fusion and Machine Learning</dc:title>
			<dc:creator>Juntao Qiu</dc:creator>
			<dc:creator>Yaping Zhao</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020012</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-04-09</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-04-09</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>12</prism:startingPage>
		<prism:doi>10.3390/analytics4020012</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/12</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/2/11">

	<title>Analytics, Vol. 4, Pages 11: Copula-Based Bayesian Model for Detecting Differential Gene Expression</title>
	<link>https://www.mdpi.com/2813-2203/4/2/11</link>
	<description>Deoxyribonucleic acid, more commonly known as DNA, is a fundamental genetic material in all living organisms, containing thousands of genes, but only a subset exhibit differential expression and play a crucial role in diseases. Microarray technology has revolutionized the study of gene expression, with two primary types available for expression analysis: spotted cDNA arrays and oligonucleotide arrays. This research focuses on the statistical analysis of data from spotted cDNA microarrays. Numerous models have been developed to identify differentially expressed genes based on the red and green fluorescence intensities measured using these arrays. We propose a novel approach using a Gaussian copula model to characterize the joint distribution of red and green intensities, effectively capturing their dependence structure. Given the right-skewed nature of the intensity distributions, we model the marginal distributions using gamma distributions. Differentially expressed genes are identified using the Bayes estimate under our proposed copula framework. To evaluate the performance of our model, we conduct simulation studies to assess parameter estimation accuracy. Our results demonstrate that the proposed approach outperforms existing methods reported in the literature. Finally, we apply our model to Escherichia coli microarray data, illustrating its practical utility in gene expression analysis.</description>
	<pubDate>2025-04-03</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 11: Copula-Based Bayesian Model for Detecting Differential Gene Expression</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/2/11">doi: 10.3390/analytics4020011</a></p>
	<p>Authors:
		Prasansha Liyanaarachchi
		N. Rao Chaganty
		</p>
	<p>Deoxyribonucleic acid, more commonly known as DNA, is a fundamental genetic material in all living organisms, containing thousands of genes, but only a subset exhibit differential expression and play a crucial role in diseases. Microarray technology has revolutionized the study of gene expression, with two primary types available for expression analysis: spotted cDNA arrays and oligonucleotide arrays. This research focuses on the statistical analysis of data from spotted cDNA microarrays. Numerous models have been developed to identify differentially expressed genes based on the red and green fluorescence intensities measured using these arrays. We propose a novel approach using a Gaussian copula model to characterize the joint distribution of red and green intensities, effectively capturing their dependence structure. Given the right-skewed nature of the intensity distributions, we model the marginal distributions using gamma distributions. Differentially expressed genes are identified using the Bayes estimate under our proposed copula framework. To evaluate the performance of our model, we conduct simulation studies to assess parameter estimation accuracy. Our results demonstrate that the proposed approach outperforms existing methods reported in the literature. Finally, we apply our model to Escherichia coli microarray data, illustrating its practical utility in gene expression analysis.</p>
	]]></content:encoded>

	<dc:title>Copula-Based Bayesian Model for Detecting Differential Gene Expression</dc:title>
			<dc:creator>Prasansha Liyanaarachchi</dc:creator>
			<dc:creator>N. Rao Chaganty</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4020011</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-04-03</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-04-03</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>11</prism:startingPage>
		<prism:doi>10.3390/analytics4020011</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/2/11</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/10">

	<title>Analytics, Vol. 4, Pages 10: Unveiling the Impact of Socioeconomic and Demographic Factors on Graduate Salaries: A Machine Learning Explanatory Analytical Approach Using Higher Education Statistical Agency Data</title>
	<link>https://www.mdpi.com/2813-2203/4/1/10</link>
	<description>Graduate salaries are a significant concern for graduates, employers, and policymakers, as various factors influence them. This study investigates determinants of graduate salaries in the UK, utilising survey data from HESA (Higher Education Statistical Agency) and integrating advanced machine learning (ML) explanatory techniques with statistical analytical methodologies. By employing multi-stage analyses alongside machine learning models such as decision trees, random forests and the explainability method SHAP (SHapley Additive exPlanations), this study investigates the influence of 21 socioeconomic and demographic variables on graduate salary outcomes. Key variables, including institutional reputation, age at graduation, socioeconomic classification, job qualification requirements, and domicile, emerged as critical determinants, with institutional reputation proving the most significant. Among ML methods, the decision tree achieved a standout with the highest accuracy through rigorous optimisation techniques, including oversampling and undersampling. SHAP highlighted the top 12 influential variables, providing actionable insights into the interplay between individual and systemic factors. Furthermore, the statistical analysis using ANOVA (Analysis of Variance) validated the significance of these variables, revealing intricate interactions that shape graduate salary dynamics. Additionally, domain experts&amp;amp;rsquo; opinions are also analysed to authenticate the findings. This research makes a unique contribution by combining qualitative contextual analysis with quantitative methodologies, machine learning explainability and domain experts&amp;amp;rsquo; views on addressing gaps in the existing identification of graduate salary predicting components. Additionally, the findings inform policy and educational interventions to reduce wage inequalities and promote equitable career opportunities. 
Despite limitations, such as the UK-specific dataset and the focus on socioeconomic and demographic variables, this study lays a robust foundation for future research in predictive modelling and graduate outcomes.</description>
	<pubDate>2025-03-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 10: Unveiling the Impact of Socioeconomic and Demographic Factors on Graduate Salaries: A Machine Learning Explanatory Analytical Approach Using Higher Education Statistical Agency Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/10">doi: 10.3390/analytics4010010</a></p>
	<p>Authors:
		Bassey Henshaw
		Bhupesh Kumar Mishra
		William Sayers
		Zeeshan Pervez
		</p>
	<p>Graduate salaries are a significant concern for graduates, employers, and policymakers, as various factors influence them. This study investigates determinants of graduate salaries in the UK, utilising survey data from HESA (Higher Education Statistical Agency) and integrating advanced machine learning (ML) explanatory techniques with statistical analytical methodologies. By employing multi-stage analyses alongside machine learning models such as decision trees, random forests and the explainability method SHAP (SHapley Additive exPlanations), this study investigates the influence of 21 socioeconomic and demographic variables on graduate salary outcomes. Key variables, including institutional reputation, age at graduation, socioeconomic classification, job qualification requirements, and domicile, emerged as critical determinants, with institutional reputation proving the most significant. Among ML methods, the decision tree achieved a standout with the highest accuracy through rigorous optimisation techniques, including oversampling and undersampling. SHAP highlighted the top 12 influential variables, providing actionable insights into the interplay between individual and systemic factors. Furthermore, the statistical analysis using ANOVA (Analysis of Variance) validated the significance of these variables, revealing intricate interactions that shape graduate salary dynamics. Additionally, domain experts&amp;amp;rsquo; opinions are also analysed to authenticate the findings. This research makes a unique contribution by combining qualitative contextual analysis with quantitative methodologies, machine learning explainability and domain experts&amp;amp;rsquo; views on addressing gaps in the existing identification of graduate salary predicting components. Additionally, the findings inform policy and educational interventions to reduce wage inequalities and promote equitable career opportunities. 
Despite limitations, such as the UK-specific dataset and the focus on socioeconomic and demographic variables, this study lays a robust foundation for future research in predictive modelling and graduate outcomes.</p>
	]]></content:encoded>

	<dc:title>Unveiling the Impact of Socioeconomic and Demographic Factors on Graduate Salaries: A Machine Learning Explanatory Analytical Approach Using Higher Education Statistical Agency Data</dc:title>
			<dc:creator>Bassey Henshaw</dc:creator>
			<dc:creator>Bhupesh Kumar Mishra</dc:creator>
			<dc:creator>William Sayers</dc:creator>
			<dc:creator>Zeeshan Pervez</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010010</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-03-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-03-11</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>10</prism:startingPage>
		<prism:doi>10.3390/analytics4010010</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/10</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/9">

	<title>Analytics, Vol. 4, Pages 9: Updated Aims and Scope of Analytics</title>
	<link>https://www.mdpi.com/2813-2203/4/1/9</link>
	<description>Analytics [...]</description>
	<pubDate>2025-03-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 9: Updated Aims and Scope of Analytics</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/9">doi: 10.3390/analytics4010009</a></p>
	<p>Authors:
		Carson K. Leung
		</p>
	<p>Analytics [...]</p>
	]]></content:encoded>

	<dc:title>Updated Aims and Scope of Analytics</dc:title>
			<dc:creator>Carson K. Leung</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010009</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-03-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-03-06</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Editorial</prism:section>
	<prism:startingPage>9</prism:startingPage>
		<prism:doi>10.3390/analytics4010009</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/9</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/8">

	<title>Analytics, Vol. 4, Pages 8: The Role of Cognitive Performance in Older Europeans&amp;rsquo; General Health: Insights from Relative Importance Analysis</title>
	<link>https://www.mdpi.com/2813-2203/4/1/8</link>
	<description>This study explores the role of cognitive performance in the general health of older Europeans aged 50 and over, focusing on gender differences, using data from 336,500 respondents in the sixth wave of the Survey of Health, Aging, and Retirement in Europe (SHARE). Cognitive functioning was assessed through self-rated reading and writing skills, orientation in time, numeracy, memory, verbal fluency, and word-list learning. General health status was estimated by constructing a composite index of physical and mental health-related measures, including chronic diseases, mobility limitations, depressive symptoms, self-perceived health, and the Global Activity Limitation Indicator. Participants were classified into good or poor health status, and logistic regression models assessed the predictive significance of cognitive variables on general health, supplemented by a relative importance analysis to estimate relative effect sizes. The results indicated that males had a 51.1% lower risk of reporting poor health than females, and older age was associated with a 4.0% increase in the odds of reporting worse health for both genders. Memory was the strongest predictor of health status (26% of the model R2), with a greater relative contribution than the other cognitive variables. No significant gender differences were found. While this study estimates the odds of reporting poorer health in relation to gender and various cognitive characteristics, adopting a lifespan approach could provide valuable insights into the longitudinal associations between cognitive functioning and health outcomes.</description>
	<pubDate>2025-03-04</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 8: The Role of Cognitive Performance in Older Europeans&amp;rsquo; General Health: Insights from Relative Importance Analysis</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/8">doi: 10.3390/analytics4010008</a></p>
	<p>Authors:
		Eleni Serafetinidou
		Christina Parpoula
		</p>
	<p>This study explores the role of cognitive performance in the general health of older Europeans aged 50 and over, focusing on gender differences, using data from 336,500 respondents in the sixth wave of the Survey of Health, Aging, and Retirement in Europe (SHARE). Cognitive functioning was assessed through self-rated reading and writing skills, orientation in time, numeracy, memory, verbal fluency, and word-list learning. General health status was estimated by constructing a composite index of physical and mental health-related measures, including chronic diseases, mobility limitations, depressive symptoms, self-perceived health, and the Global Activity Limitation Indicator. Participants were classified into good or poor health status, and logistic regression models assessed the predictive significance of cognitive variables on general health, supplemented by a relative importance analysis to estimate relative effect sizes. The results indicated that males had a 51.1% lower risk of reporting poor health than females, and older age was associated with a 4.0% increase in the odds of reporting worse health for both genders. Memory was the strongest predictor of health status (26% of the model R2), with a greater relative contribution than the other cognitive variables. No significant gender differences were found. While this study estimates the odds of reporting poorer health in relation to gender and various cognitive characteristics, adopting a lifespan approach could provide valuable insights into the longitudinal associations between cognitive functioning and health outcomes.</p>
	]]></content:encoded>

	<dc:title>The Role of Cognitive Performance in Older Europeans&amp;rsquo; General Health: Insights from Relative Importance Analysis</dc:title>
			<dc:creator>Eleni Serafetinidou</dc:creator>
			<dc:creator>Christina Parpoula</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010008</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-03-04</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-03-04</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>8</prism:startingPage>
		<prism:doi>10.3390/analytics4010008</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/8</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/7">

	<title>Analytics, Vol. 4, Pages 7: Towards Visual Analytics for Explainable AI in Industrial Applications</title>
	<link>https://www.mdpi.com/2813-2203/4/1/7</link>
	<description>As the levels of automation and reliance on modern artificial intelligence (AI) approaches increase across multiple industries, the importance of the human-centered perspective becomes more evident. Various actors in such industrial applications, including equipment operators and decision makers, have their needs and preferences that often do not align with the decisions produced by black-box models, potentially leading to mistrust and wasted productivity gain opportunities. In this paper, we examine these issues through the lenses of visual analytics and, more broadly, interactive visualization, and we argue that the methods and techniques from these fields can lead to advances in both academic research and industrial innovations concerning the explainability of AI models. To address the existing gap within and across the research and application fields, we propose a conceptual framework for visual analytics design and evaluation for such scenarios, followed by a preliminary roadmap and call to action for the respective communities.</description>
	<pubDate>2025-02-12</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 7: Towards Visual Analytics for Explainable AI in Industrial Applications</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/7">doi: 10.3390/analytics4010007</a></p>
	<p>Authors:
		Kostiantyn Kucher
		Elmira Zohrevandi
		Carl A. L. Westin
		</p>
	<p>As the levels of automation and reliance on modern artificial intelligence (AI) approaches increase across multiple industries, the importance of the human-centered perspective becomes more evident. Various actors in such industrial applications, including equipment operators and decision makers, have their needs and preferences that often do not align with the decisions produced by black-box models, potentially leading to mistrust and wasted productivity gain opportunities. In this paper, we examine these issues through the lenses of visual analytics and, more broadly, interactive visualization, and we argue that the methods and techniques from these fields can lead to advances in both academic research and industrial innovations concerning the explainability of AI models. To address the existing gap within and across the research and application fields, we propose a conceptual framework for visual analytics design and evaluation for such scenarios, followed by a preliminary roadmap and call to action for the respective communities.</p>
	]]></content:encoded>

	<dc:title>Towards Visual Analytics for Explainable AI in Industrial Applications</dc:title>
			<dc:creator>Kostiantyn Kucher</dc:creator>
			<dc:creator>Elmira Zohrevandi</dc:creator>
			<dc:creator>Carl A. L. Westin</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010007</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-02-12</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-02-12</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>7</prism:startingPage>
		<prism:doi>10.3390/analytics4010007</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/7</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/6">

	<title>Analytics, Vol. 4, Pages 6: Monetary Policy Sentiment and Its Influence on Healthcare and Technology Markets: A Transformer Model Approach</title>
	<link>https://www.mdpi.com/2813-2203/4/1/6</link>
	<description>This study investigates how the Federal Open Market Committee&amp;rsquo;s (FOMC) statements impact healthcare spending, mental health trends, and stock performance in healthcare and tech sectors. By analyzing FOMC&amp;rsquo;s sentiment from 2018 to 2024, we found that higher sentiment correlates with increased depressive disorders (2019&amp;ndash;2021) and tech stock returns, especially for the &amp;ldquo;Magnificent Seven&amp;rdquo; (like Apple and Amazon). Although healthcare stocks showed weaker ties to sentiment, Granger causality tests suggest some influence, hinting at ways to adjust stock strategies based on FOMC trends. These results highlight how central bank communication can shape both mental health dynamics and investment decisions in healthcare and technology.</description>
	<pubDate>2025-02-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 6: Monetary Policy Sentiment and Its Influence on Healthcare and Technology Markets: A Transformer Model Approach</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/6">doi: 10.3390/analytics4010006</a></p>
	<p>Authors:
		Dongnan Liu
		Jong-Min Kim
		</p>
	<p>This study investigates how the Federal Open Market Committee&rsquo;s (FOMC) statements impact healthcare spending, mental health trends, and stock performance in healthcare and tech sectors. By analyzing FOMC&rsquo;s sentiment from 2018 to 2024, we found that higher sentiment correlates with increased depressive disorders (2019&ndash;2021) and tech stock returns, especially for the &ldquo;Magnificent Seven&rdquo; (like Apple and Amazon). Although healthcare stocks showed weaker ties to sentiment, Granger causality tests suggest some influence, hinting at ways to adjust stock strategies based on FOMC trends. These results highlight how central bank communication can shape both mental health dynamics and investment decisions in healthcare and technology.</p>
	]]></content:encoded>

	<dc:title>Monetary Policy Sentiment and Its Influence on Healthcare and Technology Markets: A Transformer Model Approach</dc:title>
			<dc:creator>Dongnan Liu</dc:creator>
			<dc:creator>Jong-Min Kim</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010006</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-02-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-02-11</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>6</prism:startingPage>
		<prism:doi>10.3390/analytics4010006</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/6</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/5">

	<title>Analytics, Vol. 4, Pages 5: A Comparative Analysis of Machine Learning and Deep Learning Techniques for Accurate Market Price Forecasting</title>
	<link>https://www.mdpi.com/2813-2203/4/1/5</link>
	<description>This study compares three machine learning and deep learning models&amp;mdash;Support Vector Regression (SVR), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM)&amp;mdash;for predicting market prices using the NGX All-Share Index dataset. The models were evaluated using multiple error metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE), Mean Percentage Error (MPE), and R-squared. RNN and LSTM were tested with both 30 and 60-day windows, with performance compared to SVR. LSTM delivered better R-squared values, with a 60-day LSTM achieving the best accuracy (R-squared = 0.993) when using a combination of endogenous market data and technical indicators. SVR showed reliable results in certain scenarios but struggled in fold 2 with a sudden spike that shows a high probability of not capturing the entire underlying NGX pattern in the dataset correctly, as witnessed by the high validation loss during the period. Additionally, RNN faced the vanishing gradient problem that limits its long-term performance. Despite challenges, LSTM&amp;rsquo;s ability to handle temporal dependencies, especially with the inclusion of On-Balance Volume, led to significant improvements in prediction accuracy. The use of the Optuna optimisation framework further enhanced model training and hyperparameter tuning, contributing to the performance of the LSTM model.</description>
	<pubDate>2025-02-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 5: A Comparative Analysis of Machine Learning and Deep Learning Techniques for Accurate Market Price Forecasting</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/5">doi: 10.3390/analytics4010005</a></p>
	<p>Authors:
		Olamilekan Shobayo
		Sidikat Adeyemi-Longe
		Olusogo Popoola
		Obinna Okoyeigbo
		</p>
	<p>This study compares three machine learning and deep learning models&mdash;Support Vector Regression (SVR), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM)&mdash;for predicting market prices using the NGX All-Share Index dataset. The models were evaluated using multiple error metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Square Error (RMSE), Mean Percentage Error (MPE), and R-squared. RNN and LSTM were tested with both 30 and 60-day windows, with performance compared to SVR. LSTM delivered better R-squared values, with a 60-day LSTM achieving the best accuracy (R-squared = 0.993) when using a combination of endogenous market data and technical indicators. SVR showed reliable results in certain scenarios but struggled in fold 2 with a sudden spike that shows a high probability of not capturing the entire underlying NGX pattern in the dataset correctly, as witnessed by the high validation loss during the period. Additionally, RNN faced the vanishing gradient problem that limits its long-term performance. Despite challenges, LSTM&rsquo;s ability to handle temporal dependencies, especially with the inclusion of On-Balance Volume, led to significant improvements in prediction accuracy. The use of the Optuna optimisation framework further enhanced model training and hyperparameter tuning, contributing to the performance of the LSTM model.</p>
	]]></content:encoded>

	<dc:title>A Comparative Analysis of Machine Learning and Deep Learning Techniques for Accurate Market Price Forecasting</dc:title>
			<dc:creator>Olamilekan Shobayo</dc:creator>
			<dc:creator>Sidikat Adeyemi-Longe</dc:creator>
			<dc:creator>Olusogo Popoola</dc:creator>
			<dc:creator>Obinna Okoyeigbo</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010005</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-02-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-02-11</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>5</prism:startingPage>
		<prism:doi>10.3390/analytics4010005</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/5</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/4">

	<title>Analytics, Vol. 4, Pages 4: Personalizing Multimedia Content Recommendations for Intelligent Vehicles Through Text&amp;ndash;Image Embedding Approaches</title>
	<link>https://www.mdpi.com/2813-2203/4/1/4</link>
	<description>The ability to automate and personalize the recommendation of multimedia contents to consumers has been gaining significant attention recently. The burgeoning demand for digitization and automation of formerly analog communication processes has caught the attention of researchers and professionals alike. In light of the recent interest and anticipated transition to fully autonomous vehicles, this study proposes a text&amp;ndash;image embedding method recommender system for the optimization of personalized multimedia content for in-vehicle infotainment. This study leverages existing pre-trained text embedding models and pre-trained image feature extraction methods. Previous research to date has focused mainly on textual-only or image-only analyses. By employing similarity measurements, this study demonstrates how recommendation of the most relevant multimedia content to consumers is enhanced through text&amp;ndash;image embedding.</description>
	<pubDate>2025-02-05</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 4: Personalizing Multimedia Content Recommendations for Intelligent Vehicles Through Text&ndash;Image Embedding Approaches</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/4">doi: 10.3390/analytics4010004</a></p>
	<p>Authors:
		Jin-A Choi
		Taekeun Hong
		Kiho Lim
		</p>
	<p>The ability to automate and personalize the recommendation of multimedia contents to consumers has been gaining significant attention recently. The burgeoning demand for digitization and automation of formerly analog communication processes has caught the attention of researchers and professionals alike. In light of the recent interest and anticipated transition to fully autonomous vehicles, this study proposes a text&ndash;image embedding method recommender system for the optimization of personalized multimedia content for in-vehicle infotainment. This study leverages existing pre-trained text embedding models and pre-trained image feature extraction methods. Previous research to date has focused mainly on textual-only or image-only analyses. By employing similarity measurements, this study demonstrates how recommendation of the most relevant multimedia content to consumers is enhanced through text&ndash;image embedding.</p>
	]]></content:encoded>

	<dc:title>Personalizing Multimedia Content Recommendations for Intelligent Vehicles Through Text&amp;ndash;Image Embedding Approaches</dc:title>
			<dc:creator>Jin-A Choi</dc:creator>
			<dc:creator>Taekeun Hong</dc:creator>
			<dc:creator>Kiho Lim</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010004</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-02-05</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-02-05</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>4</prism:startingPage>
		<prism:doi>10.3390/analytics4010004</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/4</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/3">

	<title>Analytics, Vol. 4, Pages 3: A Fuzzy Analytical Network Process Framework for Prioritizing Competitive Intelligence in Startups</title>
	<link>https://www.mdpi.com/2813-2203/4/1/3</link>
	<description>Competitive intelligence (CI) is a critical tool for startups, enabling informed decision making through the systematic gathering and analysis of relevant information. This study aims to identify and prioritize the key factors influencing CI in startups, providing actionable insights for entrepreneurs, educators, and support organizations. Through a systematic literature review, key variables and components impacting competitive intelligence were identified. Two surveys were conducted to refine these components. The first employed a five-point Likert scale to evaluate the significance of each component, while the second used a pairwise comparison approach involving ten experts in CI and startup mentorship. Utilizing the fuzzy Analytical Network Process (ANP), this study ranked Technology Intelligence as the most critical factor, followed by market and Strategic Intelligence. Competitor Intelligence and Internet intelligence were deemed moderately important, while Organizational Intelligence ranked lowest. These findings emphasize the importance of technology-driven insights and market awareness in fostering startups&amp;rsquo; competitive advantage and informed decision making. This study provides a structured framework to guide startups in prioritizing CI efforts, offering practical strategies for navigating dynamic market conditions and achieving long-term success.</description>
	<pubDate>2025-01-14</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 3: A Fuzzy Analytical Network Process Framework for Prioritizing Competitive Intelligence in Startups</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/3">doi: 10.3390/analytics4010003</a></p>
	<p>Authors:
		Arman Golshan
		Soheila Sardar
		Seyed Faraz Mahdavi Ardestani
		Paria Sadeghian
		</p>
	<p>Competitive intelligence (CI) is a critical tool for startups, enabling informed decision making through the systematic gathering and analysis of relevant information. This study aims to identify and prioritize the key factors influencing CI in startups, providing actionable insights for entrepreneurs, educators, and support organizations. Through a systematic literature review, key variables and components impacting competitive intelligence were identified. Two surveys were conducted to refine these components. The first employed a five-point Likert scale to evaluate the significance of each component, while the second used a pairwise comparison approach involving ten experts in CI and startup mentorship. Utilizing the fuzzy Analytical Network Process (ANP), this study ranked Technology Intelligence as the most critical factor, followed by market and Strategic Intelligence. Competitor Intelligence and Internet intelligence were deemed moderately important, while Organizational Intelligence ranked lowest. These findings emphasize the importance of technology-driven insights and market awareness in fostering startups&rsquo; competitive advantage and informed decision making. This study provides a structured framework to guide startups in prioritizing CI efforts, offering practical strategies for navigating dynamic market conditions and achieving long-term success.</p>
	]]></content:encoded>

	<dc:title>A Fuzzy Analytical Network Process Framework for Prioritizing Competitive Intelligence in Startups</dc:title>
			<dc:creator>Arman Golshan</dc:creator>
			<dc:creator>Soheila Sardar</dc:creator>
			<dc:creator>Seyed Faraz Mahdavi Ardestani</dc:creator>
			<dc:creator>Paria Sadeghian</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010003</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-01-14</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-01-14</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>3</prism:startingPage>
		<prism:doi>10.3390/analytics4010003</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/3</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/2">

	<title>Analytics, Vol. 4, Pages 2: Use of Hazard Functions for Determining Power-Law Behaviour in Data</title>
	<link>https://www.mdpi.com/2813-2203/4/1/2</link>
	<description>Determining the &amp;lsquo;best-fitting&amp;rsquo; distribution for data is an important problem in data analysis. Specifically, observing how the distribution of data changes as values below (or above) a threshold are omitted from analyses can be of use in various applications, from animal movement to the modelling of natural phenomena. Such truncated distributions, known as hazard functions, are widely studied and well understood in survival analysis, although rarely widely used in data analysis. Here, by considering the hazard and reverse-hazard functions, we demonstrate a qualitative assessment of the &amp;lsquo;best-fit&amp;rsquo; distribution of data. Specifically, we highlight the potential advantages of this method when determining whether power-law behaviour may or may not be present in data. Finally, we demonstrate this approach using some real-world datasets.</description>
	<pubDate>2025-01-09</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 2: Use of Hazard Functions for Determining Power-Law Behaviour in Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/2">doi: 10.3390/analytics4010002</a></p>
	<p>Authors:
		Joseph D. Bailey
		</p>
	<p>Determining the &lsquo;best-fitting&rsquo; distribution for data is an important problem in data analysis. Specifically, observing how the distribution of data changes as values below (or above) a threshold are omitted from analyses can be of use in various applications, from animal movement to the modelling of natural phenomena. Such truncated distributions, known as hazard functions, are widely studied and well understood in survival analysis, although rarely widely used in data analysis. Here, by considering the hazard and reverse-hazard functions, we demonstrate a qualitative assessment of the &lsquo;best-fit&rsquo; distribution of data. Specifically, we highlight the potential advantages of this method when determining whether power-law behaviour may or may not be present in data. Finally, we demonstrate this approach using some real-world datasets.</p>
	]]></content:encoded>

	<dc:title>Use of Hazard Functions for Determining Power-Law Behaviour in Data</dc:title>
			<dc:creator>Joseph D. Bailey</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010002</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-01-09</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-01-09</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>2</prism:startingPage>
		<prism:doi>10.3390/analytics4010002</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/2</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/4/1/1">

	<title>Analytics, Vol. 4, Pages 1: Uncovering Patterns and Trends in Big Data-Driven Research Through Text Mining of NSF Award Synopses</title>
	<link>https://www.mdpi.com/2813-2203/4/1/1</link>
	<description>The rapid expansion of big data has transformed research practices across disciplines, yet disparities exist in its adoption among U.S. institutions of higher education. This study examines trends in NSF-funded big data-driven research across research domains, institutional classifications, and directorates. Using a quantitative approach and natural language processing (NLP) techniques, we analyzed NSF awards from 2006 to 2022, focusing on seven NSF research areas: Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematical and Physical Sciences, Social, Behavioral and Economic Sciences, and STEM Education (formerly known as Education and Human Resources). Findings indicate a significant increase in big data-related awards over time, with CISE (Computer and Information Science and Engineering) leading in funding. Machine learning and artificial intelligence are dominant themes across all institutions&amp;rsquo; classifications. Results show that R1 and non-minority-serving institutions receive the majority of big data-driven research funding, though HBCUs have seen recent growth due to national diversity initiatives. Topic modeling reveals key subdomains such as cybersecurity and bioinformatics benefiting from big data, while areas like Biological Sciences and Social Sciences engage less with these methods. These findings suggest the need for broader support and funding to foster equitable adoption of big data methods across institutions and disciplines.</description>
	<pubDate>2025-01-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 4, Pages 1: Uncovering Patterns and Trends in Big Data-Driven Research Through Text Mining of NSF Award Synopses</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/4/1/1">doi: 10.3390/analytics4010001</a></p>
	<p>Authors:
		Arielle King
		Sayed A. Mostafa
		</p>
	<p>The rapid expansion of big data has transformed research practices across disciplines, yet disparities exist in its adoption among U.S. institutions of higher education. This study examines trends in NSF-funded big data-driven research across research domains, institutional classifications, and directorates. Using a quantitative approach and natural language processing (NLP) techniques, we analyzed NSF awards from 2006 to 2022, focusing on seven NSF research areas: Biological Sciences, Computer and Information Science and Engineering, Engineering, Geosciences, Mathematical and Physical Sciences, Social, Behavioral and Economic Sciences, and STEM Education (formerly known as Education and Human Resources). Findings indicate a significant increase in big data-related awards over time, with CISE (Computer and Information Science and Engineering) leading in funding. Machine learning and artificial intelligence are dominant themes across all institutions&rsquo; classifications. Results show that R1 and non-minority-serving institutions receive the majority of big data-driven research funding, though HBCUs have seen recent growth due to national diversity initiatives. Topic modeling reveals key subdomains such as cybersecurity and bioinformatics benefiting from big data, while areas like Biological Sciences and Social Sciences engage less with these methods. These findings suggest the need for broader support and funding to foster equitable adoption of big data methods across institutions and disciplines.</p>
	]]></content:encoded>

	<dc:title>Uncovering Patterns and Trends in Big Data-Driven Research Through Text Mining of NSF Award Synopses</dc:title>
			<dc:creator>Arielle King</dc:creator>
			<dc:creator>Sayed A. Mostafa</dc:creator>
		<dc:identifier>doi: 10.3390/analytics4010001</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2025-01-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2025-01-06</prism:publicationDate>
	<prism:volume>4</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>1</prism:startingPage>
		<prism:doi>10.3390/analytics4010001</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/4/1/1</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/28">

	<title>Analytics, Vol. 3, Pages 493-507: Advancements in Predictive Maintenance: A Bibliometric Review of Diagnostic Models Using Machine Learning Techniques</title>
	<link>https://www.mdpi.com/2813-2203/3/4/28</link>
	<description>This bibliometric review investigates the advancements in machine learning techniques for predictive maintenance, focusing on the use of Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) for fault detection in wheelset axle bearings. Using data from Scopus and Web of Science, the review analyses key trends, influential publications, and significant contributions to the field from 2000 to 2024. The findings highlight the performance of ANNs in handling large datasets and modelling complex, non-linear relationships, as well as the high accuracy of SVMs in fault classification tasks, particularly with small-to-medium-sized datasets. However, the study also identifies several limitations, including the dependency on high-quality data, significant computational resource requirements, limited model adaptability, interpretability challenges, and practical implementation complexities. This review provides valuable insights for researchers and engineers, guiding the selection of appropriate diagnostic models and highlighting opportunities for future research. Addressing the identified limitations is crucial for the broader adoption and effectiveness of machine learning-based predictive maintenance strategies across various industrial contexts.</description>
	<pubDate>2024-12-10</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 493-507: Advancements in Predictive Maintenance: A Bibliometric Review of Diagnostic Models Using Machine Learning Techniques</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/28">doi: 10.3390/analytics3040028</a></p>
	<p>Authors:
		Nontuthuzelo Lindokuhle Vithi
		Colin Chibaya
		</p>
	<p>This bibliometric review investigates the advancements in machine learning techniques for predictive maintenance, focusing on the use of Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) for fault detection in wheelset axle bearings. Using data from Scopus and Web of Science, the review analyses key trends, influential publications, and significant contributions to the field from 2000 to 2024. The findings highlight the performance of ANNs in handling large datasets and modelling complex, non-linear relationships, as well as the high accuracy of SVMs in fault classification tasks, particularly with small-to-medium-sized datasets. However, the study also identifies several limitations, including the dependency on high-quality data, significant computational resource requirements, limited model adaptability, interpretability challenges, and practical implementation complexities. This review provides valuable insights for researchers and engineers, guiding the selection of appropriate diagnostic models and highlighting opportunities for future research. Addressing the identified limitations is crucial for the broader adoption and effectiveness of machine learning-based predictive maintenance strategies across various industrial contexts.</p>
	]]></content:encoded>

	<dc:title>Advancements in Predictive Maintenance: A Bibliometric Review of Diagnostic Models Using Machine Learning Techniques</dc:title>
			<dc:creator>Nontuthuzelo Lindokuhle Vithi</dc:creator>
			<dc:creator>Colin Chibaya</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040028</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-12-10</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-12-10</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>493</prism:startingPage>
		<prism:doi>10.3390/analytics3040028</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/28</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/27">

	<title>Analytics, Vol. 3, Pages 476-492: NPI-WGNN: A Weighted Graph Neural Network Leveraging Centrality Measures and High-Order Common Neighbor Similarity for Accurate ncRNA&amp;ndash;Protein Interaction Prediction</title>
	<link>https://www.mdpi.com/2813-2203/3/4/27</link>
	<description>Predicting ncRNA&amp;ndash;protein interactions (NPIs) is essential for understanding regulatory roles in cellular processes and disease mechanisms, yet experimental methods are costly and time-consuming. In this study, we propose NPI-WGNN, a novel weighted graph neural network model designed to enhance NPI prediction by incorporating topological insights from graph structures. Our approach introduces a bipartite version of the high-order common neighbor (HOCN) similarity metric to assign edge weights in an ncRNA&amp;ndash;protein network, refining node embeddings via weighted node2vec. We further enrich these embeddings with centrality measures, such as degree and Katz centralities, to capture network hierarchy and connectivity. To optimize prediction accuracy, we employ a hybrid GNN architecture that combines graph convolutional network (GCN), graph attention network (GAT), and GraphSAGE layers, each contributing unique advantages: GraphSAGE offers scalability, GCN provides a global structural perspective, and GAT applies dynamic neighbor weighting. An ablation study confirms the complementary strengths of these layers, showing that their integration improves predictive accuracy and robustness across varied graph complexities. Experimental results on three benchmark datasets demonstrate that NPI-WGNN outperforms state-of-the-art methods, achieving up to 96.1% accuracy, 97.5% sensitivity, and an F1-score of 0.96, positioning it as a robust and accurate framework for ncRNA&amp;ndash;protein interaction prediction.</description>
	<pubDate>2024-12-02</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 476-492: NPI-WGNN: A Weighted Graph Neural Network Leveraging Centrality Measures and High-Order Common Neighbor Similarity for Accurate ncRNA&amp;ndash;Protein Interaction Prediction</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/27">doi: 10.3390/analytics3040027</a></p>
	<p>Authors:
		Fatemeh Khoushehgir
		Zahra Noshad
		Morteza Noshad
		Sadegh Sulaimany
		</p>
	<p>Predicting ncRNA&amp;ndash;protein interactions (NPIs) is essential for understanding regulatory roles in cellular processes and disease mechanisms, yet experimental methods are costly and time-consuming. In this study, we propose NPI-WGNN, a novel weighted graph neural network model designed to enhance NPI prediction by incorporating topological insights from graph structures. Our approach introduces a bipartite version of the high-order common neighbor (HOCN) similarity metric to assign edge weights in an ncRNA&amp;ndash;protein network, refining node embeddings via weighted node2vec. We further enrich these embeddings with centrality measures, such as degree and Katz centralities, to capture network hierarchy and connectivity. To optimize prediction accuracy, we employ a hybrid GNN architecture that combines graph convolutional network (GCN), graph attention network (GAT), and GraphSAGE layers, each contributing unique advantages: GraphSAGE offers scalability, GCN provides a global structural perspective, and GAT applies dynamic neighbor weighting. An ablation study confirms the complementary strengths of these layers, showing that their integration improves predictive accuracy and robustness across varied graph complexities. Experimental results on three benchmark datasets demonstrate that NPI-WGNN outperforms state-of-the-art methods, achieving up to 96.1% accuracy, 97.5% sensitivity, and an F1-score of 0.96, positioning it as a robust and accurate framework for ncRNA&amp;ndash;protein interaction prediction.</p>
	]]></content:encoded>

	<dc:title>NPI-WGNN: A Weighted Graph Neural Network Leveraging Centrality Measures and High-Order Common Neighbor Similarity for Accurate ncRNA&amp;ndash;Protein Interaction Prediction</dc:title>
			<dc:creator>Fatemeh Khoushehgir</dc:creator>
			<dc:creator>Zahra Noshad</dc:creator>
			<dc:creator>Morteza Noshad</dc:creator>
			<dc:creator>Sadegh Sulaimany</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040027</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-12-02</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-12-02</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>476</prism:startingPage>
		<prism:doi>10.3390/analytics3040027</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/27</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/26">

	<title>Analytics, Vol. 3, Pages 461-475: Breast Cancer Classification Using Fine-Tuned SWIN Transformer Model on Mammographic Images</title>
	<link>https://www.mdpi.com/2813-2203/3/4/26</link>
	<description>Breast cancer is the most prevalent type of disease among women. It has become one of the foremost causes of death among women globally. Early detection plays a significant role in administering personalized treatment and improving patient outcomes. Mammography procedures are often used to detect early-stage cancer cells. This traditional method of mammography while valuable has limitations in its potential for false positives and negatives, patient discomfort, and radiation exposure. Therefore, there is a probe for more accurate techniques required in detecting breast cancer, leading to exploring the potential of machine learning in the classification of diagnostic images due to its efficiency and accuracy. This study conducted a comparative analysis of pre-trained CNNs (ResNet50 and VGG16) and vision transformers (ViT-base and SWIN transformer) with the inclusion of ViT-base trained from scratch model architectures to effectively classify mammographic breast cancer images into benign and malignant cases. The SWIN transformer exhibits superior performance with 99.9% accuracy and a precision of 99.8%. These findings demonstrate the efficiency of deep learning to accurately classify mammographic breast cancer images for the diagnosis of breast cancer, leading to improvements in patient outcomes.</description>
	<pubDate>2024-11-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 461-475: Breast Cancer Classification Using Fine-Tuned SWIN Transformer Model on Mammographic Images</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/26">doi: 10.3390/analytics3040026</a></p>
	<p>Authors:
		Oluwatosin Tanimola
		Olamilekan Shobayo
		Olusogo Popoola
		Obinna Okoyeigbo
		</p>
	<p>Breast cancer is the most prevalent type of disease among women. It has become one of the foremost causes of death among women globally. Early detection plays a significant role in administering personalized treatment and improving patient outcomes. Mammography procedures are often used to detect early-stage cancer cells. This traditional method of mammography while valuable has limitations in its potential for false positives and negatives, patient discomfort, and radiation exposure. Therefore, there is a probe for more accurate techniques required in detecting breast cancer, leading to exploring the potential of machine learning in the classification of diagnostic images due to its efficiency and accuracy. This study conducted a comparative analysis of pre-trained CNNs (ResNet50 and VGG16) and vision transformers (ViT-base and SWIN transformer) with the inclusion of ViT-base trained from scratch model architectures to effectively classify mammographic breast cancer images into benign and malignant cases. The SWIN transformer exhibits superior performance with 99.9% accuracy and a precision of 99.8%. These findings demonstrate the efficiency of deep learning to accurately classify mammographic breast cancer images for the diagnosis of breast cancer, leading to improvements in patient outcomes.</p>
	]]></content:encoded>

	<dc:title>Breast Cancer Classification Using Fine-Tuned SWIN Transformer Model on Mammographic Images</dc:title>
			<dc:creator>Oluwatosin Tanimola</dc:creator>
			<dc:creator>Olamilekan Shobayo</dc:creator>
			<dc:creator>Olusogo Popoola</dc:creator>
			<dc:creator>Obinna Okoyeigbo</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040026</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-11-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-11-11</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>461</prism:startingPage>
		<prism:doi>10.3390/analytics3040026</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/26</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/25">

	<title>Analytics, Vol. 3, Pages 449-460: Modified Bayesian Information Criterion for Item Response Models in Planned Missingness Test Designs</title>
	<link>https://www.mdpi.com/2813-2203/3/4/25</link>
	<description>The Bayesian information criterion (BIC) is a widely used statistical tool originally derived for fully observed data. The BIC formula includes the sample size and the number of estimated parameters in the penalty term. However, not all variables are available for every subject in planned missingness designs. This article demonstrates that a modified BIC, tailored for planned missingness designs, outperforms the original BIC. The modification adjusts the penalty term by using the average number of estimable parameters per subject rather than the total number of model parameters. This new criterion was successfully applied to item response theory models in two simulation studies. We recommend that future studies utilizing planned missingness designs adopt the modified BIC formula proposed here.</description>
	<pubDate>2024-11-08</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 449-460: Modified Bayesian Information Criterion for Item Response Models in Planned Missingness Test Designs</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/25">doi: 10.3390/analytics3040025</a></p>
	<p>Authors:
		Alexander Robitzsch
		</p>
	<p>The Bayesian information criterion (BIC) is a widely used statistical tool originally derived for fully observed data. The BIC formula includes the sample size and the number of estimated parameters in the penalty term. However, not all variables are available for every subject in planned missingness designs. This article demonstrates that a modified BIC, tailored for planned missingness designs, outperforms the original BIC. The modification adjusts the penalty term by using the average number of estimable parameters per subject rather than the total number of model parameters. This new criterion was successfully applied to item response theory models in two simulation studies. We recommend that future studies utilizing planned missingness designs adopt the modified BIC formula proposed here.</p>
	]]></content:encoded>

	<dc:title>Modified Bayesian Information Criterion for Item Response Models in Planned Missingness Test Designs</dc:title>
			<dc:creator>Alexander Robitzsch</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040025</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-11-08</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-11-08</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>449</prism:startingPage>
		<prism:doi>10.3390/analytics3040025</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/25</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/24">

	<title>Analytics, Vol. 3, Pages 439-448: Adaptive Weighted Multiview Kernel Matrix Factorization and Its Application in Alzheimer&amp;rsquo;s Disease Analysis</title>
	<link>https://www.mdpi.com/2813-2203/3/4/24</link>
	<description>Recent technology and equipment advancements have provided us with opportunities to better analyze Alzheimer&amp;rsquo;s disease (AD), where we could collect and employ the data from different image and genetic modalities that may potentially enhance the predictive performance. To perform better clustering in AD analysis, in this paper, we propose a novel model to leverage data from all different modalities/views, which can learn the weights of each view adaptively. Different from previous vanilla Non-negative matrix factorization which assumes data is linearly separable, we propose a simple yet efficient method based on kernel matrix factorization, which is not only able to deal with non-linear data structure but also can achieve better prediction accuracy. Experimental results on the ADNI dataset demonstrate the effectiveness of our proposed method, which indicates promising prospects for kernel application in AD analysis.</description>
	<pubDate>2024-11-04</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 439-448: Adaptive Weighted Multiview Kernel Matrix Factorization and Its Application in Alzheimer&amp;rsquo;s Disease Analysis</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/24">doi: 10.3390/analytics3040024</a></p>
	<p>Authors:
		Yarui Cao
		Kai Liu
		</p>
	<p>Recent technology and equipment advancements have provided us with opportunities to better analyze Alzheimer&amp;rsquo;s disease (AD), where we could collect and employ the data from different image and genetic modalities that may potentially enhance the predictive performance. To perform better clustering in AD analysis, in this paper, we propose a novel model to leverage data from all different modalities/views, which can learn the weights of each view adaptively. Different from previous vanilla Non-negative matrix factorization which assumes data is linearly separable, we propose a simple yet efficient method based on kernel matrix factorization, which is not only able to deal with non-linear data structure but also can achieve better prediction accuracy. Experimental results on the ADNI dataset demonstrate the effectiveness of our proposed method, which indicates promising prospects for kernel application in AD analysis.</p>
	]]></content:encoded>

	<dc:title>Adaptive Weighted Multiview Kernel Matrix Factorization and Its Application in Alzheimer&amp;rsquo;s Disease Analysis</dc:title>
			<dc:creator>Yarui Cao</dc:creator>
			<dc:creator>Kai Liu</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040024</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-11-04</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-11-04</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>439</prism:startingPage>
		<prism:doi>10.3390/analytics3040024</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/24</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/23">

	<title>Analytics, Vol. 3, Pages 425-438: Electric Vehicle Sentiment Analysis Using Large Language Models</title>
	<link>https://www.mdpi.com/2813-2203/3/4/23</link>
	<description>Sentiment analysis is a technique used to understand the public&amp;rsquo;s opinion towards an event, product, or organization. For example, sentiment analysis can be used to understand positive or negative opinions or attitudes towards electric vehicle (EV) brands. This provides companies with valuable insight into the public&amp;rsquo;s opinion of their products and brands. In the field of natural language processing (NLP), transformer models have shown great performance compared to traditional machine learning algorithms. However, these models have not been explored extensively in the EV domain. EV companies are becoming significant competitors in the automotive industry and are projected to cover up to 30% of the United States light vehicle market by 2030. In this study, we present a comparative study of large language models (LLMs) including bidirectional encoder representations from transformers (BERT), robustly optimised BERT (RoBERTa), and a generalised autoregressive pre-training method (XLNet) using Lucid Motors and Tesla Motors YouTube datasets. Results evidenced that LLMs like BERT and her variants are off-the-shelf algorithms for sentiment analysis, specifically when fine-tuned. Furthermore, our findings present the need for domain adaptation whilst utilizing LLMs. Finally, the experimental results showed that RoBERTa achieved consistent performance across the EV datasets with an F1 score of at least 92%.</description>
	<pubDate>2024-11-01</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 425-438: Electric Vehicle Sentiment Analysis Using Large Language Models</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/23">doi: 10.3390/analytics3040023</a></p>
	<p>Authors:
		Hemlata Sharma
		Faiz Ud Din
		Bayode Ogunleye
		</p>
	<p>Sentiment analysis is a technique used to understand the public&amp;rsquo;s opinion towards an event, product, or organization. For example, sentiment analysis can be used to understand positive or negative opinions or attitudes towards electric vehicle (EV) brands. This provides companies with valuable insight into the public&amp;rsquo;s opinion of their products and brands. In the field of natural language processing (NLP), transformer models have shown great performance compared to traditional machine learning algorithms. However, these models have not been explored extensively in the EV domain. EV companies are becoming significant competitors in the automotive industry and are projected to cover up to 30% of the United States light vehicle market by 2030. In this study, we present a comparative study of large language models (LLMs) including bidirectional encoder representations from transformers (BERT), robustly optimised BERT (RoBERTa), and a generalised autoregressive pre-training method (XLNet) using Lucid Motors and Tesla Motors YouTube datasets. Results evidenced that LLMs like BERT and her variants are off-the-shelf algorithms for sentiment analysis, specifically when fine-tuned. Furthermore, our findings present the need for domain adaptation whilst utilizing LLMs. Finally, the experimental results showed that RoBERTa achieved consistent performance across the EV datasets with an F1 score of at least 92%.</p>
	]]></content:encoded>

	<dc:title>Electric Vehicle Sentiment Analysis Using Large Language Models</dc:title>
			<dc:creator>Hemlata Sharma</dc:creator>
			<dc:creator>Faiz Ud Din</dc:creator>
			<dc:creator>Bayode Ogunleye</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040023</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-11-01</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-11-01</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>425</prism:startingPage>
		<prism:doi>10.3390/analytics3040023</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/23</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/4/22">

	<title>Analytics, Vol. 3, Pages 406-424: The Analyst&amp;rsquo;s Hierarchy of Needs: Grounded Design Principles for Tailored Intelligence Analysis Tools</title>
	<link>https://www.mdpi.com/2813-2203/3/4/22</link>
	<description>Intelligence analysis involves gathering, analyzing, and interpreting vast amounts of information from diverse sources to generate accurate and timely insights. Tailored tools hold great promise in providing individualized support, enhancing efficiency, and facilitating the identification of crucial intelligence gaps and trends where traditional tools fail. The effectiveness of tailored tools depends on an analyst&amp;rsquo;s unique needs and motivations, as well as the broader context in which they operate. This paper describes a series of focus discovery exercises that revealed a distinct hierarchy of needs for intelligence analysts. This reflection on the balance between competing needs is of particular value in the context of intelligence analysis, where the compartmentalization required for security can make it difficult to group design patterns in stakeholder values. We hope that this study will enable the development of more effective tools, supporting the well-being and performance of intelligence analysts as well as the organizations they serve.</description>
	<pubDate>2024-10-29</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 406-424: The Analyst&amp;rsquo;s Hierarchy of Needs: Grounded Design Principles for Tailored Intelligence Analysis Tools</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/4/22">doi: 10.3390/analytics3040022</a></p>
	<p>Authors:
		Antonio E. Girona
		James C. Peters
		Wenyuan Wang
		R. Jordan Crouser
		</p>
	<p>Intelligence analysis involves gathering, analyzing, and interpreting vast amounts of information from diverse sources to generate accurate and timely insights. Tailored tools hold great promise in providing individualized support, enhancing efficiency, and facilitating the identification of crucial intelligence gaps and trends where traditional tools fail. The effectiveness of tailored tools depends on an analyst&amp;rsquo;s unique needs and motivations, as well as the broader context in which they operate. This paper describes a series of focus discovery exercises that revealed a distinct hierarchy of needs for intelligence analysts. This reflection on the balance between competing needs is of particular value in the context of intelligence analysis, where the compartmentalization required for security can make it difficult to group design patterns in stakeholder values. We hope that this study will enable the development of more effective tools, supporting the well-being and performance of intelligence analysts as well as the organizations they serve.</p>
	]]></content:encoded>

	<dc:title>The Analyst&amp;rsquo;s Hierarchy of Needs: Grounded Design Principles for Tailored Intelligence Analysis Tools</dc:title>
			<dc:creator>Antonio E. Girona</dc:creator>
			<dc:creator>James C. Peters</dc:creator>
			<dc:creator>Wenyuan Wang</dc:creator>
			<dc:creator>R. Jordan Crouser</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3040022</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-10-29</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-10-29</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>406</prism:startingPage>
		<prism:doi>10.3390/analytics3040022</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/4/22</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/21">

	<title>Analytics, Vol. 3, Pages 389-405: Directed Topic Extraction with Side Information for Sustainability Analysis</title>
	<link>https://www.mdpi.com/2813-2203/3/3/21</link>
	<description>Topic analysis represents each document in a text corpus in a low-dimensional latent topic space. In some cases, the desired topic representation is subject to specific requirements or guidelines constituting side information. For instance, sustainability-aware investors might be interested in automatically assessing aspects of firm sustainability based on the textual content of its corporate reports, focusing on the established 17 UN sustainability goals. The main corpus consists of the corporate report texts, while the texts containing the definitions of the 17 UN sustainability goals represent the side information. Under the assumption that both text corpora share a common low-dimensional subspace, we propose representing them in such a space via directed topic extraction using matrix co-factorization. Both the main and the side text corpora are first represented as term&amp;ndash;context matrices, which are then jointly decomposed into word&amp;ndash;topic and topic&amp;ndash;context matrices. The word&amp;ndash;topic matrix is common to both text corpora, whereas the topic&amp;ndash;context matrices contain specific representations in the shared topic space. A nuisance parameter, which allows us to shift the focus between the error minimization of individual factorization terms, controls the extent to which the side information is taken into account. With our approach, documents from the main and the side corpora can be related to each other in the resulting latent topic space. That is, the corporate reports are represented in the same latent topic space as the descriptions of the 17 UN sustainability goals, enabling a structured automatic sustainability assessment of the textual report&amp;rsquo;s content. We provide an algorithm for such directed topic extraction and propose techniques for visualizing and interpreting the results.</description>
	<pubDate>2024-09-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 389-405: Directed Topic Extraction with Side Information for Sustainability Analysis</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/21">doi: 10.3390/analytics3030021</a></p>
	<p>Authors:
		Maria Osipenko
		</p>
	<p>Topic analysis represents each document in a text corpus in a low-dimensional latent topic space. In some cases, the desired topic representation is subject to specific requirements or guidelines constituting side information. For instance, sustainability-aware investors might be interested in automatically assessing aspects of firm sustainability based on the textual content of its corporate reports, focusing on the established 17 UN sustainability goals. The main corpus consists of the corporate report texts, while the texts containing the definitions of the 17 UN sustainability goals represent the side information. Under the assumption that both text corpora share a common low-dimensional subspace, we propose representing them in such a space via directed topic extraction using matrix co-factorization. Both the main and the side text corpora are first represented as term&amp;ndash;context matrices, which are then jointly decomposed into word&amp;ndash;topic and topic&amp;ndash;context matrices. The word&amp;ndash;topic matrix is common to both text corpora, whereas the topic&amp;ndash;context matrices contain specific representations in the shared topic space. A nuisance parameter, which allows us to shift the focus between the error minimization of individual factorization terms, controls the extent to which the side information is taken into account. With our approach, documents from the main and the side corpora can be related to each other in the resulting latent topic space. That is, the corporate reports are represented in the same latent topic space as the descriptions of the 17 UN sustainability goals, enabling a structured automatic sustainability assessment of the textual report&amp;rsquo;s content. We provide an algorithm for such directed topic extraction and propose techniques for visualizing and interpreting the results.</p>
	]]></content:encoded>

	<dc:title>Directed Topic Extraction with Side Information for Sustainability Analysis</dc:title>
			<dc:creator>Maria Osipenko</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030021</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-09-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-09-11</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>389</prism:startingPage>
		<prism:doi>10.3390/analytics3030021</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/21</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/20">

	<title>Analytics, Vol. 3, Pages 368-388: SIMEX-Based and Analytical Bias Corrections in Stocking&amp;ndash;Lord Linking</title>
	<link>https://www.mdpi.com/2813-2203/3/3/20</link>
	<description>Stocking&amp;ndash;Lord (SL) linking is a popular linking method for group comparisons based on dichotomous item responses. This article proposes a bias correction technique based on the simulation extrapolation (SIMEX) method for SL linking in the 2PL model in the presence of uniform differential item functioning (DIF). The SIMEX-based method is compared to the analytical bias correction methods of SL linking. It turned out in a simulation study that SIMEX-based SL linking performed best, is easy to implement, and can be adapted to other linking methods straightforwardly.</description>
	<pubDate>2024-08-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 368-388: SIMEX-Based and Analytical Bias Corrections in Stocking&amp;ndash;Lord Linking</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/20">doi: 10.3390/analytics3030020</a></p>
	<p>Authors:
		Alexander Robitzsch
		</p>
	<p>Stocking&amp;ndash;Lord (SL) linking is a popular linking method for group comparisons based on dichotomous item responses. This article proposes a bias correction technique based on the simulation extrapolation (SIMEX) method for SL linking in the 2PL model in the presence of uniform differential item functioning (DIF). The SIMEX-based method is compared to the analytical bias correction methods of SL linking. It turned out in a simulation study that SIMEX-based SL linking performed best, is easy to implement, and can be adapted to other linking methods straightforwardly.</p>
	]]></content:encoded>

	<dc:title>SIMEX-Based and Analytical Bias Corrections in Stocking&amp;ndash;Lord Linking</dc:title>
			<dc:creator>Alexander Robitzsch</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030020</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-08-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-08-06</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>368</prism:startingPage>
		<prism:doi>10.3390/analytics3030020</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/20</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/19">

	<title>Analytics, Vol. 3, Pages 344-367: Comparative Analysis of Nature-Inspired Metaheuristic Techniques for Optimizing Phishing Website Detection</title>
	<link>https://www.mdpi.com/2813-2203/3/3/19</link>
	<description>The increasing number, frequency, and sophistication of phishing website-based attacks necessitate the development of robust solutions for detecting phishing websites to enhance the overall security of cyberspace. Drawing inspiration from natural processes, nature-inspired metaheuristic techniques have been proven to be efficient in solving complex optimization problems in diverse domains. Following these successes, this research paper aims to investigate the effectiveness of metaheuristic techniques, particularly Genetic Algorithms (GAs), Differential Evolution (DE), and Particle Swarm Optimization (PSO), in optimizing the hyperparameters of machine learning (ML) algorithms for detecting phishing websites. Using multiple datasets, six ensemble classifiers were trained on each dataset and their hyperparameters were optimized using each metaheuristic technique. As a baseline for assessing performance improvement, the classifiers were also trained with the default hyperparameters. To validate the genuine impact of the techniques over the use of default hyperparameters, we conducted statistical tests on the accuracy scores of all the optimized classifiers. The results show that the GA is the most effective technique, by improving the accuracy scores of all the classifiers, followed by DE, which improved four of the six classifiers. PSO was the least effective, improving only one classifier. It was also found that GA-optimized Gradient Boosting, LGBM and XGBoost were the best classifiers across all the metrics in predicting phishing websites, achieving peak accuracy scores of 98.98%, 99.24%, and 99.47%, respectively.</description>
	<pubDate>2024-08-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 344-367: Comparative Analysis of Nature-Inspired Metaheuristic Techniques for Optimizing Phishing Website Detection</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/19">doi: 10.3390/analytics3030019</a></p>
	<p>Authors:
		Thomas Nagunwa
		</p>
	<p>The increasing number, frequency, and sophistication of phishing website-based attacks necessitate the development of robust solutions for detecting phishing websites to enhance the overall security of cyberspace. Drawing inspiration from natural processes, nature-inspired metaheuristic techniques have been proven to be efficient in solving complex optimization problems in diverse domains. Following these successes, this research paper aims to investigate the effectiveness of metaheuristic techniques, particularly Genetic Algorithms (GAs), Differential Evolution (DE), and Particle Swarm Optimization (PSO), in optimizing the hyperparameters of machine learning (ML) algorithms for detecting phishing websites. Using multiple datasets, six ensemble classifiers were trained on each dataset and their hyperparameters were optimized using each metaheuristic technique. As a baseline for assessing performance improvement, the classifiers were also trained with the default hyperparameters. To validate the genuine impact of the techniques over the use of default hyperparameters, we conducted statistical tests on the accuracy scores of all the optimized classifiers. The results show that the GA is the most effective technique, by improving the accuracy scores of all the classifiers, followed by DE, which improved four of the six classifiers. PSO was the least effective, improving only one classifier. It was also found that GA-optimized Gradient Boosting, LGBM and XGBoost were the best classifiers across all the metrics in predicting phishing websites, achieving peak accuracy scores of 98.98%, 99.24%, and 99.47%, respectively.</p>
	]]></content:encoded>

	<dc:title>Comparative Analysis of Nature-Inspired Metaheuristic Techniques for Optimizing Phishing Website Detection</dc:title>
			<dc:creator>Thomas Nagunwa</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030019</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-08-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-08-06</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>344</prism:startingPage>
		<prism:doi>10.3390/analytics3030019</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/19</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/18">

	<title>Analytics, Vol. 3, Pages 318-343: A Longitudinal Tree-Based Framework for Lapse Management in Life Insurance</title>
	<link>https://www.mdpi.com/2813-2203/3/3/18</link>
	<description>Developing an informed lapse management strategy (LMS) is critical for life insurers to improve profitability and gain insight into the risk of their global portfolio. Prior research in actuarial science has shown that targeting policyholders by maximising their individual customer lifetime value is more advantageous than targeting all those likely to lapse. However, most existing lapse analyses do not leverage the variability of features and targets over time. We propose a longitudinal LMS framework, utilising tree-based models for longitudinal data, such as left-truncated and right-censored (LTRC) trees and forests, as well as mixed-effect tree-based models. Our methodology provides time-informed insights, leading to increased precision in targeting. Our findings indicate that the use of longitudinally structured data significantly enhances the precision of models in predicting lapse behaviour, estimating customer lifetime value, and evaluating individual retention gains. The implementation of mixed-effect random forests enables the production of time-varying predictions that are highly relevant for decision-making. This paper contributes to the field of lapse analysis for life insurers by demonstrating the importance of exploiting the complete past trajectory of policyholders, which is often available in insurers&amp;rsquo; information systems but has yet to be fully utilised.</description>
	<pubDate>2024-08-05</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 318-343: A Longitudinal Tree-Based Framework for Lapse Management in Life Insurance</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/18">doi: 10.3390/analytics3030018</a></p>
	<p>Authors:
		Mathias Valla
		</p>
	<p>Developing an informed lapse management strategy (LMS) is critical for life insurers to improve profitability and gain insight into the risk of their global portfolio. Prior research in actuarial science has shown that targeting policyholders by maximising their individual customer lifetime value is more advantageous than targeting all those likely to lapse. However, most existing lapse analyses do not leverage the variability of features and targets over time. We propose a longitudinal LMS framework, utilising tree-based models for longitudinal data, such as left-truncated and right-censored (LTRC) trees and forests, as well as mixed-effect tree-based models. Our methodology provides time-informed insights, leading to increased precision in targeting. Our findings indicate that the use of longitudinally structured data significantly enhances the precision of models in predicting lapse behaviour, estimating customer lifetime value, and evaluating individual retention gains. The implementation of mixed-effect random forests enables the production of time-varying predictions that are highly relevant for decision-making. This paper contributes to the field of lapse analysis for life insurers by demonstrating the importance of exploiting the complete past trajectory of policyholders, which is often available in insurers&rsquo; information systems but has yet to be fully utilised.</p>
	]]></content:encoded>

	<dc:title>A Longitudinal Tree-Based Framework for Lapse Management in Life Insurance</dc:title>
			<dc:creator>Mathias Valla</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030018</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-08-05</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-08-05</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>318</prism:startingPage>
		<prism:doi>10.3390/analytics3030018</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/18</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/17">

	<title>Analytics, Vol. 3, Pages 297-317: Enhancing Talent Recruitment in Business Intelligence Systems: A Comparative Analysis of Machine Learning Models</title>
	<link>https://www.mdpi.com/2813-2203/3/3/17</link>
	<description>In the competitive field of business intelligence, optimizing talent recruitment through data-driven methodologies is crucial for better decision-making. This study compares the effectiveness of various machine learning models to improve recruitment accuracy and efficiency. Using the recruitment data from a major Yemeni organization (2019&amp;ndash;2022), we evaluated models including K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayes, Decision Trees, Random Forest, Gradient Boosting Classifier, AdaBoost Classifier, and Neural Networks. Hyperparameter tuning and cross-validation were used for optimization. The Random Forest model achieved the highest accuracy (92.8%), followed by Neural Networks (92.6%) and Gradient Boosting Classifier (92.5%). These results suggest that advanced machine learning models, particularly Random Forest and Neural Networks, can significantly enhance the recruitment processes in business intelligence systems. This study provides valuable insights for recruiters, advocating for the integration of sophisticated machine learning techniques in talent acquisition strategies.</description>
	<pubDate>2024-07-15</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 297-317: Enhancing Talent Recruitment in Business Intelligence Systems: A Comparative Analysis of Machine Learning Models</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/17">doi: 10.3390/analytics3030017</a></p>
	<p>Authors:
		Hikmat Al-Quhfa
		Ali Mothana
		Abdussalam Aljbri
		Jie Song
		</p>
	<p>In the competitive field of business intelligence, optimizing talent recruitment through data-driven methodologies is crucial for better decision-making. This study compares the effectiveness of various machine learning models to improve recruitment accuracy and efficiency. Using the recruitment data from a major Yemeni organization (2019&ndash;2022), we evaluated models including K-Nearest Neighbors, Logistic Regression, Support Vector Machine, Naive Bayes, Decision Trees, Random Forest, Gradient Boosting Classifier, AdaBoost Classifier, and Neural Networks. Hyperparameter tuning and cross-validation were used for optimization. The Random Forest model achieved the highest accuracy (92.8%), followed by Neural Networks (92.6%) and Gradient Boosting Classifier (92.5%). These results suggest that advanced machine learning models, particularly Random Forest and Neural Networks, can significantly enhance the recruitment processes in business intelligence systems. This study provides valuable insights for recruiters, advocating for the integration of sophisticated machine learning techniques in talent acquisition strategies.</p>
	]]></content:encoded>

	<dc:title>Enhancing Talent Recruitment in Business Intelligence Systems: A Comparative Analysis of Machine Learning Models</dc:title>
			<dc:creator>Hikmat Al-Quhfa</dc:creator>
			<dc:creator>Ali Mothana</dc:creator>
			<dc:creator>Abdussalam Aljbri</dc:creator>
			<dc:creator>Jie Song</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030017</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-07-15</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-07-15</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>297</prism:startingPage>
		<prism:doi>10.3390/analytics3030017</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/17</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/16">

	<title>Analytics, Vol. 3, Pages 276-296: Modeling Sea Level Rise Using Ensemble Techniques: Impacts on Coastal Adaptation, Freshwater Ecosystems, Agriculture and Infrastructure</title>
	<link>https://www.mdpi.com/2813-2203/3/3/16</link>
	<description>Sea level rise (SLR) is a crucial indicator of climate change, primarily driven by greenhouse gas emissions and the subsequent increase in global temperatures. The impact of SLR, however, varies regionally due to factors such as ocean bathymetry, resulting in distinct shifts across different areas compared to the global average. Understanding the complex factors influencing SLR across diverse spatial scales, along with the associated uncertainties, is essential. This study focuses on the East Coast of the United States and Gulf of Mexico, utilizing historical SLR data from 1993 to 2023. To forecast SLR trends from 2024 to 2103, a weighted ensemble model comprising SARIMAX, LSTM, and exponential smoothing models was employed. Additionally, using historical greenhouse gas data, an ensemble of LSTM models was used to predict real-time SLR values, achieving a testing loss of 0.005. Furthermore, conductance and dissolved oxygen (DO) values were assessed for the entire forecasting period, leveraging forecasted SLR trends to evaluate the impacts on marine life, agriculture, and infrastructure.</description>
	<pubDate>2024-07-05</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 276-296: Modeling Sea Level Rise Using Ensemble Techniques: Impacts on Coastal Adaptation, Freshwater Ecosystems, Agriculture and Infrastructure</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/16">doi: 10.3390/analytics3030016</a></p>
	<p>Authors:
		Sambandh Bhusan Dhal
		Rishabh Singh
		Tushar Pandey
		Sheelabhadra Dey
		Stavros Kalafatis
		Vivekvardhan Kesireddy
		</p>
	<p>Sea level rise (SLR) is a crucial indicator of climate change, primarily driven by greenhouse gas emissions and the subsequent increase in global temperatures. The impact of SLR, however, varies regionally due to factors such as ocean bathymetry, resulting in distinct shifts across different areas compared to the global average. Understanding the complex factors influencing SLR across diverse spatial scales, along with the associated uncertainties, is essential. This study focuses on the East Coast of the United States and Gulf of Mexico, utilizing historical SLR data from 1993 to 2023. To forecast SLR trends from 2024 to 2103, a weighted ensemble model comprising SARIMAX, LSTM, and exponential smoothing models was employed. Additionally, using historical greenhouse gas data, an ensemble of LSTM models was used to predict real-time SLR values, achieving a testing loss of 0.005. Furthermore, conductance and dissolved oxygen (DO) values were assessed for the entire forecasting period, leveraging forecasted SLR trends to evaluate the impacts on marine life, agriculture, and infrastructure.</p>
	]]></content:encoded>

	<dc:title>Modeling Sea Level Rise Using Ensemble Techniques: Impacts on Coastal Adaptation, Freshwater Ecosystems, Agriculture and Infrastructure</dc:title>
			<dc:creator>Sambandh Bhusan Dhal</dc:creator>
			<dc:creator>Rishabh Singh</dc:creator>
			<dc:creator>Tushar Pandey</dc:creator>
			<dc:creator>Sheelabhadra Dey</dc:creator>
			<dc:creator>Stavros Kalafatis</dc:creator>
			<dc:creator>Vivekvardhan Kesireddy</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030016</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-07-05</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-07-05</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Communication</prism:section>
	<prism:startingPage>276</prism:startingPage>
		<prism:doi>10.3390/analytics3030016</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/16</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/3/15">

	<title>Analytics, Vol. 3, Pages 255-275: TaskFinder: A Semantics-Based Methodology for Visualization Task Recommendation</title>
	<link>https://www.mdpi.com/2813-2203/3/3/15</link>
	<description>Data visualization has entered the mainstream, and numerous visualization recommender systems have been proposed to assist visualization novices, as well as busy professionals, in selecting the most appropriate type of chart for their data. Given a dataset and a set of user-defined analytical tasks, these systems can make recommendations based on expert coded visualization design principles or empirical models. However, the need to identify the pertinent analytical tasks beforehand still exists and often requires domain expertise. In this work, we aim to automate this step with TaskFinder, a prototype system that leverages the information available in textual documents to understand domain-specific relations between attributes and tasks. TaskFinder employs word vectors as well as a custom dependency parser along with an expert-defined list of task keywords to extract and rank associations between tasks and attributes. It pairs these associations with a statistical analysis of the dataset to filter out tasks irrelevant given the data. TaskFinder ultimately produces a ranked list of attribute&amp;ndash;task pairs. We show that the number of domain articles needed to converge to a recommendation consensus is bounded for our approach. We demonstrate our TaskFinder over multiple domains with varying article types and quantities.</description>
	<pubDate>2024-07-04</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 255-275: TaskFinder: A Semantics-Based Methodology for Visualization Task Recommendation</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/3/15">doi: 10.3390/analytics3030015</a></p>
	<p>Authors:
		Darius Coelho
		Bhavya Ghai
		Arjun Krishna
		Maria Velez-Rojas
		Steve Greenspan
		Serge Mankovski
		Klaus Mueller
		</p>
	<p>Data visualization has entered the mainstream, and numerous visualization recommender systems have been proposed to assist visualization novices, as well as busy professionals, in selecting the most appropriate type of chart for their data. Given a dataset and a set of user-defined analytical tasks, these systems can make recommendations based on expert coded visualization design principles or empirical models. However, the need to identify the pertinent analytical tasks beforehand still exists and often requires domain expertise. In this work, we aim to automate this step with TaskFinder, a prototype system that leverages the information available in textual documents to understand domain-specific relations between attributes and tasks. TaskFinder employs word vectors as well as a custom dependency parser along with an expert-defined list of task keywords to extract and rank associations between tasks and attributes. It pairs these associations with a statistical analysis of the dataset to filter out tasks irrelevant given the data. TaskFinder ultimately produces a ranked list of attribute&ndash;task pairs. We show that the number of domain articles needed to converge to a recommendation consensus is bounded for our approach. We demonstrate our TaskFinder over multiple domains with varying article types and quantities.</p>
	]]></content:encoded>

	<dc:title>TaskFinder: A Semantics-Based Methodology for Visualization Task Recommendation</dc:title>
			<dc:creator>Darius Coelho</dc:creator>
			<dc:creator>Bhavya Ghai</dc:creator>
			<dc:creator>Arjun Krishna</dc:creator>
			<dc:creator>Maria Velez-Rojas</dc:creator>
			<dc:creator>Steve Greenspan</dc:creator>
			<dc:creator>Serge Mankovski</dc:creator>
			<dc:creator>Klaus Mueller</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3030015</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-07-04</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-07-04</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>255</prism:startingPage>
		<prism:doi>10.3390/analytics3030015</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/3/15</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/2/14">

	<title>Analytics, Vol. 3, Pages 241-254: Customer Sentiments in Product Reviews: A Comparative Study with GooglePaLM</title>
	<link>https://www.mdpi.com/2813-2203/3/2/14</link>
	<description>In this work, we evaluated the efficacy of Google&amp;rsquo;s Pathways Language Model (GooglePaLM) in analyzing sentiments expressed in product reviews. Although conventional Natural Language Processing (NLP) techniques such as the rule-based Valence Aware Dictionary for Sentiment Reasoning (VADER) and the long sequence Bidirectional Encoder Representations from Transformers (BERT) model are effective, they frequently encounter difficulties when dealing with intricate linguistic features like sarcasm and contextual nuances commonly found in customer feedback. We performed a sentiment analysis on Amazon&amp;rsquo;s fashion review datasets using the VADER, BERT, and GooglePaLM models, respectively, and compared the results based on evaluation metrics such as precision, recall, accuracy correct positive prediction, and correct negative prediction. We used the default values of the VADER and BERT models and slightly finetuned GooglePaLM with a Temperature of 0.0 and an N-value of 1. We observed that GooglePaLM performed better with correct positive and negative prediction values of 0.91 and 0.93, respectively, followed by BERT and VADER. We concluded that large language models surpass traditional rule-based systems for natural language processing tasks.</description>
	<pubDate>2024-06-18</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 241-254: Customer Sentiments in Product Reviews: A Comparative Study with GooglePaLM</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/2/14">doi: 10.3390/analytics3020014</a></p>
	<p>Authors:
		Olamilekan Shobayo
		Swethika Sasikumar
		Sandhya Makkar
		Obinna Okoyeigbo
		</p>
	<p>In this work, we evaluated the efficacy of Google&rsquo;s Pathways Language Model (GooglePaLM) in analyzing sentiments expressed in product reviews. Although conventional Natural Language Processing (NLP) techniques such as the rule-based Valence Aware Dictionary for Sentiment Reasoning (VADER) and the long sequence Bidirectional Encoder Representations from Transformers (BERT) model are effective, they frequently encounter difficulties when dealing with intricate linguistic features like sarcasm and contextual nuances commonly found in customer feedback. We performed a sentiment analysis on Amazon&rsquo;s fashion review datasets using the VADER, BERT, and GooglePaLM models, respectively, and compared the results based on evaluation metrics such as precision, recall, accuracy correct positive prediction, and correct negative prediction. We used the default values of the VADER and BERT models and slightly finetuned GooglePaLM with a Temperature of 0.0 and an N-value of 1. We observed that GooglePaLM performed better with correct positive and negative prediction values of 0.91 and 0.93, respectively, followed by BERT and VADER. We concluded that large language models surpass traditional rule-based systems for natural language processing tasks.</p>
	]]></content:encoded>

	<dc:title>Customer Sentiments in Product Reviews: A Comparative Study with GooglePaLM</dc:title>
			<dc:creator>Olamilekan Shobayo</dc:creator>
			<dc:creator>Swethika Sasikumar</dc:creator>
			<dc:creator>Sandhya Makkar</dc:creator>
			<dc:creator>Obinna Okoyeigbo</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3020014</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-06-18</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-06-18</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>241</prism:startingPage>
		<prism:doi>10.3390/analytics3020014</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/2/14</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/2/13">

	<title>Analytics, Vol. 3, Pages 225-240: Improving the Giant-Armadillo Optimization Method</title>
	<link>https://www.mdpi.com/2813-2203/3/2/13</link>
	<description>Global optimization is widely adopted presently in a variety of practical and scientific problems. In this context, a group of widely used techniques are evolutionary techniques. A relatively new evolutionary technique in this direction is that of Giant-Armadillo Optimization, which is based on the hunting strategy of giant armadillos. In this paper, modifications to this technique are proposed, such as the periodic application of a local minimization method as well as the use of modern termination techniques based on statistical observations. The proposed modifications have been tested on a wide series of test functions available from the relevant literature and compared against other evolutionary methods.</description>
	<pubDate>2024-06-10</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 225-240: Improving the Giant-Armadillo Optimization Method</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/2/13">doi: 10.3390/analytics3020013</a></p>
	<p>Authors:
		Glykeria Kyrou
		Vasileios Charilogis
		Ioannis G. Tsoulos
		</p>
	<p>Global optimization is widely adopted presently in a variety of practical and scientific problems. In this context, a group of widely used techniques are evolutionary techniques. A relatively new evolutionary technique in this direction is that of Giant-Armadillo Optimization, which is based on the hunting strategy of giant armadillos. In this paper, modifications to this technique are proposed, such as the periodic application of a local minimization method as well as the use of modern termination techniques based on statistical observations. The proposed modifications have been tested on a wide series of test functions available from the relevant literature and compared against other evolutionary methods.</p>
	]]></content:encoded>

	<dc:title>Improving the Giant-Armadillo Optimization Method</dc:title>
			<dc:creator>Glykeria Kyrou</dc:creator>
			<dc:creator>Vasileios Charilogis</dc:creator>
			<dc:creator>Ioannis G. Tsoulos</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3020013</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-06-10</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-06-10</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>225</prism:startingPage>
		<prism:doi>10.3390/analytics3020013</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/2/13</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/2/12">

	<title>Analytics, Vol. 3, Pages 221-224: Beyond the ROC Curve: The IMCP Curve</title>
	<link>https://www.mdpi.com/2813-2203/3/2/12</link>
	<description>The ROC curve [...]</description>
	<pubDate>2024-05-27</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 221-224: Beyond the ROC Curve: The IMCP Curve</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/2/12">doi: 10.3390/analytics3020012</a></p>
	<p>Authors:
		Jesus S. Aguilar-Ruiz
		</p>
	<p>The ROC curve [...]</p>
	]]></content:encoded>

	<dc:title>Beyond the ROC Curve: The IMCP Curve</dc:title>
			<dc:creator>Jesus S. Aguilar-Ruiz</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3020012</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-05-27</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-05-27</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Editorial</prism:section>
	<prism:startingPage>221</prism:startingPage>
		<prism:doi>10.3390/analytics3020012</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/2/12</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/2/11">

	<title>Analytics, Vol. 3, Pages 194-220: Interconnected Markets: Unveiling Volatility Spillovers in Commodities and Energy Markets through BEKK-GARCH Modelling</title>
	<link>https://www.mdpi.com/2813-2203/3/2/11</link>
	<description>Food commodities and energy bills have experienced rapid undulating movements and hikes globally in recent times. This spurred this study to examine the possibility that the shocks that arise from fluctuations of one market spill over to the other and to determine how time-varying the spillovers were across a time. Data were daily frequency (prices of grains and energy products) from 1 July 2019 to 31 December 2022, as quoted in markets. The choice of the period was to capture the COVID pandemic and the Russian&amp;ndash;Ukrainian war as events that could impact volatility. The returns were duly calculated using spreadsheets and subjected to ADF stationarity, co-integration, and the full BEKK-GARCH estimation. The results revealed a prolonged association between returns in the energy markets and food commodity market returns. Both markets were found to have volatility persistence individually, and time-varying bidirectional transmission of volatility across the markets was found. No lagged-effects spillover was found from one market to the other. The findings confirm that shocks that emanate from fluctuations in energy markets are impactful on the volatility of prices in food commodity markets and vice versa, but this impact occurs immediately after the shocks arise or on the same day such variation occurs.</description>
	<pubDate>2024-04-16</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 194-220: Interconnected Markets: Unveiling Volatility Spillovers in Commodities and Energy Markets through BEKK-GARCH Modelling</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/2/11">doi: 10.3390/analytics3020011</a></p>
	<p>Authors:
		Tetiana Paientko
		Stanley Amakude
		</p>
	<p>Food commodities and energy bills have experienced rapid undulating movements and hikes globally in recent times. This spurred this study to examine the possibility that the shocks that arise from fluctuations of one market spill over to the other and to determine how time-varying the spillovers were across a time. Data were daily frequency (prices of grains and energy products) from 1 July 2019 to 31 December 2022, as quoted in markets. The choice of the period was to capture the COVID pandemic and the Russian&ndash;Ukrainian war as events that could impact volatility. The returns were duly calculated using spreadsheets and subjected to ADF stationarity, co-integration, and the full BEKK-GARCH estimation. The results revealed a prolonged association between returns in the energy markets and food commodity market returns. Both markets were found to have volatility persistence individually, and time-varying bidirectional transmission of volatility across the markets was found. No lagged-effects spillover was found from one market to the other. The findings confirm that shocks that emanate from fluctuations in energy markets are impactful on the volatility of prices in food commodity markets and vice versa, but this impact occurs immediately after the shocks arise or on the same day such variation occurs.</p>
	]]></content:encoded>

	<dc:title>Interconnected Markets: Unveiling Volatility Spillovers in Commodities and Energy Markets through BEKK-GARCH Modelling</dc:title>
			<dc:creator>Tetiana Paientko</dc:creator>
			<dc:creator>Stanley Amakude</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3020011</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-04-16</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-04-16</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>194</prism:startingPage>
		<prism:doi>10.3390/analytics3020011</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/2/11</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/2/10">

	<title>Analytics, Vol. 3, Pages 178-193: Learner Engagement and Demographic Influences in Brazilian Massive Open Online Courses: Aprenda Mais Platform Case Study</title>
	<link>https://www.mdpi.com/2813-2203/3/2/10</link>
	<description>This paper explores the dynamics of student engagement and demographic influences in Massive Open Online Courses (MOOCs). The study analyzes multiple facets of Brazilian MOOC participation, including re-enrollment patterns, course completion rates, and the impact of demographic characteristics on learning outcomes. Using survey data and statistical analyses from the public Aprenda Mais Platform, this study reveals that MOOC learners exhibit a strong tendency toward continuous learning, with a majority re-enrolling in subsequent courses within a short timeframe. The average completion rate across courses is around 42.14%, with learners maintaining consistent academic performance. Demographic factors, notably, race/color and disability, are found to influence enrollment and completion rates, underscoring the importance of inclusive educational practices. Geographical location impacts students&amp;rsquo; decision to enroll in and complete courses, highlighting the necessity for region-specific educational strategies. The research concludes that a diverse array of factors, including content interest, personal motivation, and demographic attributes, shape student engagement in MOOCs. These insights are vital for educators and course designers in creating effective, inclusive, and engaging online learning experiences.</description>
	<pubDate>2024-04-03</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 178-193: Learner Engagement and Demographic Influences in Brazilian Massive Open Online Courses: Aprenda Mais Platform Case Study</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/2/10">doi: 10.3390/analytics3020010</a></p>
	<p>Authors:
		Júlia Marques Carvalho da Silva
		Gabriela Hahn Pedroso
		Augusto Basso Veber
		Úrsula Gomes Rosa Maruyama
		</p>
	<p>This paper explores the dynamics of student engagement and demographic influences in Massive Open Online Courses (MOOCs). The study analyzes multiple facets of Brazilian MOOC participation, including re-enrollment patterns, course completion rates, and the impact of demographic characteristics on learning outcomes. Using survey data and statistical analyses from the public Aprenda Mais Platform, this study reveals that MOOC learners exhibit a strong tendency toward continuous learning, with a majority re-enrolling in subsequent courses within a short timeframe. The average completion rate across courses is around 42.14%, with learners maintaining consistent academic performance. Demographic factors, notably, race/color and disability, are found to influence enrollment and completion rates, underscoring the importance of inclusive educational practices. Geographical location impacts students&rsquo; decision to enroll in and complete courses, highlighting the necessity for region-specific educational strategies. The research concludes that a diverse array of factors, including content interest, personal motivation, and demographic attributes, shape student engagement in MOOCs. These insights are vital for educators and course designers in creating effective, inclusive, and engaging online learning experiences.</p>
	]]></content:encoded>

	<dc:title>Learner Engagement and Demographic Influences in Brazilian Massive Open Online Courses: Aprenda Mais Platform Case Study</dc:title>
			<dc:creator>Júlia Marques Carvalho da Silva</dc:creator>
			<dc:creator>Gabriela Hahn Pedroso</dc:creator>
			<dc:creator>Augusto Basso Veber</dc:creator>
			<dc:creator>Úrsula Gomes Rosa Maruyama</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3020010</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-04-03</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-04-03</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>178</prism:startingPage>
		<prism:doi>10.3390/analytics3020010</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/2/10</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/9">

	<title>Analytics, Vol. 3, Pages 165-177: Optimal Matching with Matching Priority</title>
	<link>https://www.mdpi.com/2813-2203/3/1/9</link>
	<description>Matching algorithms are commonly used to build comparable subsets (matchings) in observational studies. When a complete matching is not possible, some units must necessarily be excluded from the final matching. This may bias the final estimates comparing the two populations, and thus it is important to reduce the number of drops to avoid unsatisfactory results. Greedy matching algorithms may not reach the maximum matching size, thus dropping more units than necessary. Optimal matching algorithms do ensure a maximum matching size, but they implicitly assume that all units have the same matching priority. In this paper, we propose a matching strategy which is order optimal in the sense that it finds a maximum matching size which is consistent with a given matching priority. The strategy is based on an order-optimal matching algorithm originally proposed in connection with assignment problems by D. Gale. When a matching priority is given, the algorithm ensures that the discarded units have the lowest possible matching priority. We discuss the algorithm&amp;rsquo;s complexity and its relation with classic optimal matching. We illustrate its use with a problem in a case study concerning a comparison of female and male executives and a simulation.</description>
	<pubDate>2024-03-19</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 165-177: Optimal Matching with Matching Priority</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/9">doi: 10.3390/analytics3010009</a></p>
	<p>Authors:
		Massimo Cannas
		Emiliano Sironi
		</p>
	<p>Matching algorithms are commonly used to build comparable subsets (matchings) in observational studies. When a complete matching is not possible, some units must necessarily be excluded from the final matching. This may bias the final estimates comparing the two populations, and thus it is important to reduce the number of drops to avoid unsatisfactory results. Greedy matching algorithms may not reach the maximum matching size, thus dropping more units than necessary. Optimal matching algorithms do ensure a maximum matching size, but they implicitly assume that all units have the same matching priority. In this paper, we propose a matching strategy which is order optimal in the sense that it finds a maximum matching size which is consistent with a given matching priority. The strategy is based on an order-optimal matching algorithm originally proposed in connection with assignment problems by D. Gale. When a matching priority is given, the algorithm ensures that the discarded units have the lowest possible matching priority. We discuss the algorithm&rsquo;s complexity and its relation with classic optimal matching. We illustrate its use with a problem in a case study concerning a comparison of female and male executives and a simulation.</p>
	]]></content:encoded>

	<dc:title>Optimal Matching with Matching Priority</dc:title>
			<dc:creator>Massimo Cannas</dc:creator>
			<dc:creator>Emiliano Sironi</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010009</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-03-19</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-03-19</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>165</prism:startingPage>
		<prism:doi>10.3390/analytics3010009</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/9</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/8">

	<title>Analytics, Vol. 3, Pages 140-164: Artificial Intelligence and Sustainability&amp;mdash;A Review</title>
	<link>https://www.mdpi.com/2813-2203/3/1/8</link>
	<description>In recent decades, artificial intelligence has undergone transformative advancements, reshaping diverse sectors such as healthcare, transport, agriculture, energy, and the media. Despite the enthusiasm surrounding AI&amp;rsquo;s potential, concerns persist about its potential negative impacts, including substantial energy consumption and ethical challenges. This paper critically reviews the evolving landscape of AI sustainability, addressing economic, social, and environmental dimensions. The literature is systematically categorized into &amp;ldquo;Sustainability of AI&amp;rdquo; and &amp;ldquo;AI for Sustainability&amp;rdquo;, revealing a balanced perspective between the two. The study also identifies a notable trend towards holistic approaches, with a surge in publications and empirical studies since 2019, signaling the field&amp;rsquo;s maturity. Future research directions emphasize delving into the relatively under-explored economic dimension, aligning with the United Nations&amp;rsquo; Sustainable Development Goals (SDGs), and addressing stakeholders&amp;rsquo; influence.</description>
	<pubDate>2024-03-01</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 140-164: Artificial Intelligence and Sustainability&mdash;A Review</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/8">doi: 10.3390/analytics3010008</a></p>
	<p>Authors:
		Rachit Dhiman
		Sofia Miteff
		Yuancheng Wang
		Shih-Chi Ma
		Ramila Amirikas
		Benjamin Fabian
		</p>
	<p>In recent decades, artificial intelligence has undergone transformative advancements, reshaping diverse sectors such as healthcare, transport, agriculture, energy, and the media. Despite the enthusiasm surrounding AI&rsquo;s potential, concerns persist about its potential negative impacts, including substantial energy consumption and ethical challenges. This paper critically reviews the evolving landscape of AI sustainability, addressing economic, social, and environmental dimensions. The literature is systematically categorized into &ldquo;Sustainability of AI&rdquo; and &ldquo;AI for Sustainability&rdquo;, revealing a balanced perspective between the two. The study also identifies a notable trend towards holistic approaches, with a surge in publications and empirical studies since 2019, signaling the field&rsquo;s maturity. Future research directions emphasize delving into the relatively under-explored economic dimension, aligning with the United Nations&rsquo; Sustainable Development Goals (SDGs), and addressing stakeholders&rsquo; influence.</p>
	]]></content:encoded>

	<dc:title>Artificial Intelligence and Sustainability&amp;mdash;A Review</dc:title>
			<dc:creator>Rachit Dhiman</dc:creator>
			<dc:creator>Sofia Miteff</dc:creator>
			<dc:creator>Yuancheng Wang</dc:creator>
			<dc:creator>Shih-Chi Ma</dc:creator>
			<dc:creator>Ramila Amirikas</dc:creator>
			<dc:creator>Benjamin Fabian</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010008</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-03-01</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-03-01</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>140</prism:startingPage>
		<prism:doi>10.3390/analytics3010008</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/8</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/7">

	<title>Analytics, Vol. 3, Pages 116-139: Visual Analytics for Robust Investigations of Placental Aquaporin Gene Expression in Response to Maternal SARS-CoV-2 Infection</title>
	<link>https://www.mdpi.com/2813-2203/3/1/7</link>
	<description>The human placenta is a multifunctional, disc-shaped temporary fetal organ that develops in the uterus during pregnancy, connecting the mother and the fetus. The availability of large-scale datasets on the gene expression of placental cell types and scholarly articles documenting adverse pregnancy outcomes from maternal infection warrants the use of computational resources to aid in knowledge generation from disparate data sources. Using maternal Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection as a case study in microbial infection, we constructed integrated datasets and implemented visual analytics resources to facilitate robust investigations of placental gene expression data in the dimensions of flow, curation, and analytics. The visual analytics resources and associated datasets can support a greater understanding of SARS-CoV-2-induced changes to the human placental expression levels of 18,882 protein-coding genes and at least 1233 human gene groups/families. We focus this report on the human aquaporin gene family that encodes small integral membrane proteins initially studied for their roles in water transport across cell membranes. Aquaporin-9 (AQP9) was the only aquaporin downregulated in term placental villi from SARS-CoV-2-positive mothers. Previous studies have found that (1) oxygen signaling modulates placental development; (2) oxygen tension could modulate AQP9 expression in the human placenta; and (3) SARS-CoV-2 can disrupt the formation of oxygen-carrying red blood cells in the placenta. Thus, future research could be performed on microbial infection-induced changes to (1) the placental hematopoietic stem and progenitor cells; and (2) placental expression of human aquaporin genes, especially AQP9.</description>
	<pubDate>2024-02-05</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 116-139: Visual Analytics for Robust Investigations of Placental Aquaporin Gene Expression in Response to Maternal SARS-CoV-2 Infection</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/7">doi: 10.3390/analytics3010007</a></p>
	<p>Authors:
		Raphael D. Isokpehi
		Amos O. Abioye
		Rickeisha S. Hamilton
		Jasmin C. Fryer
		Antoinesha L. Hollman
		Antoinette M. Destefano
		Kehinde B. Ezekiel
		Tyrese L. Taylor
		Shawna F. Brooks
		Matilda O. Johnson
		Olubukola Smile
		Shirma Ramroop-Butts
		Angela U. Makolo
		Albert G. Hayward
		</p>
	<p>The human placenta is a multifunctional, disc-shaped temporary fetal organ that develops in the uterus during pregnancy, connecting the mother and the fetus. The availability of large-scale datasets on the gene expression of placental cell types and scholarly articles documenting adverse pregnancy outcomes from maternal infection warrants the use of computational resources to aid in knowledge generation from disparate data sources. Using maternal Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection as a case study in microbial infection, we constructed integrated datasets and implemented visual analytics resources to facilitate robust investigations of placental gene expression data in the dimensions of flow, curation, and analytics. The visual analytics resources and associated datasets can support a greater understanding of SARS-CoV-2-induced changes to the human placental expression levels of 18,882 protein-coding genes and at least 1233 human gene groups/families. We focus this report on the human aquaporin gene family that encodes small integral membrane proteins initially studied for their roles in water transport across cell membranes. Aquaporin-9 (AQP9) was the only aquaporin downregulated in term placental villi from SARS-CoV-2-positive mothers. Previous studies have found that (1) oxygen signaling modulates placental development; (2) oxygen tension could modulate AQP9 expression in the human placenta; and (3) SARS-CoV-2 can disrupt the formation of oxygen-carrying red blood cells in the placenta. Thus, future research could be performed on microbial infection-induced changes to (1) the placental hematopoietic stem and progenitor cells; and (2) placental expression of human aquaporin genes, especially AQP9.</p>
	]]></content:encoded>

	<dc:title>Visual Analytics for Robust Investigations of Placental Aquaporin Gene Expression in Response to Maternal SARS-CoV-2 Infection</dc:title>
			<dc:creator>Raphael D. Isokpehi</dc:creator>
			<dc:creator>Amos O. Abioye</dc:creator>
			<dc:creator>Rickeisha S. Hamilton</dc:creator>
			<dc:creator>Jasmin C. Fryer</dc:creator>
			<dc:creator>Antoinesha L. Hollman</dc:creator>
			<dc:creator>Antoinette M. Destefano</dc:creator>
			<dc:creator>Kehinde B. Ezekiel</dc:creator>
			<dc:creator>Tyrese L. Taylor</dc:creator>
			<dc:creator>Shawna F. Brooks</dc:creator>
			<dc:creator>Matilda O. Johnson</dc:creator>
			<dc:creator>Olubukola Smile</dc:creator>
			<dc:creator>Shirma Ramroop-Butts</dc:creator>
			<dc:creator>Angela U. Makolo</dc:creator>
			<dc:creator>Albert G. Hayward</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010007</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-02-05</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-02-05</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>116</prism:startingPage>
		<prism:doi>10.3390/analytics3010007</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/7</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/6">

	<title>Analytics, Vol. 3, Pages 84-115: Interoperable Information Flow as Enabler for Efficient Predictive Maintenance</title>
	<link>https://www.mdpi.com/2813-2203/3/1/6</link>
	<description>Industry 4.0 enables the modernisation of machines and opens up the digitalisation of processes in the manufacturing industry. As a result, these machines are ready for predictive maintenance as part of Industry 4.0 services. The benefit of predictive maintenance is that it can significantly extend the life of machines. The integration of predictive maintenance into existing production environments faces challenges in terms of data understanding and data preparation for machines and legacy systems. Current AI frameworks lack adequate support for the ongoing task of data integration. In this context, adequate support means that the data analyst does not need to know the technical background of the pilot&amp;rsquo;s data sources in terms of data formats and schemas. It should be possible to perform data analyses without knowing the characteristics of the pilot&amp;rsquo;s specific data sources. The aim is to achieve a seamless integration of data as information for predictive maintenance. For this purpose, the developed data-sharing infrastructure enables automatic data acquisition and data integration for AI frameworks using interoperability methods. The evaluation, based on two pilot projects, shows that the step of data understanding and data preparation for predictive maintenance is simplified and that the solution is applicable for new pilot projects.</description>
	<pubDate>2024-02-01</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 84-115: Interoperable Information Flow as Enabler for Efficient Predictive Maintenance</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/6">doi: 10.3390/analytics3010006</a></p>
	<p>Authors:
		Marco Franke
		Quan Deng
		Zisis Kyroudis
		Maria Psarodimou
		Jovana Milenkovic
		Ioannis Meintanis
		Dimitris Lokas
		Stefano Borgia
		Klaus-Dieter Thoben
		</p>
	<p>Industry 4.0 enables the modernisation of machines and opens up the digitalisation of processes in the manufacturing industry. As a result, these machines are ready for predictive maintenance as part of Industry 4.0 services. The benefit of predictive maintenance is that it can significantly extend the life of machines. The integration of predictive maintenance into existing production environments faces challenges in terms of data understanding and data preparation for machines and legacy systems. Current AI frameworks lack adequate support for the ongoing task of data integration. In this context, adequate support means that the data analyst does not need to know the technical background of the pilot&rsquo;s data sources in terms of data formats and schemas. It should be possible to perform data analyses without knowing the characteristics of the pilot&rsquo;s specific data sources. The aim is to achieve a seamless integration of data as information for predictive maintenance. For this purpose, the developed data-sharing infrastructure enables automatic data acquisition and data integration for AI frameworks using interoperability methods. The evaluation, based on two pilot projects, shows that the step of data understanding and data preparation for predictive maintenance is simplified and that the solution is applicable for new pilot projects.</p>
	]]></content:encoded>

	<dc:title>Interoperable Information Flow as Enabler for Efficient Predictive Maintenance</dc:title>
			<dc:creator>Marco Franke</dc:creator>
			<dc:creator>Quan Deng</dc:creator>
			<dc:creator>Zisis Kyroudis</dc:creator>
			<dc:creator>Maria Psarodimou</dc:creator>
			<dc:creator>Jovana Milenkovic</dc:creator>
			<dc:creator>Ioannis Meintanis</dc:creator>
			<dc:creator>Dimitris Lokas</dc:creator>
			<dc:creator>Stefano Borgia</dc:creator>
			<dc:creator>Klaus-Dieter Thoben</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010006</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-02-01</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-02-01</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>84</prism:startingPage>
		<prism:doi>10.3390/analytics3010006</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/6</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/5">

	<title>Analytics, Vol. 3, Pages 63-83: Analysing the Influence of Macroeconomic Factors on Credit Risk in the UK Banking Sector</title>
	<link>https://www.mdpi.com/2813-2203/3/1/5</link>
	<description>Macroeconomic factors have a critical impact on banking credit risk, which cannot be directly controlled by banks, and therefore, there is a need for an early credit risk warning system based on the macroeconomy. By comparing different predictive models (traditional statistical and machine learning algorithms), this study aims to examine the macroeconomic determinants&amp;rsquo; impact on the UK banking credit risk and assess the most accurate credit risk estimate using predictive analytics. This study found that the variance-based multi-split decision tree algorithm is the most precise predictive model with interpretable, reliable, and robust results. Our model performance achieved 95% accuracy and evidenced that unemployment and inflation rate are significant credit risk predictors in the UK banking context. Our findings provided valuable insights such as a positive association between credit risk and inflation, the unemployment rate, and national savings, as well as a negative relationship between credit risk and national debt, total trade deficit, and national income. In addition, we empirically showed the relationship between national savings and non-performing loans, thus proving the &amp;ldquo;paradox of thrift&amp;rdquo;. These findings benefit the credit risk management team in monitoring the macroeconomic factors&amp;rsquo; thresholds and implementing critical reforms to mitigate credit risk.</description>
	<pubDate>2024-01-26</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 63-83: Analysing the Influence of Macroeconomic Factors on Credit Risk in the UK Banking Sector</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/5">doi: 10.3390/analytics3010005</a></p>
	<p>Authors:
		Hemlata Sharma
		Aparna Andhalkar
		Oluwaseun Ajao
		Bayode Ogunleye
		</p>
	<p>Macroeconomic factors have a critical impact on banking credit risk, which cannot be directly controlled by banks, and therefore, there is a need for an early credit risk warning system based on the macroeconomy. By comparing different predictive models (traditional statistical and machine learning algorithms), this study aims to examine the macroeconomic determinants&rsquo; impact on the UK banking credit risk and assess the most accurate credit risk estimate using predictive analytics. This study found that the variance-based multi-split decision tree algorithm is the most precise predictive model with interpretable, reliable, and robust results. Our model performance achieved 95% accuracy and evidenced that unemployment and inflation rate are significant credit risk predictors in the UK banking context. Our findings provided valuable insights such as a positive association between credit risk and inflation, the unemployment rate, and national savings, as well as a negative relationship between credit risk and national debt, total trade deficit, and national income. In addition, we empirically showed the relationship between national savings and non-performing loans, thus proving the &ldquo;paradox of thrift&rdquo;. These findings benefit the credit risk management team in monitoring the macroeconomic factors&rsquo; thresholds and implementing critical reforms to mitigate credit risk.</p>
	]]></content:encoded>

	<dc:title>Analysing the Influence of Macroeconomic Factors on Credit Risk in the UK Banking Sector</dc:title>
			<dc:creator>Hemlata Sharma</dc:creator>
			<dc:creator>Aparna Andhalkar</dc:creator>
			<dc:creator>Oluwaseun Ajao</dc:creator>
			<dc:creator>Bayode Ogunleye</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010005</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-01-26</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-01-26</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>63</prism:startingPage>
		<prism:doi>10.3390/analytics3010005</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/5</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/4">

	<title>Analytics, Vol. 3, Pages 46-62: Code Plagiarism Checking Function and Its Application for Code Writing Problem in Java Programming Learning Assistant System</title>
	<link>https://www.mdpi.com/2813-2203/3/1/4</link>
	<description>A web-based Java programming learning assistant system (JPLAS) has been developed for novice students to study Java programming by themselves while enhancing code reading and code writing skills. One type of the implemented exercise problem is code writing problem (CWP), which asks students to create a source code that can pass the given test code. The correctness of this answer code is validated by running them on JUnit. In previous works, a Python-based answer code validation program was implemented to assist teachers. It automatically verifies the source codes from all the students for one test code, and reports the number of passed test cases by each code in the CSV file. While this program plays a crucial role in checking the correctness of code behaviors, it cannot detect code plagiarism that can often happen in programming courses. In this paper, we implement a code plagiarism checking function in the answer code validation program, and present its application results to a Java programming course at Okayama University, Japan. This function first removes the whitespace characters and the comments using the regular expressions. Next, it calculates the Levenshtein distance and similarity score for each pair of source codes from different students in the class. If the score is larger than a given threshold, they are regarded as plagiarism. Finally, it outputs the scores as a CSV file with the student IDs. For evaluations, we applied the proposed function to a total of 877 source codes for 45 CWP assignments submitted from 9 to 39 students and analyzed the results. It was found that (1) CWP assignments asking for shorter source codes generate higher scores than those for longer codes due to the use of test codes, (2) proper thresholds are different by assignments, and (3) some students often copied source codes from certain students.</description>
	<pubDate>2024-01-17</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 46-62: Code Plagiarism Checking Function and Its Application for Code Writing Problem in Java Programming Learning Assistant System</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/4">doi: 10.3390/analytics3010004</a></p>
	<p>Authors:
		Ei Ei Htet
		Khaing Hsu Wai
		Soe Thandar Aung
		Nobuo Funabiki
		Xiqin Lu
		Htoo Htoo Sandi Kyaw
		Wen-Chung Kao
		</p>
	<p>A web-based Java programming learning assistant system (JPLAS) has been developed for novice students to study Java programming by themselves while enhancing code reading and code writing skills. One type of the implemented exercise problem is code writing problem (CWP), which asks students to create a source code that can pass the given test code. The correctness of this answer code is validated by running them on JUnit. In previous works, a Python-based answer code validation program was implemented to assist teachers. It automatically verifies the source codes from all the students for one test code, and reports the number of passed test cases by each code in the CSV file. While this program plays a crucial role in checking the correctness of code behaviors, it cannot detect code plagiarism that can often happen in programming courses. In this paper, we implement a code plagiarism checking function in the answer code validation program, and present its application results to a Java programming course at Okayama University, Japan. This function first removes the whitespace characters and the comments using the regular expressions. Next, it calculates the Levenshtein distance and similarity score for each pair of source codes from different students in the class. If the score is larger than a given threshold, they are regarded as plagiarism. Finally, it outputs the scores as a CSV file with the student IDs. For evaluations, we applied the proposed function to a total of 877 source codes for 45 CWP assignments submitted from 9 to 39 students and analyzed the results. It was found that (1) CWP assignments asking for shorter source codes generate higher scores than those for longer codes due to the use of test codes, (2) proper thresholds are different by assignments, and (3) some students often copied source codes from certain students.</p>
	]]></content:encoded>

	<dc:title>Code Plagiarism Checking Function and Its Application for Code Writing Problem in Java Programming Learning Assistant System</dc:title>
			<dc:creator>Ei Ei Htet</dc:creator>
			<dc:creator>Khaing Hsu Wai</dc:creator>
			<dc:creator>Soe Thandar Aung</dc:creator>
			<dc:creator>Nobuo Funabiki</dc:creator>
			<dc:creator>Xiqin Lu</dc:creator>
			<dc:creator>Htoo Htoo Sandi Kyaw</dc:creator>
			<dc:creator>Wen-Chung Kao</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010004</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-01-17</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-01-17</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>46</prism:startingPage>
		<prism:doi>10.3390/analytics3010004</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/4</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/3">

	<title>Analytics, Vol. 3, Pages 30-45: An Optimal House Price Prediction Algorithm: XGBoost</title>
	<link>https://www.mdpi.com/2813-2203/3/1/3</link>
	<description>An accurate prediction of house prices is a fundamental requirement for various sectors, including real estate and mortgage lending. It is widely recognized that a property&amp;rsquo;s value is not solely determined by its physical attributes but is significantly influenced by its surrounding neighborhood. Meeting the diverse housing needs of individuals while balancing budget constraints is a primary concern for real estate developers. To this end, we addressed the house price prediction problem as a regression task and thus employed various machine learning (ML) techniques capable of expressing the significance of independent variables. We made use of the housing dataset of Ames City in Iowa, USA to compare XGBoost, support vector regressor, random forest regressor, multilayer perceptron, and multiple linear regression algorithms for house price prediction. Afterwards, we identified the key factors that influence housing costs. Our results show that XGBoost is the best performing model for house price prediction. Our findings present valuable insights and tools for stakeholders, facilitating more accurate property price estimates and, in turn, enabling more informed decision making to meet the housing needs of diverse populations while considering budget constraints.</description>
	<pubDate>2024-01-02</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 30-45: An Optimal House Price Prediction Algorithm: XGBoost</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/3">doi: 10.3390/analytics3010003</a></p>
	<p>Authors:
		Hemlata Sharma
		Hitesh Harsora
		Bayode Ogunleye
		</p>
	<p>An accurate prediction of house prices is a fundamental requirement for various sectors, including real estate and mortgage lending. It is widely recognized that a property&rsquo;s value is not solely determined by its physical attributes but is significantly influenced by its surrounding neighborhood. Meeting the diverse housing needs of individuals while balancing budget constraints is a primary concern for real estate developers. To this end, we addressed the house price prediction problem as a regression task and thus employed various machine learning (ML) techniques capable of expressing the significance of independent variables. We made use of the housing dataset of Ames City in Iowa, USA to compare XGBoost, support vector regressor, random forest regressor, multilayer perceptron, and multiple linear regression algorithms for house price prediction. Afterwards, we identified the key factors that influence housing costs. Our results show that XGBoost is the best performing model for house price prediction. Our findings present valuable insights and tools for stakeholders, facilitating more accurate property price estimates and, in turn, enabling more informed decision making to meet the housing needs of diverse populations while considering budget constraints.</p>
	]]></content:encoded>

	<dc:title>An Optimal House Price Prediction Algorithm: XGBoost</dc:title>
			<dc:creator>Hemlata Sharma</dc:creator>
			<dc:creator>Hitesh Harsora</dc:creator>
			<dc:creator>Bayode Ogunleye</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010003</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2024-01-02</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2024-01-02</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>30</prism:startingPage>
		<prism:doi>10.3390/analytics3010003</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/3</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/2">

	<title>Analytics, Vol. 3, Pages 14-29: Exploring Infant Physical Activity Using a Population-Based Network Analysis Approach</title>
	<link>https://www.mdpi.com/2813-2203/3/1/2</link>
	<description>Background: Physical activity (PA) is an important aspect of infant development and has been shown to have long-term effects on health and well-being. Accurate analysis of infant PA is crucial for understanding their physical development, monitoring health and wellness, as well as identifying areas for improvement. However, individual analysis of infant PA can be challenging and often leads to biased results due to an infant&amp;rsquo;s inability to self-report and constantly changing posture and movement. This manuscript explores a population-based network analysis approach to study infants&amp;rsquo; PA. The network analysis approach allows us to draw conclusions that are generalizable to the entire population and to identify trends and patterns in PA levels. Methods: This study aims to analyze the PA of infants aged 6&amp;ndash;15 months using accelerometer data. A total of 20 infants from different types of childcare settings were recruited, including home-based and center-based care. Each infant wore an accelerometer for four days (2 weekdays, 2 weekend days). Data were analyzed using a network analysis approach, exploring the relationship between PA and various demographic and social factors. Results: The results showed that infants in center-based care have significantly higher levels of PA than those in home-based care. Moreover, the ankle acceleration was much higher than the waist acceleration, and activity patterns differed on weekdays and weekends. Conclusions: This study highlights the need for further research to explore the factors contributing to disparities in PA levels among infants in different childcare settings. Additionally, there is a need to develop effective strategies to promote PA among infants, considering the findings from the network analysis approach. Such efforts can contribute to enhancing infant health and well-being through targeted interventions aimed at increasing PA levels.</description>
	<pubDate>2023-12-31</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 14-29: Exploring Infant Physical Activity Using a Population-Based Network Analysis Approach</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/2">doi: 10.3390/analytics3010002</a></p>
	<p>Authors:
		Rama Krishna Thelagathoti
		Priyanka Chaudhary
		Brian Knarr
		Michaela Schenkelberg
		Hesham H. Ali
		Danae Dinkel
		</p>
	<p>Background: Physical activity (PA) is an important aspect of infant development and has been shown to have long-term effects on health and well-being. Accurate analysis of infant PA is crucial for understanding their physical development, monitoring health and wellness, as well as identifying areas for improvement. However, individual analysis of infant PA can be challenging and often leads to biased results due to an infant&rsquo;s inability to self-report and constantly changing posture and movement. This manuscript explores a population-based network analysis approach to study infants&rsquo; PA. The network analysis approach allows us to draw conclusions that are generalizable to the entire population and to identify trends and patterns in PA levels. Methods: This study aims to analyze the PA of infants aged 6&ndash;15 months using accelerometer data. A total of 20 infants from different types of childcare settings were recruited, including home-based and center-based care. Each infant wore an accelerometer for four days (2 weekdays, 2 weekend days). Data were analyzed using a network analysis approach, exploring the relationship between PA and various demographic and social factors. Results: The results showed that infants in center-based care have significantly higher levels of PA than those in home-based care. Moreover, the ankle acceleration was much higher than the waist acceleration, and activity patterns differed on weekdays and weekends. Conclusions: This study highlights the need for further research to explore the factors contributing to disparities in PA levels among infants in different childcare settings. Additionally, there is a need to develop effective strategies to promote PA among infants, considering the findings from the network analysis approach. Such efforts can contribute to enhancing infant health and well-being through targeted interventions aimed at increasing PA levels.</p>
	]]></content:encoded>

	<dc:title>Exploring Infant Physical Activity Using a Population-Based Network Analysis Approach</dc:title>
			<dc:creator>Rama Krishna Thelagathoti</dc:creator>
			<dc:creator>Priyanka Chaudhary</dc:creator>
			<dc:creator>Brian Knarr</dc:creator>
			<dc:creator>Michaela Schenkelberg</dc:creator>
			<dc:creator>Hesham H. Ali</dc:creator>
			<dc:creator>Danae Dinkel</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010002</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-12-31</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-12-31</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>14</prism:startingPage>
		<prism:doi>10.3390/analytics3010002</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/2</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/3/1/1">

	<title>Analytics, Vol. 3, Pages 1-13: Does Part of Speech Have an Influence on Cyberbullying Detection?</title>
	<link>https://www.mdpi.com/2813-2203/3/1/1</link>
	<description>With the development of the Internet, the issue of cyberbullying on social media has gained significant attention. Cyberbullying is often expressed in text. Methods of identifying such text via machine learning have been growing, most of which rely on the extraction of part-of-speech (POS) tags to improve their performance. However, the current study only arbitrarily used part-of-speech labels that it considered reasonable, without investigating whether the chosen part-of-speech labels can better enhance the effectiveness of the cyberbullying detection task. In other words, the effectiveness of different part-of-speech labels in the automatic cyberbullying detection task was not proven. This study aimed to investigate the part of speech in statements related to cyberbullying and explore how three classification models (random forest, na&amp;iuml;ve Bayes, and support vector machine) are sensitive to parts of speech in detecting cyberbullying. We also examined which part-of-speech combinations are most appropriate for the models mentioned above. The results of our experiments showed that the predictive performance of different models differs when using different part-of-speech tags as inputs. Random forest showed the best predictive performance, and naive Bayes and support vector machine followed, respectively. Meanwhile, across the different models, the sensitivity to different part-of-speech tags was consistent, with greater sensitivity shown towards nouns, verbs, and measure words, and lower sensitivity shown towards adjectives and pronouns. We also found that the combination of different parts of speech as inputs had an influence on the predictive performance of the models. This study will help researchers to determine which combination of part-of-speech categories is appropriate to improve the accuracy of cyberbullying detection.</description>
	<pubDate>2023-12-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 3, Pages 1-13: Does Part of Speech Have an Influence on Cyberbullying Detection?</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/3/1/1">doi: 10.3390/analytics3010001</a></p>
	<p>Authors:
		Jingxiu Huang
		Ruofei Ding
		Yunxiang Zheng
		Xiaomin Wu
		Shumin Chen
		Xiunan Jin
		</p>
	<p>With the development of the Internet, the issue of cyberbullying on social media has gained significant attention. Cyberbullying is often expressed in text. Methods of identifying such text via machine learning have been growing, most of which rely on the extraction of part-of-speech (POS) tags to improve their performance. However, the current study only arbitrarily used part-of-speech labels that it considered reasonable, without investigating whether the chosen part-of-speech labels can better enhance the effectiveness of the cyberbullying detection task. In other words, the effectiveness of different part-of-speech labels in the automatic cyberbullying detection task was not proven. This study aimed to investigate the part of speech in statements related to cyberbullying and explore how three classification models (random forest, na&iuml;ve Bayes, and support vector machine) are sensitive to parts of speech in detecting cyberbullying. We also examined which part-of-speech combinations are most appropriate for the models mentioned above. The results of our experiments showed that the predictive performance of different models differs when using different part-of-speech tags as inputs. Random forest showed the best predictive performance, and naive Bayes and support vector machine followed, respectively. Meanwhile, across the different models, the sensitivity to different part-of-speech tags was consistent, with greater sensitivity shown towards nouns, verbs, and measure words, and lower sensitivity shown towards adjectives and pronouns. We also found that the combination of different parts of speech as inputs had an influence on the predictive performance of the models. This study will help researchers to determine which combination of part-of-speech categories is appropriate to improve the accuracy of cyberbullying detection.</p>
	]]></content:encoded>

	<dc:title>Does Part of Speech Have an Influence on Cyberbullying Detection?</dc:title>
			<dc:creator>Jingxiu Huang</dc:creator>
			<dc:creator>Ruofei Ding</dc:creator>
			<dc:creator>Yunxiang Zheng</dc:creator>
			<dc:creator>Xiaomin Wu</dc:creator>
			<dc:creator>Shumin Chen</dc:creator>
			<dc:creator>Xiunan Jin</dc:creator>
		<dc:identifier>doi: 10.3390/analytics3010001</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-12-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-12-21</prism:publicationDate>
	<prism:volume>3</prism:volume>
	<prism:number>1</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>1</prism:startingPage>
		<prism:doi>10.3390/analytics3010001</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/3/1/1</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/46">

	<title>Analytics, Vol. 2, Pages 877-898: Learning Analytics in the Era of Large Language Models</title>
	<link>https://www.mdpi.com/2813-2203/2/4/46</link>
	<description>Learning analytics (LA) has the potential to significantly improve teaching and learning, but there are still many areas for improvement in LA research and practice. The literature highlights limitations in every stage of the LA life cycle, including scarce pedagogical grounding and poor design choices in the development of LA, challenges in the implementation of LA with respect to the interpretability of insights, prediction, and actionability of feedback, and lack of generalizability and strong practices in LA evaluation. In this position paper, we advocate for empowering teachers in developing LA solutions. We argue that this would enhance the theoretical basis of LA tools and make them more understandable and practical. We present some instances where process data can be utilized to comprehend learning processes and generate more interpretable LA insights. Additionally, we investigate the potential implementation of large language models (LLMs) in LA to produce comprehensible insights, provide timely and actionable feedback, enhance personalization, and support teachers&amp;rsquo; tasks more extensively.</description>
	<pubDate>2023-11-16</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 877-898: Learning Analytics in the Era of Large Language Models</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/46">doi: 10.3390/analytics2040046</a></p>
	<p>Authors:
		Elisabetta Mazzullo
		Okan Bulut
		Tarid Wongvorachan
		Bin Tan
		</p>
	<p>Learning analytics (LA) has the potential to significantly improve teaching and learning, but there are still many areas for improvement in LA research and practice. The literature highlights limitations in every stage of the LA life cycle, including scarce pedagogical grounding and poor design choices in the development of LA, challenges in the implementation of LA with respect to the interpretability of insights, prediction, and actionability of feedback, and lack of generalizability and strong practices in LA evaluation. In this position paper, we advocate for empowering teachers in developing LA solutions. We argue that this would enhance the theoretical basis of LA tools and make them more understandable and practical. We present some instances where process data can be utilized to comprehend learning processes and generate more interpretable LA insights. Additionally, we investigate the potential implementation of large language models (LLMs) in LA to produce comprehensible insights, provide timely and actionable feedback, enhance personalization, and support teachers&rsquo; tasks more extensively.</p>
	]]></content:encoded>

	<dc:title>Learning Analytics in the Era of Large Language Models</dc:title>
			<dc:creator>Elisabetta Mazzullo</dc:creator>
			<dc:creator>Okan Bulut</dc:creator>
			<dc:creator>Tarid Wongvorachan</dc:creator>
			<dc:creator>Bin Tan</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040046</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-11-16</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-11-16</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>877</prism:startingPage>
		<prism:doi>10.3390/analytics2040046</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/46</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/45">

	<title>Analytics, Vol. 2, Pages 853-876: A Comparative Analysis of VirLock and Bacteriophage &amp;#981;6 through the Lens of Game Theory</title>
	<link>https://www.mdpi.com/2813-2203/2/4/45</link>
	<description>The novelty of this paper lies in its perspective, which underscores the fruitful correlation between biological and computer viruses. In the realm of computer science, the study of theoretical concepts often intersects with practical applications. Computer viruses have many common traits with their biological counterparts. Studying their correlation may enhance our perspective and, ultimately, augment our ability to successfully protect our computer systems and data against viruses. Game theory may be an appropriate tool for establishing the link between biological and computer viruses. In this work, we establish correlations between a well-known computer virus, VirLock, with an equally well-studied biological virus, the bacteriophage &amp;#981;6. VirLock is a formidable ransomware that encrypts user files and demands a ransom for data restoration. Drawing a parallel with the biological virus bacteriophage &amp;#981;6, we uncover conceptual links like shared attributes and behaviors, as well as useful insights. Following this line of thought, we suggest efficient strategies based on a game theory perspective, which have the potential to address the infections caused by VirLock, and other viruses with analogous behavior. Moreover, we propose mathematical formulations that integrate real-world variables, providing a means to gauge virus severity and design robust defensive strategies and analytics. This interdisciplinary inquiry, fusing game theory, biology, and computer science, advances our understanding of virus behavior, paving the way for the development of effective countermeasures while presenting an alternative viewpoint. Throughout this theoretical exploration, we contribute to the ongoing discourse on computer virus behavior and stimulate new avenues for addressing digital threats. 
In particular, the formulas and framework developed in this work can facilitate better risk analysis and assessment, and become useful tools in penetration testing analysis, helping companies and organizations enhance their security.</description>
	<pubDate>2023-11-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 853-876: A Comparative Analysis of VirLock and Bacteriophage &amp;#981;6 through the Lens of Game Theory</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/45">doi: 10.3390/analytics2040045</a></p>
	<p>Authors:
		Dimitris Kostadimas
		Kalliopi Kastampolidou
		Theodore Andronikos
		</p>
	<p>The novelty of this paper lies in its perspective, which underscores the fruitful correlation between biological and computer viruses. In the realm of computer science, the study of theoretical concepts often intersects with practical applications. Computer viruses have many common traits with their biological counterparts. Studying their correlation may enhance our perspective and, ultimately, augment our ability to successfully protect our computer systems and data against viruses. Game theory may be an appropriate tool for establishing the link between biological and computer viruses. In this work, we establish correlations between a well-known computer virus, VirLock, with an equally well-studied biological virus, the bacteriophage &#981;6. VirLock is a formidable ransomware that encrypts user files and demands a ransom for data restoration. Drawing a parallel with the biological virus bacteriophage &#981;6, we uncover conceptual links like shared attributes and behaviors, as well as useful insights. Following this line of thought, we suggest efficient strategies based on a game theory perspective, which have the potential to address the infections caused by VirLock, and other viruses with analogous behavior. Moreover, we propose mathematical formulations that integrate real-world variables, providing a means to gauge virus severity and design robust defensive strategies and analytics. This interdisciplinary inquiry, fusing game theory, biology, and computer science, advances our understanding of virus behavior, paving the way for the development of effective countermeasures while presenting an alternative viewpoint. Throughout this theoretical exploration, we contribute to the ongoing discourse on computer virus behavior and stimulate new avenues for addressing digital threats. 
In particular, the formulas and framework developed in this work can facilitate better risk analysis and assessment, and become useful tools in penetration testing analysis, helping companies and organizations enhance their security.</p>
	]]></content:encoded>

	<dc:title>A Comparative Analysis of VirLock and Bacteriophage &amp;#981;6 through the Lens of Game Theory</dc:title>
			<dc:creator>Dimitris Kostadimas</dc:creator>
			<dc:creator>Kalliopi Kastampolidou</dc:creator>
			<dc:creator>Theodore Andronikos</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040045</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-11-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-11-06</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>853</prism:startingPage>
		<prism:doi>10.3390/analytics2040045</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/45</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/44">

	<title>Analytics, Vol. 2, Pages 836-852: Can Oral Grades Predict Final Examination Scores? Case Study in a Higher Education Military Academy</title>
	<link>https://www.mdpi.com/2813-2203/2/4/44</link>
	<description>This paper investigates the correlation between oral grades and final written examination grades in a higher education military academy. A quantitative, correlational methodology utilizing linear regression analysis is employed. The data consist of undergraduate telecommunications and electronics engineering students&amp;rsquo; grades in two courses offered during the fourth year of studies, and spans six academic years. Course One covers period 2017&amp;ndash;2022, while Course Two, period 1 spans 2014&amp;ndash;2018 and period 2 spans 2019&amp;ndash;2022. In Course One oral grades are obtained by means of a midterm exam. In Course Two period 1, 30% of the oral grade comes from homework assignments and lab exercises, while the remaining 70% comes from a midterm exam. In Course Two period 2, oral grades are the result of various alternative assessment activities. In all cases, the final grade results from a traditional written examination given at the end of the semester. Correlation and predictive models between oral and final grades were examined. The results of the analysis demonstrated that, (a) under certain conditions, oral grades based more or less on midterm exams can be good predictors of final examination scores; (b) oral grades obtained through alternative assessment activities cannot predict final examination scores.</description>
	<pubDate>2023-11-02</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 836-852: Can Oral Grades Predict Final Examination Scores? Case Study in a Higher Education Military Academy</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/44">doi: 10.3390/analytics2040044</a></p>
	<p>Authors:
		Antonios Andreatos
		Apostolos Leros
		</p>
	<p>This paper investigates the correlation between oral grades and final written examination grades in a higher education military academy. A quantitative, correlational methodology utilizing linear regression analysis is employed. The data consist of undergraduate telecommunications and electronics engineering students&rsquo; grades in two courses offered during the fourth year of studies, and spans six academic years. Course One covers period 2017&ndash;2022, while Course Two, period 1 spans 2014&ndash;2018 and period 2 spans 2019&ndash;2022. In Course One oral grades are obtained by means of a midterm exam. In Course Two period 1, 30% of the oral grade comes from homework assignments and lab exercises, while the remaining 70% comes from a midterm exam. In Course Two period 2, oral grades are the result of various alternative assessment activities. In all cases, the final grade results from a traditional written examination given at the end of the semester. Correlation and predictive models between oral and final grades were examined. The results of the analysis demonstrated that, (a) under certain conditions, oral grades based more or less on midterm exams can be good predictors of final examination scores; (b) oral grades obtained through alternative assessment activities cannot predict final examination scores.</p>
	]]></content:encoded>

	<dc:title>Can Oral Grades Predict Final Examination Scores? Case Study in a Higher Education Military Academy</dc:title>
			<dc:creator>Antonios Andreatos</dc:creator>
			<dc:creator>Apostolos Leros</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040044</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-11-02</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-11-02</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>836</prism:startingPage>
		<prism:doi>10.3390/analytics2040044</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/44</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/43">

	<title>Analytics, Vol. 2, Pages 824-835: Relating the Ramsay Quotient Model to the Classical D-Scoring Rule</title>
	<link>https://www.mdpi.com/2813-2203/2/4/43</link>
	<description>In a series of papers, Dimitrov suggested the classical D-scoring rule for scoring items that give difficult items a higher weight while easier items receive a lower weight. The latent D-scoring model has been proposed to serve as a latent mirror of the classical D-scoring model. However, the item weights implied by this latent D-scoring model are typically only weakly related to the weights in the classical D-scoring model. To this end, this article proposes an alternative item response model, the modified Ramsay quotient model, that is better-suited as a latent mirror of the classical D-scoring model. The reasoning is based on analytical arguments and numerical illustrations.</description>
	<pubDate>2023-10-17</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 824-835: Relating the Ramsay Quotient Model to the Classical D-Scoring Rule</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/43">doi: 10.3390/analytics2040043</a></p>
	<p>Authors:
		Alexander Robitzsch
		</p>
	<p>In a series of papers, Dimitrov suggested the classical D-scoring rule for scoring items that give difficult items a higher weight while easier items receive a lower weight. The latent D-scoring model has been proposed to serve as a latent mirror of the classical D-scoring model. However, the item weights implied by this latent D-scoring model are typically only weakly related to the weights in the classical D-scoring model. To this end, this article proposes an alternative item response model, the modified Ramsay quotient model, that is better-suited as a latent mirror of the classical D-scoring model. The reasoning is based on analytical arguments and numerical illustrations.</p>
	]]></content:encoded>

	<dc:title>Relating the Ramsay Quotient Model to the Classical D-Scoring Rule</dc:title>
			<dc:creator>Alexander Robitzsch</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040043</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-10-17</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-10-17</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>824</prism:startingPage>
		<prism:doi>10.3390/analytics2040043</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/43</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/42">

	<title>Analytics, Vol. 2, Pages 809-823: An Exploration of Clustering Algorithms for Customer Segmentation in the UK Retail Market</title>
	<link>https://www.mdpi.com/2813-2203/2/4/42</link>
	<description>Recently, peoples&amp;rsquo; awareness of online purchases has significantly risen. This has given rise to online retail platforms and the need for a better understanding of customer purchasing behaviour. Retail companies are pressed with the need to deal with a high volume of customer purchases, which requires sophisticated approaches to perform more accurate and efficient customer segmentation. Customer segmentation is a marketing analytical tool that aids customer-centric service and thus enhances profitability. In this paper, we aim to develop a customer segmentation model to improve decision-making processes in the retail market industry. To achieve this, we employed a UK-based online retail dataset obtained from the UCI machine learning repository. The retail dataset consists of 541,909 customer records and eight features. Our study adopted the RFM (recency, frequency, and monetary) framework to quantify customer values. Thereafter, we compared several state-of-the-art (SOTA) clustering algorithms, namely, K-means clustering, the Gaussian mixture model (GMM), density-based spatial clustering of applications with noise (DBSCAN), agglomerative clustering, and balanced iterative reducing and clustering using hierarchies (BIRCH). The results showed the GMM outperformed other approaches, with a Silhouette Score of 0.80.</description>
	<pubDate>2023-10-12</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 809-823: An Exploration of Clustering Algorithms for Customer Segmentation in the UK Retail Market</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/42">doi: 10.3390/analytics2040042</a></p>
	<p>Authors:
		Jeen Mary John
		Olamilekan Shobayo
		Bayode Ogunleye
		</p>
	<p>Recently, peoples&rsquo; awareness of online purchases has significantly risen. This has given rise to online retail platforms and the need for a better understanding of customer purchasing behaviour. Retail companies are pressed with the need to deal with a high volume of customer purchases, which requires sophisticated approaches to perform more accurate and efficient customer segmentation. Customer segmentation is a marketing analytical tool that aids customer-centric service and thus enhances profitability. In this paper, we aim to develop a customer segmentation model to improve decision-making processes in the retail market industry. To achieve this, we employed a UK-based online retail dataset obtained from the UCI machine learning repository. The retail dataset consists of 541,909 customer records and eight features. Our study adopted the RFM (recency, frequency, and monetary) framework to quantify customer values. Thereafter, we compared several state-of-the-art (SOTA) clustering algorithms, namely, K-means clustering, the Gaussian mixture model (GMM), density-based spatial clustering of applications with noise (DBSCAN), agglomerative clustering, and balanced iterative reducing and clustering using hierarchies (BIRCH). The results showed the GMM outperformed other approaches, with a Silhouette Score of 0.80.</p>
	]]></content:encoded>

	<dc:title>An Exploration of Clustering Algorithms for Customer Segmentation in the UK Retail Market</dc:title>
			<dc:creator>Jeen Mary John</dc:creator>
			<dc:creator>Olamilekan Shobayo</dc:creator>
			<dc:creator>Bayode Ogunleye</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040042</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-10-12</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-10-12</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>809</prism:startingPage>
		<prism:doi>10.3390/analytics2040042</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/42</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/4/41">

	<title>Analytics, Vol. 2, Pages 781-808: A Novel Curve Clustering Method for Functional Data: Applications to COVID-19 and Financial Data</title>
	<link>https://www.mdpi.com/2813-2203/2/4/41</link>
	<description>Functional data analysis has significantly enriched the landscape of existing data analysis methodologies, providing a new framework for comprehending data structures and extracting valuable insights. This paper is dedicated to addressing functional data clustering&amp;mdash;a pivotal challenge within functional data analysis. Our contribution to this field manifests through the introduction of innovative clustering methodologies tailored specifically to functional curves. Initially, we present a proximity measure algorithm designed for functional curve clustering. This innovative clustering approach offers the flexibility to redefine measurement points on continuous functions, adapting to either equidistant or nonuniform arrangements, as dictated by the demands of the proximity measure. Central to this method is the &amp;ldquo;proximity threshold&amp;rdquo;, a critical parameter that governs the cluster count, and its selection is thoroughly explored. Subsequently, we propose a time-shift clustering algorithm designed for time-series data. This approach identifies historical data segments that share patterns similar to those observed in the present. To evaluate the effectiveness of our methodologies, we conduct comparisons with the classic K-means clustering method and apply them to simulated data, yielding encouraging simulation results. Moving beyond simulation, we apply the proposed proximity measure algorithm to COVID-19 data, yielding notable clustering accuracy. Additionally, the time-shift clustering algorithm is employed to analyse NASDAQ Composite data, successfully revealing underlying economic cycles.</description>
	<pubDate>2023-10-08</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 781-808: A Novel Curve Clustering Method for Functional Data: Applications to COVID-19 and Financial Data</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/4/41">doi: 10.3390/analytics2040041</a></p>
	<p>Authors:
		Ting Wei
		Bo Wang
		</p>
	<p>Functional data analysis has significantly enriched the landscape of existing data analysis methodologies, providing a new framework for comprehending data structures and extracting valuable insights. This paper is dedicated to addressing functional data clustering&mdash;a pivotal challenge within functional data analysis. Our contribution to this field manifests through the introduction of innovative clustering methodologies tailored specifically to functional curves. Initially, we present a proximity measure algorithm designed for functional curve clustering. This innovative clustering approach offers the flexibility to redefine measurement points on continuous functions, adapting to either equidistant or nonuniform arrangements, as dictated by the demands of the proximity measure. Central to this method is the &ldquo;proximity threshold&rdquo;, a critical parameter that governs the cluster count, and its selection is thoroughly explored. Subsequently, we propose a time-shift clustering algorithm designed for time-series data. This approach identifies historical data segments that share patterns similar to those observed in the present. To evaluate the effectiveness of our methodologies, we conduct comparisons with the classic K-means clustering method and apply them to simulated data, yielding encouraging simulation results. Moving beyond simulation, we apply the proposed proximity measure algorithm to COVID-19 data, yielding notable clustering accuracy. Additionally, the time-shift clustering algorithm is employed to analyse NASDAQ Composite data, successfully revealing underlying economic cycles.</p>
	]]></content:encoded>

	<dc:title>A Novel Curve Clustering Method for Functional Data: Applications to COVID-19 and Financial Data</dc:title>
			<dc:creator>Ting Wei</dc:creator>
			<dc:creator>Bo Wang</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2040041</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-10-08</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-10-08</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>4</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>781</prism:startingPage>
		<prism:doi>10.3390/analytics2040041</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/4/41</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/40">

	<title>Analytics, Vol. 2, Pages 745-780: Image Segmentation of the Sudd Wetlands in South Sudan for Environmental Analytics by GRASS GIS Scripts</title>
	<link>https://www.mdpi.com/2813-2203/2/3/40</link>
	<description>This paper presents the object detection algorithms GRASS GIS applied for Landsat 8-9 OLI/TIRS data. The study area includes the Sudd wetlands located in South Sudan. This study describes a programming method for the automated processing of satellite images for environmental analytics, applying the scripting algorithms of GRASS GIS. This study documents how the land cover changed and developed over time in South Sudan with varying climate and environmental settings, indicating the variations in landscape patterns. A set of modules was used to process satellite images by scripting language. It streamlines the geospatial processing tasks. The functionality of the modules of GRASS GIS to image processing is called within scripts as subprocesses which automate operations. The cutting-edge tools of GRASS GIS present a cost-effective solution to remote sensing data modelling and analysis. This is based on the discrimination of the spectral reflectance of pixels on the raster scenes. Scripting algorithms of remote sensing data processing based on the GRASS GIS syntax are run from the terminal, enabling to pass commands to the module. This ensures the automation and high speed of image processing. The algorithm challenge is that landscape patterns differ substantially, and there are nonlinear dynamics in land cover types due to environmental factors and climate effects. Time series analysis of several multispectral images demonstrated changes in land cover types over the study area of the Sudd, South Sudan affected by environmental degradation of landscapes. The map is generated for each Landsat image from 2015 to 2023 using 481 maximum-likelihood discriminant analysis approaches of classification. 
The methodology includes image segmentation by &amp;lsquo;i.segment&amp;rsquo; module, image clustering and classification by &amp;lsquo;i.cluster&amp;rsquo; and &amp;lsquo;i.maxlike&amp;rsquo; modules, accuracy assessment by &amp;lsquo;r.kappa&amp;rsquo; module, and computing NDVI and cartographic mapping implemented using GRASS GIS. The benefits of object detection techniques for image analysis are demonstrated with the reported effects of various threshold levels of segmentation. The segmentation was performed 371 times with 90% of the threshold and minsize = 5; the process was converged in 37 to 41 iterations. The following segments are defined for images: 4515 for 2015, 4813 for 2016, 4114 for 2017, 5090 for 2018, 6021 for 2019, 3187 for 2020, 2445 for 2022, and 5181 for 2023. The percent convergence is 98% for the processed images. Detecting variations in land cover patterns is possible using spaceborne datasets and advanced applications of scripting algorithms. The implications of cartographic approach for environmental landscape analysis are discussed. The algorithm for image processing is based on a set of GRASS GIS wrapper functions for automated image classification.</description>
	<pubDate>2023-09-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 745-780: Image Segmentation of the Sudd Wetlands in South Sudan for Environmental Analytics by GRASS GIS Scripts</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/40">doi: 10.3390/analytics2030040</a></p>
	<p>Authors:
		Polina Lemenkova
		</p>
	<p>This paper presents the object detection algorithms GRASS GIS applied for Landsat 8-9 OLI/TIRS data. The study area includes the Sudd wetlands located in South Sudan. This study describes a programming method for the automated processing of satellite images for environmental analytics, applying the scripting algorithms of GRASS GIS. This study documents how the land cover changed and developed over time in South Sudan with varying climate and environmental settings, indicating the variations in landscape patterns. A set of modules was used to process satellite images by scripting language. It streamlines the geospatial processing tasks. The functionality of the modules of GRASS GIS to image processing is called within scripts as subprocesses which automate operations. The cutting-edge tools of GRASS GIS present a cost-effective solution to remote sensing data modelling and analysis. This is based on the discrimination of the spectral reflectance of pixels on the raster scenes. Scripting algorithms of remote sensing data processing based on the GRASS GIS syntax are run from the terminal, enabling to pass commands to the module. This ensures the automation and high speed of image processing. The algorithm challenge is that landscape patterns differ substantially, and there are nonlinear dynamics in land cover types due to environmental factors and climate effects. Time series analysis of several multispectral images demonstrated changes in land cover types over the study area of the Sudd, South Sudan affected by environmental degradation of landscapes. The map is generated for each Landsat image from 2015 to 2023 using 481 maximum-likelihood discriminant analysis approaches of classification. 
The methodology includes image segmentation by &lsquo;i.segment&rsquo; module, image clustering and classification by &lsquo;i.cluster&rsquo; and &lsquo;i.maxlike&rsquo; modules, accuracy assessment by &rsquo;r.kappa&rsquo; module, and computing NDVI and cartographic mapping implemented using GRASS GIS. The benefits of object detection techniques for image analysis are demonstrated with the reported effects of various threshold levels of segmentation. The segmentation was performed 371 times with 90% of the threshold and minsize = 5; the process was converged in 37 to 41 iterations. The following segments are defined for images: 4515 for 2015, 4813 for 2016, 4114 for 2017, 5090 for 2018, 6021 for 2019, 3187 for 2020, 2445 for 2022, and 5181 for 2023. The percent convergence is 98% for the processed images. Detecting variations in land cover patterns is possible using spaceborne datasets and advanced applications of scripting algorithms. The implications of cartographic approach for environmental landscape analysis are discussed. The algorithm for image processing is based on a set of GRASS GIS wrapper functions for automated image classification.</p>
	]]></content:encoded>

	<dc:title>Image Segmentation of the Sudd Wetlands in South Sudan for Environmental Analytics by GRASS GIS Scripts</dc:title>
			<dc:creator>Polina Lemenkova</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030040</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-09-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-09-21</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>745</prism:startingPage>
		<prism:doi>10.3390/analytics2030040</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/40</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/39">

	<title>Analytics, Vol. 2, Pages 708-744: Application of Machine Learning and Deep Learning Models in Prostate Cancer Diagnosis Using Medical Images: A Systematic Review</title>
	<link>https://www.mdpi.com/2813-2203/2/3/39</link>
	<description>Introduction: Prostate cancer (PCa) is one of the deadliest and most common causes of malignancy and death in men worldwide, with a higher prevalence and mortality in developing countries specifically. Factors such as age, family history, race and certain genetic mutations are some of the factors contributing to the occurrence of PCa in men. Recent advances in technology and algorithms gave rise to the computer-aided diagnosis (CAD) of PCa. With the availability of medical image datasets and emerging trends in state-of-the-art machine and deep learning techniques, there has been a growth in recent related publications. Materials and Methods: In this study, we present a systematic review of PCa diagnosis with medical images using machine learning and deep learning techniques. We conducted a thorough review of the relevant studies indexed in four databases (IEEE, PubMed, Springer and ScienceDirect) using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. With well-defined search terms, a total of 608 articles were identified, and 77 met the final inclusion criteria. The key elements in the included papers are presented and conclusions are drawn from them. Results: The findings show that the United States has the most research in PCa diagnosis with machine learning, Magnetic Resonance Images are the most used datasets and transfer learning is the most used method of diagnosing PCa in recent times. In addition, some available PCa datasets and some key considerations for the choice of loss function in the deep learning models are presented. The limitations and lessons learnt are discussed, and some key recommendations are made. Conclusion: The discoveries and the conclusions of this work are organized so as to enable researchers in the same domain to use this work and make crucial implementation decisions.</description>
	<pubDate>2023-09-19</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 708-744: Application of Machine Learning and Deep Learning Models in Prostate Cancer Diagnosis Using Medical Images: A Systematic Review</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/39">doi: 10.3390/analytics2030039</a></p>
	<p>Authors:
		Olusola Olabanjo
		Ashiribo Wusu
		Mauton Asokere
		Oseni Afisi
		Basheerat Okugbesan
		Olufemi Olabanjo
		Olusegun Folorunso
		Manuel Mazzara
		</p>
	<p>Introduction: Prostate cancer (PCa) is one of the deadliest and most common causes of malignancy and death in men worldwide, with a higher prevalence and mortality in developing countries specifically. Factors such as age, family history, race and certain genetic mutations are some of the factors contributing to the occurrence of PCa in men. Recent advances in technology and algorithms gave rise to the computer-aided diagnosis (CAD) of PCa. With the availability of medical image datasets and emerging trends in state-of-the-art machine and deep learning techniques, there has been a growth in recent related publications. Materials and Methods: In this study, we present a systematic review of PCa diagnosis with medical images using machine learning and deep learning techniques. We conducted a thorough review of the relevant studies indexed in four databases (IEEE, PubMed, Springer and ScienceDirect) using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. With well-defined search terms, a total of 608 articles were identified, and 77 met the final inclusion criteria. The key elements in the included papers are presented and conclusions are drawn from them. Results: The findings show that the United States has the most research in PCa diagnosis with machine learning, Magnetic Resonance Images are the most used datasets and transfer learning is the most used method of diagnosing PCa in recent times. In addition, some available PCa datasets and some key considerations for the choice of loss function in the deep learning models are presented. The limitations and lessons learnt are discussed, and some key recommendations are made. Conclusion: The discoveries and the conclusions of this work are organized so as to enable researchers in the same domain to use this work and make crucial implementation decisions.</p>
	]]></content:encoded>

	<dc:title>Application of Machine Learning and Deep Learning Models in Prostate Cancer Diagnosis Using Medical Images: A Systematic Review</dc:title>
			<dc:creator>Olusola Olabanjo</dc:creator>
			<dc:creator>Ashiribo Wusu</dc:creator>
			<dc:creator>Mauton Asokere</dc:creator>
			<dc:creator>Oseni Afisi</dc:creator>
			<dc:creator>Basheerat Okugbesan</dc:creator>
			<dc:creator>Olufemi Olabanjo</dc:creator>
			<dc:creator>Olusegun Folorunso</dc:creator>
			<dc:creator>Manuel Mazzara</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030039</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-09-19</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-09-19</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>708</prism:startingPage>
		<prism:doi>10.3390/analytics2030039</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/39</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/38">

	<title>Analytics, Vol. 2, Pages 694-707: The Use of a Large Language Model for Cyberbullying Detection</title>
	<link>https://www.mdpi.com/2813-2203/2/3/38</link>
	<description>The dominance of social media has added to the channels of bullying for perpetrators. Unfortunately, cyberbullying (CB) is the most prevalent phenomenon in today&amp;rsquo;s cyber world, and is a severe threat to the mental and physical health of citizens. This opens the need to develop a robust system to prevent bullying content from online forums, blogs, and social media platforms to manage the impact in our society. Several machine learning (ML) algorithms have been proposed for this purpose. However, their performances are not consistent due to high class imbalance and generalisation issues. In recent years, large language models (LLMs) like BERT and RoBERTa have achieved state-of-the-art (SOTA) results in several natural language processing (NLP) tasks. Unfortunately, the LLMs have not been applied extensively for CB detection. In our paper, we explored the use of these models for cyberbullying (CB) detection. We have prepared a new dataset (D2) from existing studies (Formspring and Twitter). Our experimental results for dataset D1 and D2 showed that RoBERTa outperformed other models.</description>
	<pubDate>2023-09-06</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 694-707: The Use of a Large Language Model for Cyberbullying Detection</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/38">doi: 10.3390/analytics2030038</a></p>
	<p>Authors:
		Bayode Ogunleye
		Babitha Dharmaraj
		</p>
	<p>The dominance of social media has added to the channels of bullying for perpetrators. Unfortunately, cyberbullying (CB) is the most prevalent phenomenon in today&rsquo;s cyber world, and is a severe threat to the mental and physical health of citizens. This opens the need to develop a robust system to prevent bullying content from online forums, blogs, and social media platforms to manage the impact in our society. Several machine learning (ML) algorithms have been proposed for this purpose. However, their performances are not consistent due to high class imbalance and generalisation issues. In recent years, large language models (LLMs) like BERT and RoBERTa have achieved state-of-the-art (SOTA) results in several natural language processing (NLP) tasks. Unfortunately, the LLMs have not been applied extensively for CB detection. In our paper, we explored the use of these models for cyberbullying (CB) detection. We have prepared a new dataset (D2) from existing studies (Formspring and Twitter). Our experimental results for dataset D1 and D2 showed that RoBERTa outperformed other models.</p>
	]]></content:encoded>

	<dc:title>The Use of a Large Language Model for Cyberbullying Detection</dc:title>
			<dc:creator>Bayode Ogunleye</dc:creator>
			<dc:creator>Babitha Dharmaraj</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030038</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-09-06</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-09-06</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>694</prism:startingPage>
		<prism:doi>10.3390/analytics2030038</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/38</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/37">

	<title>Analytics, Vol. 2, Pages 676-693: Heterogeneous Ensemble for Medical Data Classification</title>
	<link>https://www.mdpi.com/2813-2203/2/3/37</link>
	<description>For robust classification, selecting a proper classifier is of primary importance. However, selecting the best classifiers depends on the problem, as some classifiers work better at some tasks than on others. Despite the many results collected in the literature, the support vector machine (SVM) remains the leading adopted solution in many domains, thanks to its ease of use. In this paper, we propose a new method based on convolutional neural networks (CNNs) as an alternative to SVM. CNNs are specialized in processing data in a grid-like topology that usually represents images. To enable CNNs to work on different data types, we investigate reshaping one-dimensional vector representations into two-dimensional matrices and compared different approaches for feeding standard CNNs using two-dimensional feature vector representations. We evaluate the different techniques proposing a heterogeneous ensemble based on three classifiers: an SVM, a model based on random subspace of rotation boosting (RB), and a CNN. The robustness of our approach is tested across a set of benchmark datasets that represent a wide range of medical classification tasks. The proposed ensembles provide promising performance on all datasets.</description>
	<pubDate>2023-09-04</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 676-693: Heterogeneous Ensemble for Medical Data Classification</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/37">doi: 10.3390/analytics2030037</a></p>
	<p>Authors:
		Loris Nanni
		Sheryl Brahnam
		Andrea Loreggia
		Leonardo Barcellona
		</p>
	<p>For robust classification, selecting a proper classifier is of primary importance. However, selecting the best classifiers depends on the problem, as some classifiers work better at some tasks than on others. Despite the many results collected in the literature, the support vector machine (SVM) remains the leading adopted solution in many domains, thanks to its ease of use. In this paper, we propose a new method based on convolutional neural networks (CNNs) as an alternative to SVM. CNNs are specialized in processing data in a grid-like topology that usually represents images. To enable CNNs to work on different data types, we investigate reshaping one-dimensional vector representations into two-dimensional matrices and compared different approaches for feeding standard CNNs using two-dimensional feature vector representations. We evaluate the different techniques proposing a heterogeneous ensemble based on three classifiers: an SVM, a model based on random subspace of rotation boosting (RB), and a CNN. The robustness of our approach is tested across a set of benchmark datasets that represent a wide range of medical classification tasks. The proposed ensembles provide promising performance on all datasets.</p>
	]]></content:encoded>

	<dc:title>Heterogeneous Ensemble for Medical Data Classification</dc:title>
			<dc:creator>Loris Nanni</dc:creator>
			<dc:creator>Sheryl Brahnam</dc:creator>
			<dc:creator>Andrea Loreggia</dc:creator>
			<dc:creator>Leonardo Barcellona</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030037</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-09-04</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-09-04</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>676</prism:startingPage>
		<prism:doi>10.3390/analytics2030037</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/37</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/36">

	<title>Analytics, Vol. 2, Pages 656-675: Surgery Scheduling and Perioperative Care: Smoothing and Visualizing Elective Surgery and Recovery Patient Flow</title>
	<link>https://www.mdpi.com/2813-2203/2/3/36</link>
	<description>This paper addresses the practical problem of scheduling operating room (OR) elective surgeries to minimize the likelihood of surgical delays caused by the unavailability of capacity for patient recovery in a central post-anesthesia care unit (PACU). We segregate patients according to their patterns of flow through a multi-stage perioperative system and use characteristics of surgery type and surgeon booking times to predict time intervals for patient procedures and subsequent recoveries. Working with a hospital in which 50+ procedures are performed in 15+ ORs most weekdays, we develop a constraint programming (CP) model that takes the hospital&amp;rsquo;s elective surgery pre-schedule as input and produces a recommended alternate schedule designed to minimize the expected peak number of patients in the PACU over the course of the day. Our model was developed from the hospital&amp;rsquo;s data and evaluated through its application to daily schedules during a testing period. Schedules generated by our model indicated the potential to reduce the peak PACU load substantially, 20-30% during most days in our study period, or alternatively reduce average patient flow time by up to 15% given the same PACU peak load. We also developed tools for schedule visualization that can be used to aid management both before and after surgery day; plan PACU resources; propose critical schedule changes; identify the timing, location, and root causes of delay; and to discern the differences in surgical specialty case mixes and their potential impacts on the system. This work is especially timely given high surgical wait times in Ontario which even got worse due to the COVID-19 pandemic.</description>
	<pubDate>2023-08-21</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 656-675: Surgery Scheduling and Perioperative Care: Smoothing and Visualizing Elective Surgery and Recovery Patient Flow</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/36">doi: 10.3390/analytics2030036</a></p>
	<p>Authors:
		John S. F. Lyons
		Mehmet A. Begen
		Peter C. Bell
		</p>
	<p>This paper addresses the practical problem of scheduling operating room (OR) elective surgeries to minimize the likelihood of surgical delays caused by the unavailability of capacity for patient recovery in a central post-anesthesia care unit (PACU). We segregate patients according to their patterns of flow through a multi-stage perioperative system and use characteristics of surgery type and surgeon booking times to predict time intervals for patient procedures and subsequent recoveries. Working with a hospital in which 50+ procedures are performed in 15+ ORs most weekdays, we develop a constraint programming (CP) model that takes the hospital&rsquo;s elective surgery pre-schedule as input and produces a recommended alternate schedule designed to minimize the expected peak number of patients in the PACU over the course of the day. Our model was developed from the hospital&rsquo;s data and evaluated through its application to daily schedules during a testing period. Schedules generated by our model indicated the potential to reduce the peak PACU load substantially, 20-30% during most days in our study period, or alternatively reduce average patient flow time by up to 15% given the same PACU peak load. We also developed tools for schedule visualization that can be used to aid management both before and after surgery day; plan PACU resources; propose critical schedule changes; identify the timing, location, and root causes of delay; and to discern the differences in surgical specialty case mixes and their potential impacts on the system. This work is especially timely given high surgical wait times in Ontario which even got worse due to the COVID-19 pandemic.</p>
	]]></content:encoded>

	<dc:title>Surgery Scheduling and Perioperative Care: Smoothing and Visualizing Elective Surgery and Recovery Patient Flow</dc:title>
			<dc:creator>John S. F. Lyons</dc:creator>
			<dc:creator>Mehmet A. Begen</dc:creator>
			<dc:creator>Peter C. Bell</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030036</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-08-21</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-08-21</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>656</prism:startingPage>
		<prism:doi>10.3390/analytics2030036</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/36</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/35">

	<title>Analytics, Vol. 2, Pages 618-655: Cyberpsychology: A Longitudinal Analysis of Cyber Adversarial Tactics and Techniques</title>
	<link>https://www.mdpi.com/2813-2203/2/3/35</link>
	<description>The rapid proliferation of cyberthreats necessitates a robust understanding of their evolution and associated tactics, as found in this study. A longitudinal analysis of these threats was conducted, utilizing a six-year data set obtained from a deception network, which emphasized its significance in the study&amp;rsquo;s primary aim: the exhaustive exploration of the tactics and strategies utilized by cybercriminals and how these tactics and techniques evolved in sophistication and target specificity over time. Different cyberattack instances were dissected and interpreted, with the patterns behind target selection shown. The focus was on unveiling patterns behind target selection and highlighting recurring techniques and emerging trends. The study&amp;rsquo;s methodological design incorporated data preprocessing, exploratory data analysis, clustering and anomaly detection, temporal analysis, and cross-referencing. The validation process underscored the reliability and robustness of the findings, providing evidence of increasingly sophisticated, targeted cyberattacks. The work identified three distinct network traffic behavior clusters and temporal attack patterns. A validated scoring mechanism provided a benchmark for network anomalies, applicable for predictive analysis and facilitating comparative study of network behaviors. This benchmarking aids organizations in proactively identifying and responding to potential threats. The study significantly contributed to the cybersecurity discourse, offering insights that could guide the development of more effective defense strategies. The need for further investigation into the nature of detected anomalies was acknowledged, advocating for continuous research and proactive defense strategies in the face of the constantly evolving landscape of cyberthreats.</description>
	<pubDate>2023-08-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 618-655: Cyberpsychology: A Longitudinal Analysis of Cyber Adversarial Tactics and Techniques</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/35">doi: 10.3390/analytics2030035</a></p>
	<p>Authors:
		Marshall S. Rich
		</p>
	<p>The rapid proliferation of cyberthreats necessitates a robust understanding of their evolution and associated tactics, as found in this study. A longitudinal analysis of these threats was conducted, utilizing a six-year data set obtained from a deception network, which emphasized its significance in the study&rsquo;s primary aim: the exhaustive exploration of the tactics and strategies utilized by cybercriminals and how these tactics and techniques evolved in sophistication and target specificity over time. Different cyberattack instances were dissected and interpreted, with the patterns behind target selection shown. The focus was on unveiling patterns behind target selection and highlighting recurring techniques and emerging trends. The study&rsquo;s methodological design incorporated data preprocessing, exploratory data analysis, clustering and anomaly detection, temporal analysis, and cross-referencing. The validation process underscored the reliability and robustness of the findings, providing evidence of increasingly sophisticated, targeted cyberattacks. The work identified three distinct network traffic behavior clusters and temporal attack patterns. A validated scoring mechanism provided a benchmark for network anomalies, applicable for predictive analysis and facilitating comparative study of network behaviors. This benchmarking aids organizations in proactively identifying and responding to potential threats. The study significantly contributed to the cybersecurity discourse, offering insights that could guide the development of more effective defense strategies. The need for further investigation into the nature of detected anomalies was acknowledged, advocating for continuous research and proactive defense strategies in the face of the constantly evolving landscape of cyberthreats.</p>
	]]></content:encoded>

	<dc:title>Cyberpsychology: A Longitudinal Analysis of Cyber Adversarial Tactics and Techniques</dc:title>
			<dc:creator>Marshall S. Rich</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030035</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-08-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-08-11</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>618</prism:startingPage>
		<prism:doi>10.3390/analytics2030035</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/35</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/34">

	<title>Analytics, Vol. 2, Pages 604-617: Prediction of Stroke Disease with Demographic and Behavioural Data Using Random Forest Algorithm</title>
	<link>https://www.mdpi.com/2813-2203/2/3/34</link>
	<description>Stroke is a major cause of death worldwide, resulting from a blockage in the flow of blood to different parts of the brain. Many studies have proposed a stroke disease prediction model using medical features applied to deep learning (DL) algorithms to reduce its occurrence. However, these studies pay less attention to the predictors (both demographic and behavioural). Our study considers interpretability, robustness, and generalisation as key themes for deploying algorithms in the medical domain. Based on this background, we propose the use of random forest for stroke incidence prediction. Results from our experiment showed that random forest (RF) outperformed decision tree (DT) and logistic regression (LR) with a macro F1 score of 94%. Our findings indicated age and body mass index (BMI) as the most significant predictors of stroke disease incidence.</description>
	<pubDate>2023-08-02</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 604-617: Prediction of Stroke Disease with Demographic and Behavioural Data Using Random Forest Algorithm</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/34">doi: 10.3390/analytics2030034</a></p>
	<p>Authors:
		Olamilekan Shobayo
		Oluwafemi Zachariah
		Modupe Olufunke Odusami
		Bayode Ogunleye
		</p>
	<p>Stroke is a major cause of death worldwide, resulting from a blockage in the flow of blood to different parts of the brain. Many studies have proposed a stroke disease prediction model using medical features applied to deep learning (DL) algorithms to reduce its occurrence. However, these studies pay less attention to the predictors (both demographic and behavioural). Our study considers interpretability, robustness, and generalisation as key themes for deploying algorithms in the medical domain. Based on this background, we propose the use of random forest for stroke incidence prediction. Results from our experiment showed that random forest (RF) outperformed decision tree (DT) and logistic regression (LR) with a macro F1 score of 94%. Our findings indicated age and body mass index (BMI) as the most significant predictors of stroke disease incidence.</p>
	]]></content:encoded>

	<dc:title>Prediction of Stroke Disease with Demographic and Behavioural Data Using Random Forest Algorithm</dc:title>
			<dc:creator>Olamilekan Shobayo</dc:creator>
			<dc:creator>Oluwafemi Zachariah</dc:creator>
			<dc:creator>Modupe Olufunke Odusami</dc:creator>
			<dc:creator>Bayode Ogunleye</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030034</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-08-02</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-08-02</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>604</prism:startingPage>
		<prism:doi>10.3390/analytics2030034</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/34</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/33">

	<title>Analytics, Vol. 2, Pages 592-603: Identification of Patterns in the Stock Market through Unsupervised Algorithms</title>
	<link>https://www.mdpi.com/2813-2203/2/3/33</link>
	<description>Making predictions in the stock market is a challenging task. At the same time, several studies have focused on forecasting the future behavior of the market and classifying financial assets. A different approach is to classify correlated data to discover patterns and atypical behaviors in them. In this study, we propose applying unsupervised algorithms to process, model, and cluster related data from two different data sources, i.e., Google News and Yahoo Finance, to identify conditions in the stock market that might help to support the investment decision-making process. We applied principal component analysis (PCA) and a k-means clustering approach to group data according to their principal characteristics. We identified four conditions in the stock market, one comprising the least amount of data, characterized by high volatility. The main results show that, regularly, the stock market tends to have a steady performance. However, atypical conditions are conducive to higher volatility.</description>
	<pubDate>2023-07-27</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 592-603: Identification of Patterns in the Stock Market through Unsupervised Algorithms</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/33">doi: 10.3390/analytics2030033</a></p>
	<p>Authors:
		Adrian Barradas
		Rosa-Maria Canton-Croda
		Damian-Emilio Gibaja-Romero
		</p>
	<p>Making predictions in the stock market is a challenging task. At the same time, several studies have focused on forecasting the future behavior of the market and classifying financial assets. A different approach is to classify correlated data to discover patterns and atypical behaviors in them. In this study, we propose applying unsupervised algorithms to process, model, and cluster related data from two different data sources, i.e., Google News and Yahoo Finance, to identify conditions in the stock market that might help to support the investment decision-making process. We applied principal component analysis (PCA) and a k-means clustering approach to group data according to their principal characteristics. We identified four conditions in the stock market, one comprising the least amount of data, characterized by high volatility. The main results show that, regularly, the stock market tends to have a steady performance. However, atypical conditions are conducive to higher volatility.</p>
	]]></content:encoded>

	<dc:title>Identification of Patterns in the Stock Market through Unsupervised Algorithms</dc:title>
			<dc:creator>Adrian Barradas</dc:creator>
			<dc:creator>Rosa-Maria Canton-Croda</dc:creator>
			<dc:creator>Damian-Emilio Gibaja-Romero</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030033</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-07-27</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-07-27</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>592</prism:startingPage>
		<prism:doi>10.3390/analytics2030033</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/33</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/32">

	<title>Analytics, Vol. 2, Pages 577-591: Streamflow Estimation through Coupling of Hierarchical Clustering Analysis and Regression Analysis&amp;mdash;A Case Study in Euphrates-Tigris Basin</title>
	<link>https://www.mdpi.com/2813-2203/2/3/32</link>
	<description>In this study, the resilience of designed water systems in the face of limited streamflow gauging stations and escalating global warming impacts were investigated. By performing a regression analysis, simulated meteorological data with observed streamflow from 1971 to 2020 across 33 stream gauging stations in the Euphrates-Tigris Basin were correlated. Utilizing the Ordinary Least Squares regression method, streamflow for 2020&amp;ndash;2100 using simulated meteorological data under RCP 4.5 and RCP 8.5 scenarios in CORDEX-EURO and CORDEX-MENA domains were also predicted. Streamflow variability was calculated based on meteorological variables and station morphological characteristics, particularly evapotranspiration. Hierarchical clustering analysis identified two clusters among the stream gauging stations, and for each cluster, two streamflow equations were derived. The regression analysis achieved robust streamflow predictions using six representative climate variables, with adj. R2 values of 0.7&amp;ndash;0.85 across all models, primarily influenced by evapotranspiration. The use of a global model led to a 10% decrease in prediction capabilities for all CORDEX models based on R2 performance. This study emphasizes the importance of region homogeneity in estimating streamflow, encompassing both geographical and hydro-meteorological characteristics.</description>
	<pubDate>2023-07-13</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 577-591: Streamflow Estimation through Coupling of Hierarchical Clustering Analysis and Regression Analysis&mdash;A Case Study in Euphrates-Tigris Basin</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/32">doi: 10.3390/analytics2030032</a></p>
	<p>Authors:
		Goksel Ezgi Guzey
		Bihrat Onoz
		</p>
	<p>In this study, the resilience of designed water systems in the face of limited streamflow gauging stations and escalating global warming impacts were investigated. By performing a regression analysis, simulated meteorological data with observed streamflow from 1971 to 2020 across 33 stream gauging stations in the Euphrates-Tigris Basin were correlated. Utilizing the Ordinary Least Squares regression method, streamflow for 2020&ndash;2100 using simulated meteorological data under RCP 4.5 and RCP 8.5 scenarios in CORDEX-EURO and CORDEX-MENA domains were also predicted. Streamflow variability was calculated based on meteorological variables and station morphological characteristics, particularly evapotranspiration. Hierarchical clustering analysis identified two clusters among the stream gauging stations, and for each cluster, two streamflow equations were derived. The regression analysis achieved robust streamflow predictions using six representative climate variables, with adj. R2 values of 0.7&ndash;0.85 across all models, primarily influenced by evapotranspiration. The use of a global model led to a 10% decrease in prediction capabilities for all CORDEX models based on R2 performance. This study emphasizes the importance of region homogeneity in estimating streamflow, encompassing both geographical and hydro-meteorological characteristics.</p>
	]]></content:encoded>

	<dc:title>Streamflow Estimation through Coupling of Hierarchical Clustering Analysis and Regression Analysis&amp;mdash;A Case Study in Euphrates-Tigris Basin</dc:title>
			<dc:creator>Goksel Ezgi Guzey</dc:creator>
			<dc:creator>Bihrat Onoz</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030032</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-07-13</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-07-13</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>577</prism:startingPage>
		<prism:doi>10.3390/analytics2030032</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/32</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/31">

	<title>Analytics, Vol. 2, Pages 560-576: Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading</title>
	<link>https://www.mdpi.com/2813-2203/2/3/31</link>
	<description>We present a hierarchical reinforcement learning (RL) architecture that employs various low-level agents to act in the trading environment, i.e., the market. The highest-level agent selects from among a group of specialized agents, and then the selected agent decides when to sell or buy a single asset for a period of time. This period can be variable according to a termination function. We hypothesized that, due to different market regimes, more than one single agent is needed when trying to learn from such heterogeneous data, and instead, multiple agents will perform better, with each one specializing in a subset of the data. We use k-means clustering to partition the data and train each agent with a different cluster. Partitioning the input data also helps model-based RL (MBRL), where models can be heterogeneous. We also add two simple decision-making models to the set of low-level agents, diversifying the pool of available agents, and thus increasing overall behavioral flexibility. We perform multiple experiments showing the strengths of a hierarchical approach and test various prediction models at both levels. We also use a risk-based reward at the high level, which transforms the overall problem into a risk-return optimization. This type of reward shows a significant reduction in risk while minimally reducing profits. Overall, the hierarchical approach shows significant promise, especially when the pool of low-level agents is highly diverse. The usefulness of such a system is clear, especially for human-devised strategies, which could be incorporated in a sound manner into larger, powerful automatic systems.</description>
	<pubDate>2023-07-11</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 560-576: Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/31">doi: 10.3390/analytics2030031</a></p>
	<p>Authors:
		Adrian Millea
		</p>
	<p>We present a hierarchical reinforcement learning (RL) architecture that employs various low-level agents to act in the trading environment, i.e., the market. The highest-level agent selects from among a group of specialized agents, and then the selected agent decides when to sell or buy a single asset for a period of time. This period can be variable according to a termination function. We hypothesized that, due to different market regimes, more than one single agent is needed when trying to learn from such heterogeneous data, and instead, multiple agents will perform better, with each one specializing in a subset of the data. We use k-means clustering to partition the data and train each agent with a different cluster. Partitioning the input data also helps model-based RL (MBRL), where models can be heterogeneous. We also add two simple decision-making models to the set of low-level agents, diversifying the pool of available agents, and thus increasing overall behavioral flexibility. We perform multiple experiments showing the strengths of a hierarchical approach and test various prediction models at both levels. We also use a risk-based reward at the high level, which transforms the overall problem into a risk-return optimization. This type of reward shows a significant reduction in risk while minimally reducing profits. Overall, the hierarchical approach shows significant promise, especially when the pool of low-level agents is highly diverse. The usefulness of such a system is clear, especially for human-devised strategies, which could be incorporated in a sound manner into larger, powerful automatic systems.</p>
	]]></content:encoded>

	<dc:title>Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading</dc:title>
			<dc:creator>Adrian Millea</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030031</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-07-11</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-07-11</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>560</prism:startingPage>
		<prism:doi>10.3390/analytics2030031</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/31</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/3/30">

	<title>Analytics, Vol. 2, Pages 546-559: occams: A Text Summarization Package</title>
	<link>https://www.mdpi.com/2813-2203/2/3/30</link>
	<description>Extractive text summarization selects a small subset of sentences from a document, which gives good &amp;ldquo;coverage&amp;rdquo; of a document. When given a set of term weights indicating the importance of the terms, the concept of coverage may be formalized into a combinatorial optimization problem known as the budgeted maximum coverage problem. Extractive methods in this class are known to be among the best of classic extractive summarization systems. This paper gives a synopsis of the software package occams, which is a multilingual extractive single and multi-document summarization package based on an algorithm giving an optimal approximation to the budgeted maximum coverage problem. The occams package is written in Python and provides an easy-to-use modular interface, allowing it to work in conjunction with popular Python NLP packages, such as nltk, stanza or spacy.</description>
	<pubDate>2023-06-30</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 546-559: occams: A Text Summarization Package</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/3/30">doi: 10.3390/analytics2030030</a></p>
	<p>Authors:
		Clinton T. White
		Neil P. Molino
		Julia S. Yang
		John M. Conroy
		</p>
	<p>Extractive text summarization selects a small subset of sentences from a document, which gives good &ldquo;coverage&rdquo; of a document. When given a set of term weights indicating the importance of the terms, the concept of coverage may be formalized into a combinatorial optimization problem known as the budgeted maximum coverage problem. Extractive methods in this class are known to be among the best of classic extractive summarization systems. This paper gives a synopsis of the software package occams, which is a multilingual extractive single and multi-document summarization package based on an algorithm giving an optimal approximation to the budgeted maximum coverage problem. The occams package is written in Python and provides an easy-to-use modular interface, allowing it to work in conjunction with popular Python NLP packages, such as nltk, stanza or spacy.</p>
	]]></content:encoded>

	<dc:title>occams: A Text Summarization Package</dc:title>
			<dc:creator>Clinton T. White</dc:creator>
			<dc:creator>Neil P. Molino</dc:creator>
			<dc:creator>Julia S. Yang</dc:creator>
			<dc:creator>John M. Conroy</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2030030</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-06-30</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-06-30</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>3</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>546</prism:startingPage>
		<prism:doi>10.3390/analytics2030030</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/3/30</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/2/29">

	<title>Analytics, Vol. 2, Pages 530-545: Bayesian Mixture Copula Estimation and Selection with Applications</title>
	<link>https://www.mdpi.com/2813-2203/2/2/29</link>
	<description>Mixture copulas are popular and essential tools for studying complex dependencies among variables. However, selecting the correct mixture models often involves repeated testing and estimations using criteria such as AIC, which could require effort and time. In this paper, we propose a method that would enable us to select and estimate the correct mixture copulas simultaneously. This is accomplished by first overfitting the model and then conducting the Bayesian estimations. We verify the correctness of our approach by numerical simulations. Finally, the real data analysis is performed by studying the dependencies among three major financial markets.</description>
	<pubDate>2023-06-15</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 530-545: Bayesian Mixture Copula Estimation and Selection with Applications</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/2/29">doi: 10.3390/analytics2020029</a></p>
	<p>Authors:
		Yujian Liu
		Dejun Xie
		Siyi Yu
		</p>
	<p>Mixture copulas are popular and essential tools for studying complex dependencies among variables. However, selecting the correct mixture models often involves repeated testing and estimations using criteria such as AIC, which could require effort and time. In this paper, we propose a method that would enable us to select and estimate the correct mixture copulas simultaneously. This is accomplished by first overfitting the model and then conducting the Bayesian estimations. We verify the correctness of our approach by numerical simulations. Finally, the real data analysis is performed by studying the dependencies among three major financial markets.</p>
	]]></content:encoded>

	<dc:title>Bayesian Mixture Copula Estimation and Selection with Applications</dc:title>
			<dc:creator>Yujian Liu</dc:creator>
			<dc:creator>Dejun Xie</dc:creator>
			<dc:creator>Siyi Yu</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2020029</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-06-15</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-06-15</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>530</prism:startingPage>
		<prism:doi>10.3390/analytics2020029</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/2/29</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/2/28">

	<title>Analytics, Vol. 2, Pages 509-529: Preliminary Perspectives on Information Passing in the Intelligence Community</title>
	<link>https://www.mdpi.com/2813-2203/2/2/28</link>
	<description>Analyst sensemaking research typically focuses on individual or small groups conducting intelligence tasks. This has helped understand information retrieval tasks and how people communicate information. As a part of the grand challenge of the Summer Conference on Applied Data Science (SCADS) to build a system that can generate tailored daily reports (TLDR) for intelligence analysts, we conducted a qualitative interview study with analysts to increase understanding of information passing in the intelligence community. While our results are preliminary, we expect that this work will contribute to a better understanding of the information ecosystem of the intelligence community, how institutional dynamics affect information passing, and what implications this has for a TLDR system. This work describes our involvement in and work completed during SCADS. Although preliminary, we identify that information passing is both a formal and informal process and often follows professional networks due especially to the small population and specialization of work. We call attention to the need for future analysis of information ecosystems to better support tailored information retrieval features.</description>
	<pubDate>2023-06-15</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 509-529: Preliminary Perspectives on Information Passing in the Intelligence Community</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/2/28">doi: 10.3390/analytics2020028</a></p>
	<p>Authors:
		Jeremy E. Block
		Ilana Bookner
		Sharon Lynn Chu
		R. Jordan Crouser
		Donald R. Honeycutt
		Rebecca M. Jonas
		Abhishek Kulkarni
		Yancy Vance Paredes
		Eric D. Ragan
		</p>
	<p>Analyst sensemaking research typically focuses on individual or small groups conducting intelligence tasks. This has helped understand information retrieval tasks and how people communicate information. As a part of the grand challenge of the Summer Conference on Applied Data Science (SCADS) to build a system that can generate tailored daily reports (TLDR) for intelligence analysts, we conducted a qualitative interview study with analysts to increase understanding of information passing in the intelligence community. While our results are preliminary, we expect that this work will contribute to a better understanding of the information ecosystem of the intelligence community, how institutional dynamics affect information passing, and what implications this has for a TLDR system. This work describes our involvement in and work completed during SCADS. Although preliminary, we identify that information passing is both a formal and informal process and often follows professional networks due especially to the small population and specialization of work. We call attention to the need for future analysis of information ecosystems to better support tailored information retrieval features.</p>
	]]></content:encoded>

	<dc:title>Preliminary Perspectives on Information Passing in the Intelligence Community</dc:title>
			<dc:creator>Jeremy E. Block</dc:creator>
			<dc:creator>Ilana Bookner</dc:creator>
			<dc:creator>Sharon Lynn Chu</dc:creator>
			<dc:creator>R. Jordan Crouser</dc:creator>
			<dc:creator>Donald R. Honeycutt</dc:creator>
			<dc:creator>Rebecca M. Jonas</dc:creator>
			<dc:creator>Abhishek Kulkarni</dc:creator>
			<dc:creator>Yancy Vance Paredes</dc:creator>
			<dc:creator>Eric D. Ragan</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2020028</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-06-15</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-06-15</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>509</prism:startingPage>
		<prism:doi>10.3390/analytics2020028</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/2/28</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/2/27">

	<title>Analytics, Vol. 2, Pages 485-508: Spatiotemporal Data Mining Problems and Methods</title>
	<link>https://www.mdpi.com/2813-2203/2/2/27</link>
	<description>Many scientific fields show great interest in the extraction and processing of spatiotemporal data, such as medicine with an emphasis on epidemiology and neurology, geology, social sciences, meteorology, and a great interest is also observed in the study of transport. Spatiotemporal data differ significantly from spatial data, since spatiotemporal data refer to measurements, which take into account both the place and the time in which they are received, with their respective characteristics, while spatial data refer to and describe information related only to place. The innovation brought about by spatiotemporal data mining has caused a revolution in many scientific fields, and this is because through it we can now provide solutions and answers to complex problems, as well as provide useful and valuable predictions, through predictive learning. However, combining time and place in data mining presents significant challenges and difficulties that must be overcome. Spatiotemporal data mining and analysis is a relatively new approach to data mining which has been studied more systematically in the last decade. The purpose of this article is to provide a good introduction to spatiotemporal data, and through this detailed description, we attempt to introduce descriptive logic and gain a complete knowledge of these data. We aim to introduce a new way of describing them, aiming for future studies, by combining the expressions that arise by type of data, using descriptive logic, with new expressions, that can be derived, to describe future states of objects and environments with great precision, providing accurate predictions. In order to highlight the value of spatiotemporal data, we proceed to give a brief description of ST data in the introduction. 
We describe the relevant work carried out to date, the types of spatiotemporal (ST) data, their properties and the transformations that can be made between them, attempting, to a small extent, to introduce constraints and rules using descriptive logic, introducing descriptive logic into spatiotemporal data by type, when initially presenting the ST data. The data snapshots by species and similarities between the cases are then described. We describe methods, introducing clustering, dynamic ST clusters, predictive learning, pattern mining frequency, and pattern emergence, and problems such as anomaly detection, identifying time points of changes in the behavior of the observed object, and development of relationships between them. We describe the application of ST data in various fields today, as well as the future work. We finally conclude with our conclusions, with the representation and study of spatiotemporal data can, in combination with other properties which accompany all natural phenomena, through their appropriate processing, lead to safe conclusions regarding the study of problems, and also with great precision in the extraction of predictions by accurately determining future states of an environment or an object. Thus, the importance of ST data makes them particularly valuable today in various scientific fields, and their extraction is a particularly demanding challenge for the future.</description>
	<pubDate>2023-06-14</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 485-508: Spatiotemporal Data Mining Problems and Methods</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/2/27">doi: 10.3390/analytics2020027</a></p>
	<p>Authors:
		Eleftheria Koutsaki
		George Vardakis
		Nikolaos Papadakis
		</p>
	<p>Many scientific fields show great interest in the extraction and processing of spatiotemporal data, such as medicine with an emphasis on epidemiology and neurology, geology, social sciences, meteorology, and a great interest is also observed in the study of transport. Spatiotemporal data differ significantly from spatial data, since spatiotemporal data refer to measurements, which take into account both the place and the time in which they are received, with their respective characteristics, while spatial data refer to and describe information related only to place. The innovation brought about by spatiotemporal data mining has caused a revolution in many scientific fields, and this is because through it we can now provide solutions and answers to complex problems, as well as provide useful and valuable predictions, through predictive learning. However, combining time and place in data mining presents significant challenges and difficulties that must be overcome. Spatiotemporal data mining and analysis is a relatively new approach to data mining which has been studied more systematically in the last decade. The purpose of this article is to provide a good introduction to spatiotemporal data, and through this detailed description, we attempt to introduce descriptive logic and gain a complete knowledge of these data. We aim to introduce a new way of describing them, aiming for future studies, by combining the expressions that arise by type of data, using descriptive logic, with new expressions, that can be derived, to describe future states of objects and environments with great precision, providing accurate predictions. In order to highlight the value of spatiotemporal data, we proceed to give a brief description of ST data in the introduction. 
We describe the relevant work carried out to date, the types of spatiotemporal (ST) data, their properties and the transformations that can be made between them, attempting, to a small extent, to introduce constraints and rules using descriptive logic, introducing descriptive logic into spatiotemporal data by type, when initially presenting the ST data. The data snapshots by species and similarities between the cases are then described. We describe methods, introducing clustering, dynamic ST clusters, predictive learning, pattern mining frequency, and pattern emergence, and problems such as anomaly detection, identifying time points of changes in the behavior of the observed object, and development of relationships between them. We describe the application of ST data in various fields today, as well as the future work. We finally conclude that the representation and study of spatiotemporal data can, in combination with other properties which accompany all natural phenomena, through their appropriate processing, lead to safe conclusions regarding the study of problems, and also with great precision in the extraction of predictions by accurately determining future states of an environment or an object. Thus, the importance of ST data makes them particularly valuable today in various scientific fields, and their extraction is a particularly demanding challenge for the future.</p>
	]]></content:encoded>

	<dc:title>Spatiotemporal Data Mining Problems and Methods</dc:title>
			<dc:creator>Eleftheria Koutsaki</dc:creator>
			<dc:creator>George Vardakis</dc:creator>
			<dc:creator>Nikolaos Papadakis</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2020027</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-06-14</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-06-14</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Review</prism:section>
	<prism:startingPage>485</prism:startingPage>
		<prism:doi>10.3390/analytics2020027</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/2/27</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/2/26">

	<title>Analytics, Vol. 2, Pages 463-484: A Novel Zero-Truncated Katz Distribution by the Lagrange Expansion of the Second Kind with Associated Inferences</title>
	<link>https://www.mdpi.com/2813-2203/2/2/26</link>
	<description>In this article, the Lagrange expansion of the second kind is used to generate a novel zero-truncated Katz distribution; we refer to it as the Lagrangian zero-truncated Katz distribution (LZTKD). Notably, the zero-truncated Katz distribution is a special case of this distribution. Along with the closed form expression of all its statistical characteristics, the LZTKD is proven to provide an adequate model for both underdispersed and overdispersed zero-truncated count datasets. Specifically, we show that the associated hazard rate function has increasing, decreasing, bathtub, or upside-down bathtub shapes. Moreover, we demonstrate that the LZTKD belongs to the Lagrangian distribution of the first kind. Then, applications of the LZTKD in statistical scenarios are explored. The unknown parameters are estimated using the well-reputed method of the maximum likelihood. In addition, the generalized likelihood ratio test procedure is applied to test the significance of the additional parameter. In order to evaluate the performance of the maximum likelihood estimates, simulation studies are also conducted. The use of real-life datasets further highlights the relevance and applicability of the proposed model.</description>
	<pubDate>2023-06-01</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 463-484: A Novel Zero-Truncated Katz Distribution by the Lagrange Expansion of the Second Kind with Associated Inferences</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/2/26">doi: 10.3390/analytics2020026</a></p>
	<p>Authors:
		Damodaran Santhamani Shibu
		Christophe Chesneau
		Mohanan Monisha
		Radhakumari Maya
		Muhammed Rasheed Irshad
		</p>
	<p>In this article, the Lagrange expansion of the second kind is used to generate a novel zero-truncated Katz distribution; we refer to it as the Lagrangian zero-truncated Katz distribution (LZTKD). Notably, the zero-truncated Katz distribution is a special case of this distribution. Along with the closed form expression of all its statistical characteristics, the LZTKD is proven to provide an adequate model for both underdispersed and overdispersed zero-truncated count datasets. Specifically, we show that the associated hazard rate function has increasing, decreasing, bathtub, or upside-down bathtub shapes. Moreover, we demonstrate that the LZTKD belongs to the Lagrangian distribution of the first kind. Then, applications of the LZTKD in statistical scenarios are explored. The unknown parameters are estimated using the well-reputed method of the maximum likelihood. In addition, the generalized likelihood ratio test procedure is applied to test the significance of the additional parameter. In order to evaluate the performance of the maximum likelihood estimates, simulation studies are also conducted. The use of real-life datasets further highlights the relevance and applicability of the proposed model.</p>
	]]></content:encoded>

	<dc:title>A Novel Zero-Truncated Katz Distribution by the Lagrange Expansion of the Second Kind with Associated Inferences</dc:title>
			<dc:creator>Damodaran Santhamani Shibu</dc:creator>
			<dc:creator>Christophe Chesneau</dc:creator>
			<dc:creator>Mohanan Monisha</dc:creator>
			<dc:creator>Radhakumari Maya</dc:creator>
			<dc:creator>Muhammed Rasheed Irshad</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2020026</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-06-01</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-06-01</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>463</prism:startingPage>
		<prism:doi>10.3390/analytics2020026</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/2/26</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
        <item rdf:about="https://www.mdpi.com/2813-2203/2/2/25">

	<title>Analytics, Vol. 2, Pages 438-462: Generalized Unit Half-Logistic Geometric Distribution: Properties and Regression with Applications to Insurance</title>
	<link>https://www.mdpi.com/2813-2203/2/2/25</link>
	<description>The use of distributions to model and quantify risk is essential in risk assessment and management. In this study, the generalized unit half-logistic geometric (GUHLG) distribution is developed to model bounded insurance data on the unit interval. The corresponding probability density function plots indicate that the related distribution can handle data that exhibit left-skewed, right-skewed, symmetric, reversed-J, and bathtub shapes. The hazard rate function also suggests that the distribution can be applied to analyze data with bathtubs, N-shapes, and increasing failure rates. Subsequently, the inferential aspects of the proposed model are investigated. In particular, Monte Carlo simulation exercises are carried out to examine the performance of the estimation method by using an algorithm to generate random observations from the quantile function. The results of the simulation suggest that the considered estimation method is efficient. The univariate application of the distribution and the multivariate application of the associated regression using risk survey data reveal that the model provides a better fit than the other existing distributions and regression models. Under the multivariate application, we estimate the parameters of the regression model using both maximum likelihood and Bayesian estimations. The estimates of the parameters for the two methods are very close. Diagnostic plots of the Bayesian method using the trace, ergodic, and autocorrelation plots reveal that the chains converge to a stationary distribution.</description>
	<pubDate>2023-05-16</pubDate>

	<content:encoded><![CDATA[
	<p><b>Analytics, Vol. 2, Pages 438-462: Generalized Unit Half-Logistic Geometric Distribution: Properties and Regression with Applications to Insurance</b></p>
	<p>Analytics <a href="https://www.mdpi.com/2813-2203/2/2/25">doi: 10.3390/analytics2020025</a></p>
	<p>Authors:
		Suleman Nasiru
		Christophe Chesneau
		Abdul Ghaniyyu Abubakari
		Irene Dekomwine Angbing
		</p>
	<p>The use of distributions to model and quantify risk is essential in risk assessment and management. In this study, the generalized unit half-logistic geometric (GUHLG) distribution is developed to model bounded insurance data on the unit interval. The corresponding probability density function plots indicate that the related distribution can handle data that exhibit left-skewed, right-skewed, symmetric, reversed-J, and bathtub shapes. The hazard rate function also suggests that the distribution can be applied to analyze data with bathtubs, N-shapes, and increasing failure rates. Subsequently, the inferential aspects of the proposed model are investigated. In particular, Monte Carlo simulation exercises are carried out to examine the performance of the estimation method by using an algorithm to generate random observations from the quantile function. The results of the simulation suggest that the considered estimation method is efficient. The univariate application of the distribution and the multivariate application of the associated regression using risk survey data reveal that the model provides a better fit than the other existing distributions and regression models. Under the multivariate application, we estimate the parameters of the regression model using both maximum likelihood and Bayesian estimations. The estimates of the parameters for the two methods are very close. Diagnostic plots of the Bayesian method using the trace, ergodic, and autocorrelation plots reveal that the chains converge to a stationary distribution.</p>
	]]></content:encoded>

	<dc:title>Generalized Unit Half-Logistic Geometric Distribution: Properties and Regression with Applications to Insurance</dc:title>
			<dc:creator>Suleman Nasiru</dc:creator>
			<dc:creator>Christophe Chesneau</dc:creator>
			<dc:creator>Abdul Ghaniyyu Abubakari</dc:creator>
			<dc:creator>Irene Dekomwine Angbing</dc:creator>
		<dc:identifier>doi: 10.3390/analytics2020025</dc:identifier>
	<dc:source>Analytics</dc:source>
	<dc:date>2023-05-16</dc:date>

	<prism:publicationName>Analytics</prism:publicationName>
	<prism:publicationDate>2023-05-16</prism:publicationDate>
	<prism:volume>2</prism:volume>
	<prism:number>2</prism:number>
	<prism:section>Article</prism:section>
	<prism:startingPage>438</prism:startingPage>
		<prism:doi>10.3390/analytics2020025</prism:doi>
	<prism:url>https://www.mdpi.com/2813-2203/2/2/25</prism:url>
	
	<cc:license rdf:resource="https://creativecommons.org/licenses/by/4.0/"/>
</item>
    
<cc:License rdf:about="https://creativecommons.org/licenses/by/4.0/">
	<cc:permits rdf:resource="https://creativecommons.org/ns#Reproduction" />
	<cc:permits rdf:resource="https://creativecommons.org/ns#Distribution" />
	<cc:permits rdf:resource="https://creativecommons.org/ns#DerivativeWorks" />
</cc:License>

</rdf:RDF>
