Data Science and Big Data in Biology, Physical Science and Engineering—2nd Edition

A special issue of Technologies (ISSN 2227-7080). This special issue belongs to the section "Information and Communication Technologies".

Deadline for manuscript submissions: closed (30 September 2025) | Viewed by 42020

Special Issue Editor


Guest Editor
Department of Mathematics and Computer Science, School of Applied Sciences, Dickinson State University, 291 Campus Drive, Dickinson, ND 58601, USA
Interests: data science; big data; machine learning; deep learning; artificial intelligence (AI); cybersecurity

Special Issue Information

Dear Colleagues,

Big data analysis is one of the most important current areas of research and development. Tremendous amounts of data are generated every day by digital technologies and modern information systems, such as cloud computing and Internet of Things (IoT) devices. Analysing these data has become crucial, and considerable effort is required to extract knowledge that is valuable for decision-making and that, in turn, contributes to both academia and industry.

Big data and data science have emerged from the pressing need to generate, store, organise, and process immense amounts of data. Data scientists use artificial intelligence (AI) and machine learning (ML) approaches and models to enable computers to identify what the data represent and to detect patterns more quickly, efficiently, and reliably than humans can.

The goal of this Special Issue is to explore and discuss principles, tools, and models in data science, together with the varied concepts and techniques relating to big data in biology, chemistry, biomedical engineering, physics, mathematics, and other areas that work with big data.

Related Special Issue: “Data Science and Big Data in Biology, Physical Science and Engineering”

https://www.mdpi.com/journal/technologies/special_issues/Data_Science_Biology

Dr. Mohammed Mahmoud
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Technologies is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data science
  • big data
  • machine learning
  • artificial intelligence

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad-scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.



Published Papers (9 papers)


Research

18 pages, 466 KB  
Article
A Novel Dataset for Early Cardiovascular Risk Detection in School Children Using Machine Learning
by Rafael Alejandro Olivera Solís, Emilio Francisco González Rodríguez, Roberto Castañeda Sheissa, Juan Valentín Lorenzo-Ginori and José García
Technologies 2025, 13(6), 222; https://doi.org/10.3390/technologies13060222 - 29 May 2025
Viewed by 2125
Abstract
This study introduces the PROCDEC dataset, a novel collection of 1140 cases with 30 cardiovascular risk factors gathered over a 10-year period from school children in Santa Clara, Cuba. The dataset was curated with input from medical experts in pediatric cardiology, endocrinology, general medicine, and clinical laboratory, ensuring its clinical relevance. We conducted a rigorous performance evaluation of 10 machine learning (ML) algorithms to classify cardiovascular risk into two categories: at risk and not at risk. The models were assessed using a stratified k-fold cross-validation approach to enhance the reliability of the findings. Among the evaluated models—Bayes Net, Naive Bayes, SMO, K-Nearest Neighbors (KNN), Logistic Regression, AdaBoost, Multilayer Perceptron (MLP), J48, Logistic Model Tree (LMT), and Random Forest (RF)—the best-performing classifiers (MLP, LMT, J48 and Logistic Regression) achieved F1-score values exceeding 0.83, indicating strong predictive capability. To improve interpretability, we employed feature selection techniques to rank the most influential risk factors. Key contributors to classification performance included hypertension, hyperreactivity, body mass index (BMI), uric acid, cholesterol, parental hypertension, and sibling dyslipidemia. These findings align with established clinical knowledge and reinforce the potential of ML models for pediatric cardiovascular risk assessment. Unlike previous studies, our research not only evaluates multiple ML techniques but also emphasizes their clinical applicability and interpretability, which are critical for real-world implementation. Future work will focus on validating these models with external datasets and integrating them into decision-support systems for early risk detection. Full article
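As a rough illustration of the evaluation protocol named in this abstract (stratified k-fold cross-validation scored with F1), the following Python sketch uses only the standard library. The function names, the round-robin fold assignment, and the toy labels are assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that roughly preserve class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 = "at risk", 0 = "not at risk".
labels = [1] * 10 + [0] * 20
folds = stratified_folds(labels, k=5)
```

In a full evaluation, each fold would serve once as the test set while a classifier is trained on the remaining folds, and the per-fold F1 values would then be averaged.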

40 pages, 4296 KB  
Article
Comprehensive Analysis of Random Forest and XGBoost Performance with SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels
by Mehdi Imani, Ali Beikmohammadi and Hamid Reza Arabnia
Technologies 2025, 13(3), 88; https://doi.org/10.3390/technologies13030088 - 20 Feb 2025
Cited by 47 | Viewed by 17474
Abstract
This study examines the efficacy of Random Forest and XGBoost classifiers in conjunction with three upsampling techniques—SMOTE, ADASYN, and Gaussian noise upsampling (GNUS)—across datasets with varying class imbalance levels, ranging from moderate to extreme (15% to 1% churn rate). Employing metrics such as F1 score, ROC AUC, PR AUC, Matthews Correlation Coefficient (MCC), and Cohen’s Kappa, this research provides a comprehensive evaluation of classifier performance under different imbalance scenarios, focusing on applications in the telecommunications domain. The findings highlight that tuned XGBoost paired with SMOTE (Tuned_XGB_SMOTE) consistently achieves the highest F1 score and robust performance across all imbalance levels. SMOTE emerged as the most effective upsampling method, particularly when used with XGBoost, whereas Random Forest performed poorly under severe imbalance. ADASYN showed moderate effectiveness with XGBoost but underperformed with Random Forest, and GNUS produced inconsistent results. This study underscores the impact of data imbalance, with MCC, Kappa, and F1 scores fluctuating significantly, whereas ROC AUC and PR AUC remained relatively stable. Moreover, rigorous statistical analyses employing the Friedman test and Nemenyi post hoc comparisons confirmed that the observed improvements in F1 score, PR-AUC, Kappa, and MCC were statistically significant (p < 0.05), with Tuned_XGB_SMOTE significantly outperforming Tuned_RF_GNUS. While differences in ROC-AUC were not significant, the consistency of these results across multiple performance metrics underscores the reliability of our framework, offering a statistically validated and attractive solution for model selection in imbalanced classification scenarios. Full article
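The core idea behind SMOTE, as it is generally described, is to create synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows that step in plain Python. It is a simplified illustration, not the authors' pipeline and not a library implementation; the function name `smote_like`, the parameter defaults, and the toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority-class samples (for example, churned customers).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies, which is the property that distinguishes this approach from adding unconstrained noise.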

28 pages, 6569 KB  
Article
A New Efficient Hybrid Technique for Human Action Recognition Using 2D Conv-RBM and LSTM with Optimized Frame Selection
by Majid Joudaki, Mehdi Imani and Hamid R. Arabnia
Technologies 2025, 13(2), 53; https://doi.org/10.3390/technologies13020053 - 1 Feb 2025
Cited by 6 | Viewed by 3202
Abstract
Recognizing human actions through video analysis has gained significant attention in applications like surveillance, sports analytics, and human–computer interaction. While deep learning models such as 3D convolutional neural networks (CNNs) and recurrent neural networks (RNNs) deliver promising results, they often struggle with computational inefficiencies and inadequate spatial–temporal feature extraction, hindering scalability to larger datasets or high-resolution videos. To address these limitations, we propose a novel model combining a two-dimensional convolutional restricted Boltzmann machine (2D Conv-RBM) with a long short-term memory (LSTM) network. The 2D Conv-RBM efficiently extracts spatial features such as edges, textures, and motion patterns while preserving spatial relationships and reducing parameters via weight sharing. These features are subsequently processed by the LSTM to capture temporal dependencies across frames, enabling effective recognition of both short- and long-term action patterns. Additionally, a smart frame selection mechanism minimizes frame redundancy, significantly lowering computational costs without compromising accuracy. Evaluation on the KTH, UCF Sports, and HMDB51 datasets demonstrated superior performance, achieving accuracies of 97.3%, 94.8%, and 81.5%, respectively. Compared to traditional approaches like 2D RBM and 3D CNN, our method offers notable improvements in both accuracy and computational efficiency, presenting a scalable solution for real-time applications in surveillance, video security, and sports analytics. Full article

13 pages, 708 KB  
Article
Enhancing Decision-Making and Data Management in Healthcare: A Hybrid Ensemble Learning and Blockchain Approach
by Geetanjali Rathee and Razi Iqbal
Technologies 2025, 13(2), 43; https://doi.org/10.3390/technologies13020043 - 23 Jan 2025
Cited by 1 | Viewed by 1842
Abstract
Currently, big data is considered one of the most significant areas of research and development. The advancement in technologies, along with the involvement of intelligent and automated devices in each field of development, leads to the generation, analysis, and recording of huge volumes of information in the network. Although a number of schemes have been proposed for accurate decision-making while analyzing the records, the existing methods lead to massive delays and difficulty in managing stored information. Furthermore, the excessive delays in information processing pose a critical challenge to making accurate decisions in the context of big data. The aim of this paper is to propose an effective approach for accurate decision-making and analysis of the vast volumes of data generated by intelligent devices in the healthcare sector. The processed and managed records can be stored and accessed in a systematic and efficient manner. The proposed mechanism uses a hybrid of ensemble learning and blockchain for fast and continuous recording and surveillance of information. The recorded information is analyzed using several existing methods focusing on various measurement outcomes. The results of the proposed technique are compared with existing techniques through various experiments that demonstrate the efficiency and accuracy of this technique. Full article

14 pages, 2385 KB  
Article
Analysis of Multidimensional Clinical and Physiological Data with Synolitical Graph Neural Networks
by Mikhail Krivonosov, Tatiana Nazarenko, Vadim Ushakov, Daniil Vlasenko, Denis Zakharov, Shangbin Chen, Oleg Blyus and Alexey Zaikin
Technologies 2025, 13(1), 13; https://doi.org/10.3390/technologies13010013 - 28 Dec 2024
Cited by 1 | Viewed by 2389
Abstract
This paper introduces a novel approach for classifying multidimensional physiological and clinical data using Synolitic Graph Neural Networks (SGNNs). SGNNs are well suited to addressing the challenges posed by high-dimensional datasets, especially in healthcare, where traditional machine learning and Artificial Intelligence methods often struggle to find global optima due to the “curse of dimensionality”. To apply Geometric Deep Learning, we propose a synolitic or ensemble graph representation of the data, a universal method that transforms any multidimensional dataset into a network, utilising only class labels from training data. The paper demonstrates the effectiveness of this approach through two classification tasks: synthetic and fMRI data from cognitive tasks. A Convolutional Graph Neural Network architecture is then applied, and the results are compared with established machine learning algorithms. The findings highlight the robustness and interpretability of SGNNs in solving complex, high-dimensional classification problems. Full article

16 pages, 4393 KB  
Article
A Field-Programmable Gate Array-Based Quasi-Cyclic Low-Density Parity-Check Decoder with High Throughput and Excellent Decoding Performance for 5G New-Radio Standards
by Bilal Mejmaa, Ismail Akharraz and Abdelaziz Ahaitouf
Technologies 2024, 12(11), 215; https://doi.org/10.3390/technologies12110215 - 31 Oct 2024
Cited by 1 | Viewed by 2845
Abstract
This work presents a novel fully parallel decoder architecture designed for high-throughput decoding of Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes within the context of 5G New-Radio (NR) communication. The design uses the layered Min-Sum (MS) algorithm and focuses on increasing throughput to meet the strict needs of enhanced Mobile BroadBand (eMBB) applications. We incorporated a Sub-Optimal Low-Latency (SOLL) technique to enhance the critical check node processing stage inherent to the MS algorithm. This technique efficiently computes the two minimum values, rendering the architecture well-suited for specific Ultra-Reliable Low-Latency Communication (URLLC) scenarios. We design the decoder to be reconfigurable, enabling efficient operation across all expansion factors. We rigorously validate the decoder’s effectiveness through meticulous bit-error-rate (BER) performance evaluations using Hardware Description Language (HDL) co-simulation. This co-simulation utilizes a well-established suite of tools encompassing MATLAB/Simulink for system modeling and Vivado, a prominent FPGA design suite, for hardware representation. With 380,737 Look-Up Tables (LUTs) and 32,898 registers, the decoder’s implementation on a Virtex-7 XC7VX980T FPGA platform by AMD/Xilinx shows good hardware utilization. The architecture attains a robust operating frequency of 304.5 MHz and a normalized throughput of 49.5 Gbps, marking a 36% enhancement compared to the state-of-the-art. This advancement propels decoding capabilities to meet the demands of high-speed data processing. Full article
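The abstract notes that check-node processing in the Min-Sum algorithm comes down to computing two minimum values. The Python sketch below shows why two minima suffice in the textbook, unnormalised Min-Sum check-node update. It does not reproduce the paper's SOLL technique, its layered schedule, or its hardware architecture; the function name and the sample log-likelihood ratios (LLRs) are assumptions for illustration.

```python
def check_node_update(llrs):
    """Min-Sum check-node update for the incoming LLRs of one parity check.

    Each outgoing message takes the product of the signs of all *other*
    inputs and the minimum magnitude among all *other* inputs. Tracking only
    the two smallest magnitudes is enough to produce every output.
    """
    min1 = min2 = float("inf")
    min1_index = -1
    sign_product = 1
    for i, value in enumerate(llrs):
        magnitude = abs(value)
        if magnitude < min1:
            min1, min2, min1_index = magnitude, min1, i
        elif magnitude < min2:
            min2 = magnitude
        if value < 0:
            sign_product = -sign_product

    outputs = []
    for i, value in enumerate(llrs):
        own_sign = -1 if value < 0 else 1
        # The edge holding the smallest magnitude must use the second smallest.
        magnitude = min2 if i == min1_index else min1
        # Multiplying by own_sign removes this edge's sign from the product.
        outputs.append(sign_product * own_sign * magnitude)
    return outputs


# Hypothetical incoming messages on four edges of one check node.
messages = check_node_update([-1.5, 2.0, 0.5, -3.0])
# messages == [-0.5, 0.5, 1.5, -0.5]
```

Only the edge that supplied the smallest magnitude receives the second-smallest value, so a single pass over the inputs replaces a separate minimum search per edge.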

22 pages, 2370 KB  
Article
A Hierarchical Machine Learning Method for Detection and Visualization of Network Intrusions from Big Data
by Jinrong Wu, Su Nguyen, Thimal Kempitiya and Damminda Alahakoon
Technologies 2024, 12(10), 204; https://doi.org/10.3390/technologies12100204 - 17 Oct 2024
Viewed by 3033
Abstract
Machine learning is regarded as an effective approach in network intrusion detection, and has gained significant attention in recent studies. However, few intrusion detection methods have been successfully applied to detect anomalies in large-scale network traffic data, and low explainability of the complex algorithms has caused concerns about fairness and accountability. A further problem is that many intrusion detection systems need to work with distributed data sources in the cloud. In this paper, we propose an intrusion detection method based on distributed computing to learn the latent representations from large-scale network data with lower computation time while improving the intrusion detection accuracy. Our proposed classifier, based on a novel hierarchical algorithm combining adaptability and visualization ability from a self-structured unsupervised learning algorithm and achieving explainability from self-explainable supervised algorithms, is able to enhance the understanding of the model and data. The experimental results show that our proposed method is effective, efficient, and scalable in capturing the network traffic patterns and detecting detailed network intrusion information such as type of attack with high detection performance, and is an ideal method to be applied in cloud-computing environments. Full article

18 pages, 2515 KB  
Article
Discovering Data Domains and Products in Data Meshes Using Semantic Blueprints
by Michalis Pingos and Andreas S. Andreou
Technologies 2024, 12(7), 105; https://doi.org/10.3390/technologies12070105 - 7 Jul 2024
Cited by 1 | Viewed by 2556
Abstract
Nowadays, one of the greatest challenges in data meshes revolves around detecting and creating data domains and data products for providing the ability to adapt easily and quickly to changing business needs. This requires a disciplined approach to identify, differentiate and prioritize distinct data sources according to their content and diversity. The current paper tackles this highly complicated issue and suggests a standardized approach that integrates the concept of data blueprints with data meshes. In essence, a novel standardization framework is proposed that creates data products using a metadata semantic enrichment mechanism, the latter also offering data domain readiness and alignment. The approach is demonstrated using real-world data produced by multiple sources in a poultry meat production factory. A set of functional attributes is used to qualitatively compare the proposed approach to existing data structures utilized in storage architectures, with quite promising results. Finally, experimentation with different scenarios varying in data product complexity and granularity suggests a successful performance. Full article

14 pages, 5243 KB  
Article
Neural Network-Based Body Weight Prediction in Pelibuey Sheep through Biometric Measurements
by Alfonso J. Chay-Canul, Enrique Camacho-Pérez, Fernando Casanova-Lugo, Omar Rodríguez-Abreo, Mayra Cruz-Fernández and Juvenal Rodríguez-Reséndiz
Technologies 2024, 12(5), 59; https://doi.org/10.3390/technologies12050059 - 30 Apr 2024
Cited by 6 | Viewed by 2990
Abstract
This paper presents an intelligent system for the dynamic estimation of sheep body weight (BW). The methodology used to estimate body weight is based on measuring seven biometric parameters: height at withers, rump height, body length, body diagonal length, total body length, semicircumference of the abdomen, and semicircumference of the girth. A biometric parameter acquisition system was developed using a Kinect as a sensor. The results were contrasted with measurements obtained manually with a flexometer. The comparison gives an average root mean square error (RMSE) of 9.91 and a mean R² of 0.81. Subsequently, the parameters were used as input in a back-propagation artificial neural network. Performance tests were performed with different combinations to make the best choice of architecture. In this way, an intelligent body weight estimation system was obtained from biometric parameters, with a 5.8% RMSE in the weight estimations for the best architecture. This approach represents an innovative, feasible, and economical alternative to contribute to decision-making in livestock production systems. Full article
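For readers unfamiliar with the error measures quoted in this abstract, the following Python sketch computes RMSE and the coefficient of determination from their standard definitions. The percentage form shown here (RMSE divided by the mean observed value) is one common normalisation and is an assumption; the abstract does not state how its percentage was obtained. The weights in the example are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt(mean((actual - predicted)^2))."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def relative_rmse_percent(actual, predicted):
    """RMSE as a percentage of the mean observed value (assumed normalisation)."""
    return 100 * rmse(actual, predicted) / (sum(actual) / len(actual))


# Hypothetical body weights in kg: scale readings versus model estimates.
measured = [40.0, 45.0, 50.0, 55.0]
estimated = [42.0, 44.0, 49.0, 57.0]

error = rmse(measured, estimated)
fit = r_squared(measured, estimated)
error_pct = relative_rmse_percent(measured, estimated)
```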