AI, Machine Learning and Deep Learning in Signal Processing, 2nd Edition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (25 March 2025) | Viewed by 28992

Special Issue Editors


Guest Editor
Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, 30170 Venice, Italy
Interests: computer vision; 3D reconstruction; machine learning; deep learning

Special Issue Information

Dear Colleagues,

In recent years, the field of signal processing has faced new challenges and paradigm shifts, driven by dramatic improvements in hardware performance and the exponential growth in the number of devices interconnected via the Internet. As a consequence, the tremendous data volumes generated by such applications have to be analyzed and processed to provide useful, reliable and meaningful information.

Artificial intelligence (AI), and in particular machine (deep) learning, provides novel tools to be exploited in the field of signal processing. Consequently, new approaches, methods, theories, and tools have to be developed by the signal processing community to analyze and make sense of these data volumes.

This Special Issue aims to attract manuscripts presenting novel methods and innovative applications of AI and machine learning (including deep learning) on topics in the signal processing area. Such topics include (but are not limited to) multimedia systems, audio and video processing, and augmented and virtual reality. The objective of the Special Issue is to bring together recent high-quality works in AI to promote key advances in the signal processing areas covered by the journal and to provide reviews of the state of the art in these emerging domains.

Dr. Mara Pistellato
Prof. Dr. Byung-Gyu Kim
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, authors can proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers are published continuously in the journal (as soon as accepted) and listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence (AI)
  • deep learning
  • machine learning
  • signal processing
  • image and video processing
  • audio and acoustic signal processing
  • biomedical signal processing
  • speech processing
  • multimedia signal processing
  • multidimensional signal processing
  • augmented reality
  • virtual reality

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on MDPI's website.

Published Papers (16 papers)


Research


18 pages, 1051 KiB  
Article
A Lightweight Received Signal Strength Indicator Estimation Model for Low-Power Internet of Things Devices in Constrained Indoor Networks
by Samrah Arif, M. Arif Khan and Sabih ur Rehman
Appl. Sci. 2025, 15(7), 3535; https://doi.org/10.3390/app15073535 - 24 Mar 2025
Viewed by 262
Abstract
The Internet of Things (IoT) is a revolutionary advancement that automates daily tasks through interaction between the digital and physical realms via a network composed mostly of Low-Power IoT (LP-IoT) devices. For an IoT ecosystem, reliable wireless connectivity is essential to ensure the optimal operation of LP-IoT devices, especially considering their limited resource capacity. This reliability is often achieved through channel estimation, an essential aspect of optimising signal transmission. Considering the importance of reliable channel estimation for constrained IoT devices, we developed two lightweight yet effective channel estimation models based on the Random Forest Regressor (RFR). These two models, termed the Feature-based RFR(F) and Sequence-based RFR(S) methods, utilise the Received Signal Strength Indicator (RSSI) as a fundamental channel metric to enhance the efficiency and reliability of channel estimation in constrained LP-IoT devices. The models' performance was assessed by comparing them with the state of the art and with our previously developed Artificial Neural Network (ANN)-based method. The experimental results show that the RFR(F) method achieves approximately a 39.62% improvement in Mean Squared Error (MSE) over the Feature-based ANN(F) model and a 37.86% improvement over the state of the art. Similarly, the RFR(S) model shows an improvement in MSE of 24.9% compared to the Sequence-based ANN(S) model and an 80.59% improvement compared to the leading existing methods. We also evaluated the lightweight characteristics of our RFR(F) and RFR(S) methods by deploying them on a Raspberry Pi 4 Model B to demonstrate their practicality for LP-IoT devices.
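
As a rough illustration of the regression setup this abstract describes, the sketch below fits a small Random Forest to predict RSSI from hand-picked channel features. The features, the synthetic path-loss data, and all parameters are illustrative assumptions, not the authors' dataset or configuration.

```python
# Minimal sketch of a feature-based Random Forest RSSI estimator.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: distance (m), transmit power (dBm), obstruction count.
X = np.column_stack([
    rng.uniform(1, 50, n),
    rng.choice([0, 4, 8], n),
    rng.integers(0, 5, n),
])
# Synthetic RSSI from a log-distance path-loss model plus noise.
y = X[:, 1] - 40 - 20 * np.log10(X[:, 0]) - 3 * X[:, 2] + rng.normal(0, 2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# A shallow forest keeps the model small enough for constrained devices.
model = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
model.fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, model.predict(X_te)))
```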

18 pages, 992 KiB  
Article
Baby Cry Classification Using Structure-Tuned Artificial Neural Networks with Data Augmentation and MFCC Features
by Tayyip Ozcan and Hafize Gungor
Appl. Sci. 2025, 15(5), 2648; https://doi.org/10.3390/app15052648 - 1 Mar 2025
Viewed by 946
Abstract
Babies express their needs, such as hunger, discomfort, or sleeplessness, by crying. However, interpreting these cries correctly can be challenging for parents. Misinterpretation can delay the response to the baby's needs, increase parents' stress levels, and negatively affect the baby's development. In this paper, an integrated system for the classification of baby sounds is proposed. The proposed method includes data augmentation, feature extraction, hyperparameter tuning, and model training steps. In the first step, various data augmentation techniques were applied to increase the training data's diversity and strengthen the model's generalization capacity. In the second step, the Mel-Frequency Cepstral Coefficients (MFCC) method was used to extract meaningful and distinctive features from the sound data. MFCC represents sound signals based on the frequencies the human ear perceives and provides a strong basis for classification. The obtained features were classified with an artificial neural network (ANN) model with optimized hyperparameters. The hyperparameter optimization of the model was performed using the grid search algorithm, and the most appropriate parameters were determined. The training, validation, and test sets were split at ratios of 75%, 10%, and 15%, respectively. The model's performance was tested on mixed sounds. The test results were analyzed, and the proposed method showed the highest performance, with a 90% accuracy rate. In a comparison study with an artificial neural network (ANN) on the Donate a Cry dataset, the F1 score was reported as 46.99% and the test accuracy as 85.93%. In this paper, additional techniques such as data augmentation, hyperparameter tuning, and MFCC feature extraction allowed the model accuracy to reach 90%. The proposed method offers an effective solution for classifying baby sounds and brings a new approach to this field.
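
The pipeline lends itself to a compact sketch: extract MFCC vectors, then grid-search an ANN over a small hyperparameter grid. Everything below (the synthetic audio, grid values, and network sizes) is assumed for illustration and does not reproduce the paper's augmentation or tuning.

```python
# Illustrative MFCC + grid-searched MLP classification pipeline.
import numpy as np
import librosa
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def mfcc_vector(y, sr=16000, n_mfcc=13):
    # Mean MFCCs over time give one fixed-length vector per clip.
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return m.mean(axis=1)

rng = np.random.default_rng(0)
# Two fake "cry classes": tones at different fundamentals plus noise.
clips, labels = [], []
for label, f0 in [(0, 300.0), (1, 500.0)]:
    for _ in range(20):
        t = np.linspace(0, 1.0, 16000, endpoint=False)
        y = np.sin(2 * np.pi * f0 * t) + 0.3 * rng.normal(size=t.size)
        clips.append(mfcc_vector(y))
        labels.append(label)

X, y = np.array(clips), np.array(labels)
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    {"hidden_layer_sizes": [(32,), (64, 32)], "alpha": [1e-4, 1e-3]},
    cv=3,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "cv accuracy:", grid.best_score_)
```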

18 pages, 3998 KiB  
Article
A Few-Shot Learning-Based Material Recognition Scheme Using Smartphones
by Yeonju Kim, Jeonghyeon Yoon and Seungku Kim
Appl. Sci. 2025, 15(1), 430; https://doi.org/10.3390/app15010430 - 5 Jan 2025
Viewed by 905
Abstract
This study proposes FSMR, a material recognition scheme designed to expand context information about locations in context recognition services. FSMR identifies the material in contact with the smartphone and determines the object based on this information to obtain location data. When the smartphone sends vibrations into the object it touches, the vibration signals change according to the unique properties of the material, and the reflected signals are measured using an accelerometer. Based on the fact that the measured sensor values have distinct characteristics for each material, deep learning techniques are applied to classify the material and determine the object. Existing research on material and object recognition using smartphone vibrations and accelerometers often requires vast amounts of training data for deep learning-based models, making such approaches challenging to apply in real-world applications. To address this issue, this study employs few-shot learning and data augmentation to significantly reduce the amount of training data required. The evaluation results show that FSMR achieved classification accuracies of up to 72.03% and 83.63% when trained with data collected over 1 s and 5 s, respectively.
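
A minimal sketch of the few-shot flavour of this approach, using nearest-prototype classification over synthetic accelerometer feature vectors; FSMR's actual embedding network and augmentation steps are not reproduced here.

```python
# Prototype-based few-shot classification: each material's prototype is the
# mean of a few support samples; queries go to the nearest prototype.
import numpy as np

rng = np.random.default_rng(0)

def fake_material_features(center, k, dim=8):
    # Stand-in for learned embeddings of vibration/accelerometer signals.
    return center + 0.3 * rng.normal(size=(k, dim))

centers = {m: rng.normal(size=8) for m in ["wood", "metal", "glass"]}
support = {m: fake_material_features(c, k=5) for m, c in centers.items()}
prototypes = {m: s.mean(axis=0) for m, s in support.items()}

query = fake_material_features(centers["metal"], k=1)[0]
pred = min(prototypes, key=lambda m: np.linalg.norm(query - prototypes[m]))
print("predicted material:", pred)  # expected: metal
```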

18 pages, 6063 KiB  
Article
Development of Artificial Intelligent-Based Methodology to Prepare Input for Estimating Vehicle Emissions
by Elif Yavuz, Alihan Öztürk, Nedime Gaye Nur Balkanlı, Şeref Naci Engin and S. Levent Kuzu
Appl. Sci. 2024, 14(23), 11175; https://doi.org/10.3390/app142311175 - 29 Nov 2024
Viewed by 728
Abstract
Machine learning has significantly advanced traffic surveillance and management, with YOLO (You Only Look Once) being a prominent Convolutional Neural Network (CNN) algorithm for vehicle detection. This study utilizes YOLO version 7 (YOLOv7) combined with the Kalman-based SORT (Simple Online and Real-time Tracking) algorithm as one of the models in our experiments for real-time vehicle identification. We developed the "ISTraffic" dataset and include an overview of existing vehicle detection datasets, which often suffer from incomplete annotations and limited diversity; "ISTraffic" addresses these issues with detailed and extensive annotations. The dataset is meticulously annotated, ensuring high-quality labels for every visible object, including those that are truncated, obscured, or extremely small. With 36,841 annotated examples and an average of 32.7 annotations per image, it offers extensive coverage and dense annotations, making it highly valuable for object detection and tracking applications ranging from autonomous driving to surveillance. Using this dataset, the YOLOv7 model demonstrated high accuracy in detecting various vehicle types, even under challenging conditions, underscoring the dataset's value for training robust vehicle detection models and for future research in this field. Our comparative analysis evaluated YOLOv7 against its variants, YOLOv7x and YOLOv7-tiny, using both the "ISTraffic" dataset and the COCO (Common Objects in Context) benchmark. YOLOv7x outperformed the others with a mAP@0.5 of 0.87, precision of 0.89, and recall of 0.84, a 35% performance improvement over COCO. Performance varied under different conditions: daytime yielded higher accuracy than night-time and rainy weather, where vehicle headlights affected object contours. Despite effective vehicle detection and counting, tracking high-speed vehicles remains a challenge. Additionally, the algorithm's deep learning-based estimates of emissions (CO, NO, NO2, NOx, PM2.5, and PM10) were 7.7% to 10.1% lower than ground truth.
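
The SORT half of the pipeline rests on a per-track Kalman filter. The sketch below runs the predict/update cycle with filterpy on a constant-velocity bounding-box-centre model; the detections are faked, and YOLOv7 inference and IoU-based association are omitted.

```python
# Kalman prediction/update cycle at the core of SORT-style tracking.
import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=4, dim_z=2)  # state: [cx, cy, vx, vy]
dt = 1.0
kf.F = np.array([[1, 0, dt, 0],
                 [0, 1, 0, dt],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1]], dtype=float)
kf.H = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0]], dtype=float)
kf.R *= 5.0     # measurement noise: detector jitter
kf.P *= 100.0   # high initial uncertainty
kf.x = np.array([100.0, 50.0, 0.0, 0.0])

# Simulated per-frame detections of a vehicle moving right.
for frame, det in enumerate([(102, 50), (105, 51), (108, 50), (111, 52)]):
    kf.predict()              # propagate state with the motion model
    kf.update(np.array(det))  # correct with the detector measurement
    print(f"frame {frame}: centre estimate = {kf.x[:2].round(1)}")
```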

23 pages, 720 KiB  
Article
Beyond xG: A Dual Prediction Model for Analyzing Player Performance Through Expected and Actual Goals in European Soccer Leagues
by Davronbek Malikov and Jaeho Kim
Appl. Sci. 2024, 14(22), 10390; https://doi.org/10.3390/app142210390 - 12 Nov 2024
Viewed by 3572
Abstract
Soccer is evolving into a science rather than just a sport, driven by intense competition between professional teams. This transformation requires efforts beyond physical training, including strategic planning, data analysis, and advanced metrics. Coaches and teams increasingly use sophisticated methods and data-driven insights to enhance decision-making. Analyzing team performance is crucial to prepare players and coaches, enabling targeted training and strategic adjustments. Expected goals (xG) analysis plays a key role in assessing team and individual player performance, providing nuanced insights into on-field actions and opportunities. This approach allows coaches to optimize tactics and lineup choices beyond traditional scorelines. However, relying solely on xG may not give a full picture of player performance, as a higher xG does not always translate into more goals due to the intricacies and variability of in-game situations. This paper seeks to refine performance assessments by incorporating predictions for both expected goals (xG) and actual goals (aG). Using this new model, we consider a wider variety of factors to provide a more comprehensive evaluation of players and teams. Another major focus of our study is a method for selecting and categorizing players based on their predicted xG and aG performance. Because we assess expected and actual goals for each individual game, we work with expected goals per game (xGg) and actual goals per game (aGg). We employ regression machine learning models, particularly ridge regression, which demonstrates strong performance in forecasting xGg and aGg, outperforming other models in our comparative assessment. Ridge regression's ability to handle overlapping and correlated variables makes it an ideal choice for our analysis. This approach improves prediction accuracy and provides actionable insights for coaches and analysts seeking to optimize team performance. By using features constructed with various methods in the dataset, we improve our model's performance by as much as 12%. These features offer a more detailed understanding of player performance in specific leagues and roles, improving the model's accuracy from 83% to nearly 95%, as indicated by the R-squared metric. Furthermore, our research introduces a player selection methodology based on predicted xG and aG, as determined by our proposed model, and categorizes top players into two groups: efficient scorers and consistent performers. These forecasts can guide strategic decisions, player selection, and training approaches, ultimately enhancing team performance and success.
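
Since ridge regression is the paper's model of choice, a compact sketch of the per-game setup may help; the features and the synthetic target below are placeholders, not the study's engineered features or leagues.

```python
# Ridge regression for a per-game expected-goals-style target.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical per-game features: shots, shots on target, touches in box.
shots = rng.poisson(3.0, n).astype(float)
on_target = np.minimum(shots, rng.poisson(1.5, n)).astype(float)
box_touches = rng.poisson(8.0, n).astype(float)
X = np.column_stack([shots, on_target, box_touches])
xgg = 0.08 * shots + 0.15 * on_target + 0.02 * box_touches + rng.normal(0, 0.1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, xgg, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0)  # L2 penalty handles correlated features gracefully
model.fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, model.predict(X_te)))
```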

20 pages, 2877 KiB  
Article
Impact of Sound and Image Features in ASMR on Emotional and Physiological Responses
by Yubin Kim, Ayoung Cho, Hyunwoo Lee and Mincheol Whang
Appl. Sci. 2024, 14(22), 10223; https://doi.org/10.3390/app142210223 - 7 Nov 2024
Viewed by 2846
Abstract
As media consumption through electronic devices increases, there is growing interest in ASMR videos, known for inducing relaxation and positive emotional states. However, the effectiveness of ASMR varies depending on each video’s characteristics. This study identifies key sound and image features that evoke specific emotional responses. ASMR videos were categorized into two groups: high valence–low relaxation (HVLR) and low valence–high relaxation (LVHR). Subjective evaluations, along with physiological data such as electroencephalography (EEG) and heart rate variability (HRV), were collected from 31 participants to provide objective evidence of emotional and physiological responses. The results showed that both HVLR and LVHR videos can induce relaxation and positive emotions, but the intensity varies depending on the video’s characteristics. LVHR videos have sound frequencies between 50 and 500 Hz, brightness levels of 20 to 30%, and a higher ratio of green to blue. These videos led to 45% greater delta wave activity in the frontal lobe and a tenfold increase in HF HRV, indicating stronger relaxation. HVLR videos feature sound frequencies ranging from 500 to 10,000 Hz, brightness levels of 60 to 70%, and a higher ratio of yellow to green. These videos resulted in 1.2 times higher beta wave activity in the frontal lobe and an increase in LF HRV, indicating greater cognitive engagement and positive arousal. Participants’ subjective reports were consistent with these physiological responses, with LVHR videos evoking feelings of calmness and HVLR videos inducing more vibrant emotions. These findings provide a foundation for creating ASMR content with specific emotional outcomes and offer a framework for researchers to achieve consistent results. By defining sound and image characteristics along with emotional keywords, this study provides practical guidance for content creators and enhances user understanding of ASMR videos.
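
One building block implied here is band-power estimation from EEG. The sketch below computes delta- and beta-band power with Welch's method on a synthetic channel; the paper's actual electrode setup, HRV processing, and statistics are not modelled.

```python
# Delta- and beta-band power from a (synthetic) EEG channel via Welch's PSD.
import numpy as np
from scipy.signal import welch

fs = 256  # Hz, assumed sampling rate
t = np.arange(0, 30, 1 / fs)
rng = np.random.default_rng(0)
# Synthetic "frontal" channel: delta (2 Hz) + beta (20 Hz) + noise.
eeg = 2.0 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 20 * t) \
      + rng.normal(0, 1, t.size)

f, psd = welch(eeg, fs=fs, nperseg=fs * 2)

def band_power(f, psd, lo, hi):
    mask = (f >= lo) & (f < hi)
    return np.trapz(psd[mask], f[mask])  # integrate PSD over the band

print("delta (0.5-4 Hz):", band_power(f, psd, 0.5, 4))
print("beta (13-30 Hz):", band_power(f, psd, 13, 30))
```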

19 pages, 750 KiB  
Article
SimCDL: A Simple Framework for Contrastive Dictionary Learning
by Denis C. Ilie-Ablachim and Bogdan Dumitrescu
Appl. Sci. 2024, 14(22), 10082; https://doi.org/10.3390/app142210082 - 5 Nov 2024
Viewed by 1149
Abstract
In this paper, we propose a novel approach to the dictionary learning (DL) initialization problem, leveraging the SimCLR framework from deep learning in a self-supervised manner. Dictionary learning seeks to represent signals as sparse combinations of dictionary atoms, but effective initialization remains challenging. By applying contrastive learning, we encourage similar representations for augmented versions of the same sample while distinguishing between different samples. This results in a more diverse and incoherent set of atoms, which enhances the performance of DL applications in classification and anomaly detection tasks. Our experiments across several benchmark datasets demonstrate the effectiveness of our method for improving dictionary learning initialization and its subsequent impact on performance in various applications.
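
The contrastive objective at the heart of SimCLR-style training is the NT-Xent loss; a generic NumPy version is sketched below. It is not the SimCDL code, and the random embeddings merely stand in for representations of augmented signal pairs.

```python
# NT-Xent (normalised temperature-scaled cross-entropy) contrastive loss.
import numpy as np

def nt_xent(z, tau=0.5):
    """z: (2N, d) embeddings where rows 2k and 2k+1 are augmented views
    of the same sample. Returns the mean contrastive loss."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    # log-softmax over each row
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n2 = z.shape[0]
    pos = np.arange(n2) ^ 1  # partner index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n2), pos].mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
views = np.empty((16, 16))
views[0::2] = anchors + 0.05 * rng.normal(size=anchors.shape)
views[1::2] = anchors + 0.05 * rng.normal(size=anchors.shape)
print("loss on aligned pairs:", nt_xent(views))
```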

17 pages, 2458 KiB  
Article
Data Augmentation Method Using Room Transfer Function for Monitoring of Domestic Activities
by Minhan Kim and Seokjin Lee
Appl. Sci. 2024, 14(21), 9644; https://doi.org/10.3390/app14219644 - 22 Oct 2024
Viewed by 943
Abstract
Monitoring domestic activities helps us to understand user behaviors in indoor environments and has garnered interest in context-aware computing. In the field of acoustics, this goal has been pursued through studies employing machine learning techniques, which are widely used for classification tasks involving sound recognition. Machine learning typically achieves better performance with large amounts of high-quality training data. Given the high cost of data collection, development datasets often suffer from imbalanced data or a lack of high-quality samples, leading to performance degradation in machine learning models. The present study addresses this data issue through data augmentation. Specifically, since the proposed method targets indoor activities in domestic activity detection, room transfer functions were used for data augmentation. The results show that the proposed method achieves a 0.59% improvement in F1-Score (micro) over the baseline system on the development dataset. Additionally, on test data including microphones that were not used during training, the F1-Score improved by 0.78% over the baseline system. This demonstrates the enhanced generalization of the proposed method to samples having room transfer functions different from those of the training dataset.
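
The augmentation itself reduces to convolving a dry signal with a room impulse response. A minimal sketch, assuming a crude synthetic RIR rather than a measured room transfer function:

```python
# Room-transfer-function augmentation: convolve a dry clip with an RIR.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
rng = np.random.default_rng(0)
dry = rng.normal(0, 0.1, fs * 2)  # placeholder for a domestic-activity clip

# Synthetic RIR: exponentially decaying noise tail approximating reverberation.
t = np.arange(int(0.3 * fs)) / fs
rir = rng.normal(size=t.size) * np.exp(-t / 0.08)
rir /= np.abs(rir).max()

wet = fftconvolve(dry, rir, mode="full")[: dry.size]
wet *= np.abs(dry).max() / (np.abs(wet).max() + 1e-12)  # keep levels comparable
print("augmented clip shape:", wet.shape)
```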

15 pages, 11845 KiB  
Article
Situational Awareness Classification Based on EEG Signals and Spiking Neural Network
by Yakir Hadad, Moshe Bensimon, Yehuda Ben-Shimol and Shlomo Greenberg
Appl. Sci. 2024, 14(19), 8911; https://doi.org/10.3390/app14198911 - 3 Oct 2024
Cited by 1 | Viewed by 1483
Abstract
Situational awareness detection and the characterization of mental states play a vital role in medicine and many other fields. An electroencephalogram (EEG) is one of the most effective tools for identifying and analyzing cognitive stress. Yet, the measurement, interpretation, and classification of EEG signals are challenging tasks. This study introduces a novel machine learning-based approach to assist in evaluating situational awareness detection using EEG signals and spiking neural networks (SNNs) based on a unique spike continuous-time neuron (SCTN). The implemented biologically inspired SNN architecture is used for effective EEG feature extraction by applying time–frequency analysis techniques and allows adept detection and analysis of the various frequency components embedded in the different EEG sub-bands. The EEG signal is encoded into spikes and then fed into an SNN model, which is well suited to the sequential nature of EEG data. We utilize the SCTN-based resonator for EEG feature extraction in the frequency domain, which demonstrates high correlation with classical FFT features. A new SCTN-based 2D neural network is introduced for efficient EEG feature mapping, aiming to achieve a spatial representation of each EEG sub-band. To validate and evaluate the performance of the proposed approach, a common, publicly available EEG dataset is used. The experimental results show that, by using the extracted EEG frequency features and the SCTN-based SNN classifier, mental states can be accurately classified with an average accuracy of 96.8% on this dataset. Our proposed method outperforms existing machine learning-based methods and demonstrates the advantages of using SNNs for situational awareness detection and mental state classification.
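
Before an SNN can consume EEG, the signal must be converted to spikes. The sketch below uses simple delta (threshold-crossing) encoding as a stand-in; the paper's SCTN-based resonators are considerably more sophisticated.

```python
# Delta encoding: emit +1/-1 spikes on threshold crossings of the signal.
import numpy as np

def delta_encode(signal, threshold=0.1):
    """Emit +1/-1 spikes whenever the signal moves more than `threshold`
    away from the last encoded level; 0 otherwise."""
    spikes = np.zeros(signal.size, dtype=int)
    level = signal[0]
    for i, s in enumerate(signal):
        if s - level > threshold:
            spikes[i], level = 1, level + threshold
        elif level - s > threshold:
            spikes[i], level = -1, level - threshold
    return spikes

fs = 256
t = np.arange(0, 2, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t)  # 10 Hz alpha-band placeholder
spk = delta_encode(eeg, threshold=0.05)
print("spike counts:", (spk == 1).sum(), "up,", (spk == -1).sum(), "down")
```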

18 pages, 2376 KiB  
Article
Markov-Modulated Poisson Process Modeling for Machine-to-Machine Heterogeneous Traffic
by Ahmad Hani El Fawal, Ali Mansour and Abbass Nasser
Appl. Sci. 2024, 14(18), 8561; https://doi.org/10.3390/app14188561 - 23 Sep 2024
Viewed by 1481
Abstract
Theoretical mathematics is a key evolution factor of artificial intelligence (AI). Nowadays, representing a smart system as a mathematical model helps to analyze any system under development and supports different case studies found in real life. Additionally, the Markov chain has shown itself to be an invaluable tool for decision-making systems, natural language processing, and predictive modeling. In the Internet of Things (IoT), Machine-to-Machine (M2M) traffic necessitates new traffic models due to its unique patterns and different goals. In this context, we have two types of modeling: (1) source traffic modeling, used to design stochastic processes so that they match the behavior of physical quantities of measured data traffic (e.g., video, data, voice), and (2) aggregated traffic modeling, which refers to the process of combining multiple small packets into a single packet in order to reduce the header overhead in the network. In IoT studies, balancing model accuracy while managing a large number of M2M devices is a heavy challenge for academia. On the one hand, source traffic models are more competitive than aggregated traffic models because of their dependability; on the other hand, their complexity is expected to make managing the exponential growth of M2M devices difficult. In this paper, we propose a Markov-Modulated Poisson Process (MMPP) framework to explore the effects of heterogeneous Human-to-Human (H2H) and M2M traffic. As a tool for stochastic processes, we employ Markov chains to characterize the coexistence of H2H and M2M traffic. Using the traditional evolved Node B (eNodeB), our simulation results show that the network’s service completion rate suffers significantly: in the worst-case scenario, when an accumulative storm of M2M requests attempts to access the network simultaneously, the task completion rate drops to 8%. However, using our “Coexistence of Heterogeneous traffic Analyzer and Network Architecture for Long term evolution” (CHANAL) solution, we can achieve a service completion rate of 96%.
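
A two-state MMPP is easy to simulate directly: a hidden Markov chain switches the Poisson arrival rate, e.g. between a calm H2H regime and an M2M burst regime. The rates and sojourn times below are illustrative, not calibrated to the paper's scenarios.

```python
# Two-state Markov-Modulated Poisson Process: the hidden state picks the rate.
import numpy as np

rng = np.random.default_rng(0)
rates = [2.0, 50.0]   # arrivals/sec in state 0 (calm) and state 1 (burst)
switch = [0.1, 0.5]   # state-exit rates: mean sojourns of 10 s and 2 s

t, horizon, state = 0.0, 100.0, 0
arrivals = []
while t < horizon:
    dwell = rng.exponential(1.0 / switch[state])  # time spent in this state
    seg_end = min(t + dwell, horizon)
    # Conditional on the sojourn, arrivals are Poisson at the state's rate,
    # placed uniformly within the segment.
    n = rng.poisson(rates[state] * (seg_end - t))
    arrivals.extend(np.sort(rng.uniform(t, seg_end, n)))
    t, state = seg_end, 1 - state

print(f"{len(arrivals)} arrivals over {horizon:.0f}s "
      f"(pure state-0 traffic would average {rates[0] * horizon:.0f})")
```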

26 pages, 7340 KiB  
Article
Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement
by Tanni Das, Xilong Liang and Kiho Choi
Appl. Sci. 2024, 14(18), 8276; https://doi.org/10.3390/app14188276 - 13 Sep 2024
Viewed by 1638
Abstract
Advanced video codecs such as High Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC) are vital for streaming high-quality online video content, as they compress and transmit data efficiently. However, these codecs can occasionally degrade video quality by adding undesirable artifacts such as blockiness, blurriness, and ringing, which can detract from the viewer’s experience. To ensure a seamless and engaging video experience, it is essential to remove these artifacts, which improves viewer comfort and engagement. In this paper, we propose a deep feature fusion-based convolutional neural network (CNN) post-processing architecture (VVC-PPFF) to further enhance the performance of VVC. The proposed network, VVC-PPFF, harnesses the power of CNNs to enhance decoded frames, significantly improving the coding efficiency of the state-of-the-art VVC video coding standard. By combining deep features from early and later convolution layers, the network learns to extract both low-level and high-level features, resulting in more generalized outputs that adapt to different quantization parameter (QP) values. The proposed VVC-PPFF network achieves outstanding performance, with Bjøntegaard Delta Rate (BD-Rate) improvements of 5.81% and 6.98% for the luma component in random access (RA) and low-delay (LD) configurations, respectively, while also boosting peak signal-to-noise ratio (PSNR).
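
The early/late feature-fusion idea can be sketched in a few lines of PyTorch: concatenate shallow and deep feature maps before a residual reconstruction head. Layer counts and widths below are arbitrary; this is not the VVC-PPFF network itself.

```python
# Toy post-filter with early/deep feature fusion and a residual output.
import torch
import torch.nn as nn

class FusionPostFilter(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.early = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.deep = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        # Fusion: early (low-level) and deep (high-level) features, concatenated.
        self.head = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, x):
        e = self.early(x)
        d = self.deep(e)
        residual = self.head(torch.cat([e, d], dim=1))
        return x + residual  # enhancement residual over the decoded frame

frame = torch.rand(1, 1, 64, 64)  # stand-in for a decoded luma block
print(FusionPostFilter()(frame).shape)  # torch.Size([1, 1, 64, 64])
```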

19 pages, 10886 KiB  
Article
Advancing Nighttime Object Detection through Image Enhancement and Domain Adaptation
by Chenyuan Zhang and Deokwoo Lee
Appl. Sci. 2024, 14(18), 8109; https://doi.org/10.3390/app14188109 - 10 Sep 2024
Viewed by 1804
Abstract
Due to the lack of annotations for nighttime low-light images, object detection in low-light images has always been a challenging problem, and achieving high-precision results at night remains difficult. We aim to use a single nighttime dataset to perform knowledge distillation while improving the detection accuracy of object detection models under nighttime low-light conditions and reducing their computational cost, especially for small targets and objects contaminated by peculiar nighttime lighting. This paper proposes a Nighttime Unsupervised Domain Adaptation Network (NUDN) based on knowledge distillation to address these issues. To improve detection accuracy on nighttime images, high-confidence bounding box predictions from the teacher and region proposals from the student are first fused, allowing the teacher to perform better in subsequent training and generating a combination of high-confidence and low-confidence pseudo-labels. This combined feature information is used to guide model training, enabling the model to extract feature information from nighttime low-light images similar to that of source images. Nighttime images and pseudo-labels undergo random size transformations before being used as input for the student, enhancing the model’s generalization across different scales. To address the scarcity of nighttime datasets, we propose a nighttime-specific augmentation pipeline called LightImg. This pipeline enhances nighttime features, transforming them into daytime-like features and reducing issues such as backlighting, uneven illumination, and dim nighttime light, enabling cross-domain research using existing nighttime datasets. Our experimental results show that NUDN can significantly improve nighttime low-light object detection accuracy on the SHIFT and ExDark datasets. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of our work.
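
The pseudo-label fusion step can be caricatured as confidence-based splitting of teacher predictions; the thresholds and box format below are assumptions, and NUDN's actual fusion with student region proposals is more involved.

```python
# Split teacher detections into high-confidence "hard" and softer pseudo-labels.
def split_pseudo_labels(teacher_boxes, hi_thresh=0.8, lo_thresh=0.3):
    """teacher_boxes: list of (x1, y1, x2, y2, score, cls) tuples."""
    high = [b for b in teacher_boxes if b[4] >= hi_thresh]
    soft = [b for b in teacher_boxes if lo_thresh <= b[4] < hi_thresh]
    return high, soft  # boxes below lo_thresh are discarded as noise

boxes = [(10, 10, 50, 60, 0.92, 0), (80, 20, 120, 90, 0.55, 1),
         (5, 5, 15, 15, 0.12, 0)]
hard, soft = split_pseudo_labels(boxes)
print(len(hard), "hard labels,", len(soft), "soft labels")
```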

13 pages, 5330 KiB  
Article
ISAR Imaging Analysis of Complex Aerial Targets Based on Deep Learning
by Yifeng Wang, Jiaxing Hao, Sen Yang and Hongmin Gao
Appl. Sci. 2024, 14(17), 7708; https://doi.org/10.3390/app14177708 - 31 Aug 2024
Viewed by 1512
Abstract
Traditional range–instantaneous Doppler (RID) methods for maneuvering target imaging are hindered by low resolution and inadequate noise suppression. To address this, we propose a novel ISAR imaging method enhanced by deep learning, which incorporates the fundamental architecture of CapsNet along with two additional convolutional layers. Pre-training is conducted through the deep learning network to establish the reference mapping function. Subsequently, the trained network is integrated into the electromagnetic simulation software Feko 2019, utilizing a combination of geometric forms such as corner reflectors and Luneberg spheres for analysis. The results indicate that the derived ISAR imagery effectively characterizes complex aerial targets. A thorough analysis of the imaging results further corroborates the effectiveness and superiority of this approach. Both simulation and empirical data demonstrate that the method significantly enhances imaging resolution and noise suppression.

20 pages, 5395 KiB  
Article
Detection and Segmentation of Mouth Region in Stereo Stream Using YOLOv6 and DeepLab v3+ Models for Computer-Aided Speech Diagnosis in Children
by Agata Sage and Pawel Badura
Appl. Sci. 2024, 14(16), 7146; https://doi.org/10.3390/app14167146 - 14 Aug 2024
Cited by 2 | Viewed by 1368
Abstract
This paper describes a multistage framework for face image analysis in computer-aided speech diagnosis and therapy. Multimodal data processing frameworks have become a significant factor in supporting the treatment of speech disorders. Synchronous and asynchronous remote speech therapy approaches can use audio and video analysis of articulation to deliver robust indicators of disordered speech. Accurate segmentation of articulators in video frames is a vital step in this agenda. We use a dedicated data acquisition system to capture the stereovision stream during speech therapy examination in children. Our goal is to detect and accurately segment four objects in the mouth area (lips, teeth, tongue, and whole mouth) during relaxed speech and speech therapy exercises. Our database contains 17,913 frames from 76 preschool children. We apply a sequence of procedures employing artificial intelligence. For detection, we train the YOLOv6 (you only look once) model to detect each of the three objects under consideration. Then, we prepare the DeepLab v3+ segmentation model in a semi-supervised training mode. As preparing reliable expert annotations for video is labor-intensive, we first train the network using weak labels produced by an initial segmentation based on distance-regularized level set evolution over fuzzified images. Next, we fine-tune the model using a portion of manual ground-truth delineations. Each stage is thoroughly assessed using an independent test subset. The lips are detected almost perfectly (average precision and F1 score of 0.999), whereas the segmentation Dice index exceeds 0.83 for each articulator, with a top result of 0.95 for the whole mouth.
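
The Dice index quoted in the results is straightforward to compute on binary masks; a self-contained check with a synthetic mask pair:

```python
# Dice index = 2|A ∩ B| / (|A| + |B|) for boolean segmentation masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt = np.zeros((64, 64), dtype=bool)
gt[20:40, 10:50] = True       # "mouth" region ground truth
pred = np.zeros_like(gt)
pred[22:40, 12:50] = True     # slightly shifted prediction
print(f"Dice: {dice(pred, gt):.3f}")
```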

15 pages, 2056 KiB  
Article
Robust DOA Estimation Using Multi-Scale Fusion Network with Attention Mask
by Yuting Yan and Qinghua Huang
Appl. Sci. 2024, 14(11), 4488; https://doi.org/10.3390/app14114488 - 24 May 2024
Cited by 1 | Viewed by 1235
Abstract
To overcome the limitations of traditional methods in reverberant and noisy environments, a robust multi-scale fusion neural network with an attention mask is designed to improve direction-of-arrival (DOA) estimation accuracy for acoustic sources. It combines the benefits of deep learning and complex-valued operations to effectively deal with the interference of reverberation and noise in speech signals. The unique properties of complex-valued signals are exploited to fully capture inherent features, and rich information is preserved in the complex field. An attention mask module is designed to generate distinct masks for selectively focusing on and masking parts of the input. After that, the multi-scale fusion block efficiently captures multi-scale spatial features by stacking complex-valued convolutional layers with small kernels, and reduces module complexity through special branching operations. Experimental results demonstrate that the model achieves significant improvements over other methods for speaker localization in reverberant and noisy environments. It provides a new solution for DOA estimation for acoustic sources in different scenarios, which has significant theoretical and practical implications.
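
Complex-valued convolutions are commonly built from two real convolutions via (a+ib)(w_r+iw_i) = (a·w_r − b·w_i) + i(a·w_i + b·w_r); the PyTorch sketch below shows that construction. It illustrates the general technique only, not this paper's network.

```python
# Complex-valued 2D convolution assembled from two real Conv2d layers.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, real, imag):
        out_r = self.conv_r(real) - self.conv_i(imag)
        out_i = self.conv_i(real) + self.conv_r(imag)
        return out_r, out_i

# e.g. real/imaginary parts of a multi-channel STFT from a microphone array
real = torch.rand(1, 4, 129, 50)
imag = torch.rand(1, 4, 129, 50)
out_r, out_i = ComplexConv2d(4, 16)(real, imag)
print(out_r.shape, out_i.shape)  # torch.Size([1, 16, 129, 50]) twice
```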

Review


49 pages, 3154 KiB  
Review
An Investigation into the Utilisation of CNN with LSTM for Video Deepfake Detection
by Sarah Tipper, Hany F. Atlam and Harjinder Singh Lallie
Appl. Sci. 2024, 14(21), 9754; https://doi.org/10.3390/app14219754 - 25 Oct 2024
Cited by 5 | Viewed by 4377
Abstract
Video deepfake detection has emerged as a critical field within the broader domain of digital technologies, driven by the rapid proliferation of AI-generated media and the increasing threat of its misuse for deception and misinformation. The integration of Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) has proven to be a promising approach for improving video deepfake detection, achieving near-perfect accuracy. CNNs enable the effective extraction of spatial features from video frames, such as facial textures and lighting, while the LSTM analyses temporal patterns, detecting inconsistencies over time. This hybrid model enhances the ability to detect deepfakes by combining spatial and temporal analysis. However, existing research lacks systematic evaluations that comprehensively assess the effectiveness and optimal configurations of these models. Therefore, this paper provides a comprehensive review of video deepfake detection techniques utilising hybrid CNN-LSTM models. It systematically investigates state-of-the-art techniques, highlighting common feature extraction approaches and widely used datasets for training and testing. This paper also evaluates model performance across different datasets, identifies key factors influencing detection accuracy, and explores how CNN-LSTM models can be optimised. It further compares CNN-LSTM models with non-LSTM approaches, addresses implementation challenges, and proposes solutions for them. Lastly, open issues and future research directions for video deepfake detection using CNN-LSTM are discussed. This paper provides valuable insights for researchers and cyber security professionals by reviewing CNN-LSTM models for video deepfake detection, contributing to the advancement of robust and effective deepfake detection systems.
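
The hybrid architecture the review surveys follows a common skeleton: a CNN embeds each frame, an LSTM aggregates across time, and a linear head classifies. A generic PyTorch sketch with illustrative sizes:

```python
# Generic CNN-LSTM video classifier skeleton (illustrative sizes only).
import torch
import torch.nn as nn

class CnnLstmDetector(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # logit: fake vs. real

    def forward(self, video):              # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h, _) = self.lstm(feats)       # temporal aggregation
        return self.head(h[-1]).squeeze(-1)

clip = torch.rand(2, 8, 3, 112, 112)  # 2 clips of 8 frames each
print(CnnLstmDetector()(clip).shape)  # torch.Size([2])
```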
