AI, Machine Learning and Deep Learning in Signal Processing, 2nd Edition

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (25 March 2025) | Viewed by 28992

Special Issue Editors


Guest Editor
Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, 30170 Venice, Italy
Interests: computer vision; 3D reconstruction; machine learning; deep learning

Special Issue Information

Dear Colleagues,

In recent years, the field of signal processing has faced new challenges and paradigm shifts, driven by dramatic improvements in hardware performance and the exponential growth in the number of devices interconnected via the Internet. As a consequence, the tremendous data volumes generated by such applications have to be analyzed and processed to provide useful, reliable and meaningful information.

Artificial intelligence (AI), and in particular machine (deep) learning, provides novel tools to be exploited in the field of signal processing. Consequently, new approaches, methods, theories, and tools have to be developed by the signal processing community to analyze and make sense of these data volumes.

This Special Issue aims to attract manuscripts presenting novel methods and innovative applications of AI and machine learning (including deep learning) on topics in the signal processing area. Such topics include (but are not limited to) multimedia systems, audio and video processing, and augmented and virtual reality. The objective of the Special Issue is to bring together recent high-quality works in AI to promote key advances in the signal processing areas covered by the journal and to provide reviews of the state of the art in these emerging domains.

Dr. Mara Pistellato
Prof. Dr. Byung-Gyu Kim
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, authors can proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers are published continuously in the journal (as soon as accepted) and listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence (AI)
  • deep learning
  • machine learning
  • signal processing
  • image and video processing
  • audio and acoustic signal processing
  • biomedical signal processing
  • speech processing
  • multimedia signal processing
  • multidimensional signal processing
  • augmented reality
  • virtual reality

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found on MDPI's website.

Published Papers (16 papers)


Research


18 pages, 1051 KiB  
Article
A Lightweight Received Signal Strength Indicator Estimation Model for Low-Power Internet of Things Devices in Constrained Indoor Networks
by Samrah Arif, M. Arif Khan and Sabih ur Rehman
Appl. Sci. 2025, 15(7), 3535; https://doi.org/10.3390/app15073535 - 24 Mar 2025
Viewed by 262
Abstract
The Internet of Things (IoT) is a revolutionary advancement that automates daily tasks through interaction between the digital and physical realms via a network composed mostly of Low-Power IoT (LP-IoT) devices. For an IoT ecosystem, reliable wireless connectivity is essential to ensure the optimal operation of LP-IoT devices, especially considering their limited resource capacity. This reliability is often achieved through channel estimation, an essential aspect of optimising signal transmission. Considering the importance of reliable channel estimation for constrained IoT devices, we developed two lightweight yet effective channel estimation models based on the Random Forest Regressor (RFR). These two models, termed the Feature-based RFR(F) and Sequence-based RFR(S) methods, utilise the Received Signal Strength Indicator (RSSI) as a fundamental channel metric to enhance the efficiency and reliability of channel estimation in constrained LP-IoT devices. The models' performance was assessed by comparing them with the state of the art and with our previously developed Artificial Neural Network (ANN)-based method. The experimental results show that the RFR(F) method achieves approximately a 39.62% improvement in Mean Squared Error (MSE) over the Feature-based ANN(F) model and a 37.86% improvement over the state of the art. Similarly, the RFR(S) model shows an improvement in MSE of 24.9% compared to the Sequence-based ANN(S) model and an 80.59% improvement compared to the leading existing methods. We also evaluated the lightweight characteristics of our RFR(F) and RFR(S) methods by deploying them on a Raspberry Pi 4 Model B to demonstrate their practicality for LP-IoT devices.
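
As a rough illustration of the regression setup this abstract describes, the sketch below fits a small Random Forest to predict RSSI from hand-picked channel features. The features, the synthetic path-loss data, and all parameters are illustrative assumptions, not the authors' dataset or configuration.

```python
# Minimal sketch of a feature-based Random Forest RSSI estimator.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: distance (m), transmit power (dBm), obstruction count.
X = np.column_stack([
    rng.uniform(1, 50, n),
    rng.choice([0, 4, 8], n),
    rng.integers(0, 5, n),
])
# Synthetic RSSI from a log-distance path-loss model plus noise.
y = X[:, 1] - 40 - 20 * np.log10(X[:, 0]) - 3 * X[:, 2] + rng.normal(0, 2, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
# A shallow forest keeps the model small enough for constrained devices.
model = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
model.fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, model.predict(X_te)))
```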

18 pages, 992 KiB  
Article
Baby Cry Classification Using Structure-Tuned Artificial Neural Networks with Data Augmentation and MFCC Features
by Tayyip Ozcan and Hafize Gungor
Appl. Sci. 2025, 15(5), 2648; https://doi.org/10.3390/app15052648 - 1 Mar 2025
Viewed by 946
Abstract
Babies express their needs, such as hunger, discomfort, or sleeplessness, by crying. However, interpreting these cries correctly can be challenging for parents. Misinterpretation can delay the response to the baby's needs, increase parents' stress levels, and negatively affect the baby's development. In this paper, an integrated system for the classification of baby sounds is proposed. The proposed method includes data augmentation, feature extraction, hyperparameter tuning, and model training steps. In the first step, various data augmentation techniques were applied to increase the training data's diversity and strengthen the model's generalization capacity. In the second step, the Mel-Frequency Cepstral Coefficients (MFCC) method was used to extract meaningful and distinctive features from the sound data. MFCC represents sound signals based on the frequencies the human ear perceives and provides a strong basis for classification. The obtained features were classified with an artificial neural network (ANN) model with optimized hyperparameters. The hyperparameter optimization of the model was performed using the grid search algorithm, and the most appropriate parameters were determined. The training, validation, and test sets were split at ratios of 75%, 10%, and 15%, respectively. The model's performance was tested on mixed sounds. The test results were analyzed, and the proposed method showed the highest performance, with a 90% accuracy rate. In a comparison study with an artificial neural network (ANN) on the Donate a Cry dataset, the F1 score was reported as 46.99% and the test accuracy as 85.93%. In this paper, additional techniques such as data augmentation, hyperparameter tuning, and MFCC feature extraction allowed the model accuracy to reach 90%. The proposed method offers an effective solution for classifying baby sounds and brings a new approach to this field.
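
The pipeline lends itself to a compact sketch: extract MFCC vectors, then grid-search an ANN over a small hyperparameter grid. Everything below (the synthetic audio, grid values, and network sizes) is assumed for illustration and does not reproduce the paper's augmentation or tuning.

```python
# Illustrative MFCC + grid-searched MLP classification pipeline.
import numpy as np
import librosa
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

def mfcc_vector(y, sr=16000, n_mfcc=13):
    # Mean MFCCs over time give one fixed-length vector per clip.
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return m.mean(axis=1)

rng = np.random.default_rng(0)
# Two fake "cry classes": tones at different fundamentals plus noise.
clips, labels = [], []
for label, f0 in [(0, 300.0), (1, 500.0)]:
    for _ in range(20):
        t = np.linspace(0, 1.0, 16000, endpoint=False)
        y = np.sin(2 * np.pi * f0 * t) + 0.3 * rng.normal(size=t.size)
        clips.append(mfcc_vector(y))
        labels.append(label)

X, y = np.array(clips), np.array(labels)
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    {"hidden_layer_sizes": [(32,), (64, 32)], "alpha": [1e-4, 1e-3]},
    cv=3,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "cv accuracy:", grid.best_score_)
```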

18 pages, 3998 KiB  
Article
A Few-Shot Learning-Based Material Recognition Scheme Using Smartphones
by Yeonju Kim, Jeonghyeon Yoon and Seungku Kim
Appl. Sci. 2025, 15(1), 430; https://doi.org/10.3390/app15010430 - 5 Jan 2025
Viewed by 905
Abstract
This study proposes FSMR, a material recognition scheme designed to expand context information about locations in context recognition services. FSMR identifies the material in contact with the smartphone and determines the object based on this information to obtain location data. When the smartphone sends vibrations into the object it touches, the vibration signals change according to the unique properties of the material, and the reflected signals are measured using an accelerometer. Based on the fact that the measured sensor values have distinct characteristics for each material, deep learning techniques are applied to classify the material and determine the object. Existing research on material and object recognition using smartphone vibrations and accelerometers often requires vast amounts of training data for deep learning-based models, making such approaches challenging to apply in real-world applications. To address this issue, this study employs few-shot learning and data augmentation to significantly reduce the amount of training data required. The evaluation results show that FSMR achieved classification accuracies of up to 72.03% and 83.63% when trained with data collected over 1 s and 5 s, respectively.
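
A minimal sketch of the few-shot flavour of this approach, using nearest-prototype classification over synthetic accelerometer feature vectors; FSMR's actual embedding network and augmentation steps are not reproduced here.

```python
# Prototype-based few-shot classification: each material's prototype is the
# mean of a few support samples; queries go to the nearest prototype.
import numpy as np

rng = np.random.default_rng(0)

def fake_material_features(center, k, dim=8):
    # Stand-in for learned embeddings of vibration/accelerometer signals.
    return center + 0.3 * rng.normal(size=(k, dim))

centers = {m: rng.normal(size=8) for m in ["wood", "metal", "glass"]}
support = {m: fake_material_features(c, k=5) for m, c in centers.items()}
prototypes = {m: s.mean(axis=0) for m, s in support.items()}

query = fake_material_features(centers["metal"], k=1)[0]
pred = min(prototypes, key=lambda m: np.linalg.norm(query - prototypes[m]))
print("predicted material:", pred)  # expected: metal
```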

18 pages, 6063 KiB  
Article
Development of Artificial Intelligent-Based Methodology to Prepare Input for Estimating Vehicle Emissions
by Elif Yavuz, Alihan Öztürk, Nedime Gaye Nur Balkanlı, Şeref Naci Engin and S. Levent Kuzu
Appl. Sci. 2024, 14(23), 11175; https://doi.org/10.3390/app142311175 - 29 Nov 2024
Viewed by 728
Abstract
Machine learning has significantly advanced traffic surveillance and management, with YOLO (You Only Look Once) being a prominent Convolutional Neural Network (CNN) algorithm for vehicle detection. This study utilizes YOLO version 7 (YOLOv7) combined with the Kalman-based SORT (Simple Online and Real-time Tracking) algorithm as one of the models in our experiments for real-time vehicle identification. We developed the "ISTraffic" dataset and include an overview of existing vehicle detection datasets, which often suffer from incomplete annotations and limited diversity; "ISTraffic" addresses these issues with detailed and extensive annotations. The dataset is meticulously annotated, ensuring high-quality labels for every visible object, including those that are truncated, obscured, or extremely small. With 36,841 annotated examples and an average of 32.7 annotations per image, it offers extensive coverage and dense annotations, making it highly valuable for object detection and tracking applications ranging from autonomous driving to surveillance. Using this dataset, the YOLOv7 model demonstrated high accuracy in detecting various vehicle types, even under challenging conditions, underscoring the dataset's value for training robust vehicle detection models and for future research in this field. Our comparative analysis evaluated YOLOv7 against its variants, YOLOv7x and YOLOv7-tiny, using both the "ISTraffic" dataset and the COCO (Common Objects in Context) benchmark. YOLOv7x outperformed the others with a mAP@0.5 of 0.87, precision of 0.89, and recall of 0.84, a 35% performance improvement over COCO. Performance varied under different conditions: daytime yielded higher accuracy than night-time and rainy weather, where vehicle headlights affected object contours. Despite effective vehicle detection and counting, tracking high-speed vehicles remains a challenge. Additionally, the algorithm's deep learning-based estimates of emissions (CO, NO, NO2, NOx, PM2.5, and PM10) were 7.7% to 10.1% lower than ground truth.
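
The SORT half of the pipeline rests on a per-track Kalman filter. The sketch below runs the predict/update cycle with filterpy on a constant-velocity bounding-box-centre model; the detections are faked, and YOLOv7 inference and IoU-based association are omitted.

```python
# Kalman prediction/update cycle at the core of SORT-style tracking.
import numpy as np
from filterpy.kalman import KalmanFilter

kf = KalmanFilter(dim_x=4, dim_z=2)  # state: [cx, cy, vx, vy]
dt = 1.0
kf.F = np.array([[1, 0, dt, 0],
                 [0, 1, 0, dt],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1]], dtype=float)
kf.H = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0]], dtype=float)
kf.R *= 5.0     # measurement noise: detector jitter
kf.P *= 100.0   # high initial uncertainty
kf.x = np.array([100.0, 50.0, 0.0, 0.0])

# Simulated per-frame detections of a vehicle moving right.
for frame, det in enumerate([(102, 50), (105, 51), (108, 50), (111, 52)]):
    kf.predict()              # propagate state with the motion model
    kf.update(np.array(det))  # correct with the detector measurement
    print(f"frame {frame}: centre estimate = {kf.x[:2].round(1)}")
```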

23 pages, 720 KiB  
Article
Beyond xG: A Dual Prediction Model for Analyzing Player Performance Through Expected and Actual Goals in European Soccer Leagues
by Davronbek Malikov and Jaeho Kim
Appl. Sci. 2024, 14(22), 10390; https://doi.org/10.3390/app142210390 - 12 Nov 2024
Viewed by 3572
Abstract
Soccer is evolving into a science rather than just a sport, driven by intense competition between professional teams. This transformation requires efforts beyond physical training, including strategic planning, data analysis, and advanced metrics. Coaches and teams increasingly use sophisticated methods and data-driven insights to enhance decision-making. Analyzing team performance is crucial to prepare players and coaches, enabling targeted training and strategic adjustments. Expected goals (xG) analysis plays a key role in assessing team and individual player performance, providing nuanced insights into on-field actions and opportunities. This approach allows coaches to optimize tactics and lineup choices beyond traditional scorelines. However, relying solely on xG may not give a full picture of player performance, as a higher xG does not always translate into more goals due to the intricacies and variability of in-game situations. This paper seeks to refine performance assessments by incorporating predictions for both expected goals (xG) and actual goals (aG). Using this new model, we consider a wider variety of factors to provide a more comprehensive evaluation of players and teams. Another major focus of our study is a method for selecting and categorizing players based on their predicted xG and aG performance. Because we assess expected and actual goals for each individual game, we work with expected goals per game (xGg) and actual goals per game (aGg). We employ regression machine learning models, particularly ridge regression, which demonstrates strong performance in forecasting xGg and aGg, outperforming other models in our comparative assessment. Ridge regression's ability to handle overlapping and correlated variables makes it an ideal choice for our analysis. This approach improves prediction accuracy and provides actionable insights for coaches and analysts seeking to optimize team performance. By using features constructed with various methods in the dataset, we improve our model's performance by as much as 12%. These features offer a more detailed understanding of player performance in specific leagues and roles, improving the model's accuracy from 83% to nearly 95%, as indicated by the R-squared metric. Furthermore, our research introduces a player selection methodology based on predicted xG and aG, as determined by our proposed model, and categorizes top players into two groups: efficient scorers and consistent performers. These forecasts can guide strategic decisions, player selection, and training approaches, ultimately enhancing team performance and success.
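
Since ridge regression is the paper's model of choice, a compact sketch of the per-game setup may help; the features and the synthetic target below are placeholders, not the study's engineered features or leagues.

```python
# Ridge regression for a per-game expected-goals-style target.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical per-game features: shots, shots on target, touches in box.
shots = rng.poisson(3.0, n).astype(float)
on_target = np.minimum(shots, rng.poisson(1.5, n)).astype(float)
box_touches = rng.poisson(8.0, n).astype(float)
X = np.column_stack([shots, on_target, box_touches])
xgg = 0.08 * shots + 0.15 * on_target + 0.02 * box_touches + rng.normal(0, 0.1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, xgg, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0)  # L2 penalty handles correlated features gracefully
model.fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, model.predict(X_te)))
```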

20 pages, 2877 KiB  
Article
Impact of Sound and Image Features in ASMR on Emotional and Physiological Responses
by Yubin Kim, Ayoung Cho, Hyunwoo Lee and Mincheol Whang
Appl. Sci. 2024, 14(22), 10223; https://doi.org/10.3390/app142210223 - 7 Nov 2024
Viewed by 2846
Abstract
As media consumption through electronic devices increases, there is growing interest in ASMR videos, known for inducing relaxation and positive emotional states. However, the effectiveness of ASMR varies depending on each video’s characteristics. This study identifies key sound and image features that evoke specific emotional responses. ASMR videos were categorized into two groups: high valence–low relaxation (HVLR) and low valence–high relaxation (LVHR). Subjective evaluations, along with physiological data such as electroencephalography (EEG) and heart rate variability (HRV), were collected from 31 participants to provide objective evidence of emotional and physiological responses. The results showed that both HVLR and LVHR videos can induce relaxation and positive emotions, but the intensity varies depending on the video’s characteristics. LVHR videos have sound frequencies between 50 and 500 Hz, brightness levels of 20 to 30%, and a higher ratio of green to blue. These videos led to 45% greater delta wave activity in the frontal lobe and a tenfold increase in HF HRV, indicating stronger relaxation. HVLR videos feature sound frequencies ranging from 500 to 10,000 Hz, brightness levels of 60 to 70%, and a higher ratio of yellow to green. These videos resulted in 1.2 times higher beta wave activity in the frontal lobe and an increase in LF HRV, indicating greater cognitive engagement and positive arousal. Participants’ subjective reports were consistent with these physiological responses, with LVHR videos evoking feelings of calmness and HVLR videos inducing more vibrant emotions. These findings provide a foundation for creating ASMR content with specific emotional outcomes and offer a framework for researchers to achieve consistent results. By defining sound and image characteristics along with emotional keywords, this study provides practical guidance for content creators and enhances user understanding of ASMR videos.
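
One building block implied here is band-power estimation from EEG. The sketch below computes delta- and beta-band power with Welch's method on a synthetic channel; the paper's actual electrode setup, HRV processing, and statistics are not modelled.

```python
# Delta- and beta-band power from a (synthetic) EEG channel via Welch's PSD.
import numpy as np
from scipy.signal import welch

fs = 256  # Hz, assumed sampling rate
t = np.arange(0, 30, 1 / fs)
rng = np.random.default_rng(0)
# Synthetic "frontal" channel: delta (2 Hz) + beta (20 Hz) + noise.
eeg = 2.0 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 20 * t) \
      + rng.normal(0, 1, t.size)

f, psd = welch(eeg, fs=fs, nperseg=fs * 2)

def band_power(f, psd, lo, hi):
    mask = (f >= lo) & (f < hi)
    return np.trapz(psd[mask], f[mask])  # integrate PSD over the band

print("delta (0.5-4 Hz):", band_power(f, psd, 0.5, 4))
print("beta (13-30 Hz):", band_power(f, psd, 13, 30))
```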

19 pages, 750 KiB  
Article
SimCDL: A Simple Framework for Contrastive Dictionary Learning
by Denis C. Ilie-Ablachim and Bogdan Dumitrescu
Appl. Sci. 2024, 14(22), 10082; https://doi.org/10.3390/app142210082 - 5 Nov 2024
Viewed by 1149
Abstract
In this paper, we propose a novel approach to the dictionary learning (DL) initialization problem, leveraging the SimCLR framework from deep learning in a self-supervised manner. Dictionary learning seeks to represent signals as sparse combinations of dictionary atoms, but effective initialization remains challenging. By applying contrastive learning, we encourage similar representations for augmented versions of the same sample while distinguishing between different samples. This results in a more diverse and incoherent set of atoms, which enhances the performance of DL applications in classification and anomaly detection tasks. Our experiments across several benchmark datasets demonstrate the effectiveness of our method for improving dictionary learning initialization and its subsequent impact on performance in various applications.
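
The contrastive objective at the heart of SimCLR-style training is the NT-Xent loss; a generic NumPy version is sketched below. It is not the SimCDL code, and the random embeddings merely stand in for representations of augmented signal pairs.

```python
# NT-Xent (normalised temperature-scaled cross-entropy) contrastive loss.
import numpy as np

def nt_xent(z, tau=0.5):
    """z: (2N, d) embeddings where rows 2k and 2k+1 are augmented views
    of the same sample. Returns the mean contrastive loss."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    # log-softmax over each row
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n2 = z.shape[0]
    pos = np.arange(n2) ^ 1  # partner index: 0<->1, 2<->3, ...
    return -log_prob[np.arange(n2), pos].mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
views = np.empty((16, 16))
views[0::2] = anchors + 0.05 * rng.normal(size=anchors.shape)
views[1::2] = anchors + 0.05 * rng.normal(size=anchors.shape)
print("loss on aligned pairs:", nt_xent(views))
```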

17 pages, 2458 KiB  
Article
Data Augmentation Method Using Room Transfer Function for Monitoring of Domestic Activities
by Minhan Kim and Seokjin Lee
Appl. Sci. 2024, 14(21), 9644; https://doi.org/10.3390/app14219644 - 22 Oct 2024
Viewed by 943
Abstract
Monitoring domestic activities helps us to understand user behaviors in indoor environments and has garnered interest in context-aware computing. In the field of acoustics, this goal has been pursued through studies employing machine learning techniques, which are widely used for classification tasks involving sound recognition. Machine learning typically achieves better performance with large amounts of high-quality training data. Given the high cost of data collection, development datasets often suffer from imbalanced data or a lack of high-quality samples, leading to performance degradation in machine learning models. The present study addresses this data issue through data augmentation. Specifically, since the proposed method targets indoor activities in domestic activity detection, room transfer functions were used for data augmentation. The results show that the proposed method achieves a 0.59% improvement in F1-Score (micro) over the baseline system on the development dataset. Additionally, on test data including microphones that were not used during training, the F1-Score improved by 0.78% over the baseline system. This demonstrates the enhanced generalization of the proposed method to samples having room transfer functions different from those of the training dataset.
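
The augmentation itself reduces to convolving a dry signal with a room impulse response. A minimal sketch, assuming a crude synthetic RIR rather than a measured room transfer function:

```python
# Room-transfer-function augmentation: convolve a dry clip with an RIR.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
rng = np.random.default_rng(0)
dry = rng.normal(0, 0.1, fs * 2)  # placeholder for a domestic-activity clip

# Synthetic RIR: exponentially decaying noise tail approximating reverberation.
t = np.arange(int(0.3 * fs)) / fs
rir = rng.normal(size=t.size) * np.exp(-t / 0.08)
rir /= np.abs(rir).max()

wet = fftconvolve(dry, rir, mode="full")[: dry.size]
wet *= np.abs(dry).max() / (np.abs(wet).max() + 1e-12)  # keep levels comparable
print("augmented clip shape:", wet.shape)
```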

15 pages, 11845 KiB  
Article
Situational Awareness Classification Based on EEG Signals and Spiking Neural Network
by Yakir Hadad, Moshe Bensimon, Yehuda Ben-Shimol and Shlomo Greenberg
Appl. Sci. 2024, 14(19), 8911; https://doi.org/10.3390/app14198911 - 3 Oct 2024
Cited by 1 | Viewed by 1483
Abstract
Situational awareness detection and the characterization of mental states play a vital role in medicine and many other fields. An electroencephalogram (EEG) is one of the most effective tools for identifying and analyzing cognitive stress. Yet, the measurement, interpretation, and classification of EEG signals are challenging tasks. This study introduces a novel machine learning-based approach to assist in evaluating situational awareness detection using EEG signals and spiking neural networks (SNNs) based on a unique spike continuous-time neuron (SCTN). The implemented biologically inspired SNN architecture is used for effective EEG feature extraction by applying time–frequency analysis techniques and allows adept detection and analysis of the various frequency components embedded in the different EEG sub-bands. The EEG signal is encoded into spikes and then fed into an SNN model, which is well suited to the sequential nature of EEG data. We utilize the SCTN-based resonator for EEG feature extraction in the frequency domain, which demonstrates high correlation with classical FFT features. A new SCTN-based 2D neural network is introduced for efficient EEG feature mapping, aiming to achieve a spatial representation of each EEG sub-band. To validate and evaluate the performance of the proposed approach, a common, publicly available EEG dataset is used. The experimental results show that, by using the extracted EEG frequency features and the SCTN-based SNN classifier, mental states can be accurately classified with an average accuracy of 96.8% on this dataset. Our proposed method outperforms existing machine learning-based methods and demonstrates the advantages of using SNNs for situational awareness detection and mental state classification.
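
Before an SNN can consume EEG, the signal must be converted to spikes. The sketch below uses simple delta (threshold-crossing) encoding as a stand-in; the paper's SCTN-based resonators are considerably more sophisticated.

```python
# Delta encoding: emit +1/-1 spikes on threshold crossings of the signal.
import numpy as np

def delta_encode(signal, threshold=0.1):
    """Emit +1/-1 spikes whenever the signal moves more than `threshold`
    away from the last encoded level; 0 otherwise."""
    spikes = np.zeros(signal.size, dtype=int)
    level = signal[0]
    for i, s in enumerate(signal):
        if s - level > threshold:
            spikes[i], level = 1, level + threshold
        elif level - s > threshold:
            spikes[i], level = -1, level - threshold
    return spikes

fs = 256
t = np.arange(0, 2, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t)  # 10 Hz alpha-band placeholder
spk = delta_encode(eeg, threshold=0.05)
print("spike counts:", (spk == 1).sum(), "up,", (spk == -1).sum(), "down")
```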

18 pages, 2376 KiB  
Article
Markov-Modulated Poisson Process Modeling for Machine-to-Machine Heterogeneous Traffic
by Ahmad Hani El Fawal, Ali Mansour and Abbass Nasser
Appl. Sci. 2024, 14(18), 8561; https://doi.org/10.3390/app14188561 - 23 Sep 2024
Viewed by 1481
Abstract
Theoretical mathematics is a key evolution factor of artificial intelligence (AI). Nowadays, representing a smart system as a mathematical model helps to analyze any system under development and supports different case studies found in real life. Additionally, the Markov chain has shown itself to be an invaluable tool for decision-making systems, natural language processing, and predictive modeling. In the Internet of Things (IoT), Machine-to-Machine (M2M) traffic necessitates new traffic models due to its unique patterns and different goals. In this context, we have two types of modeling: (1) source traffic modeling, used to design stochastic processes so that they match the behavior of physical quantities of measured data traffic (e.g., video, data, voice), and (2) aggregated traffic modeling, which refers to the process of combining multiple small packets into a single packet in order to reduce the header overhead in the network. In IoT studies, balancing model accuracy while managing a large number of M2M devices is a heavy challenge for academia. On the one hand, source traffic models are more competitive than aggregated traffic models because of their dependability; on the other hand, their complexity is expected to make managing the exponential growth of M2M devices difficult. In this paper, we propose a Markov-Modulated Poisson Process (MMPP) framework to explore the effects of heterogeneous Human-to-Human (H2H) and M2M traffic. As a tool for stochastic processes, we employ Markov chains to characterize the coexistence of H2H and M2M traffic. Using the traditional evolved Node B (eNodeB), our simulation results show that the network’s service completion rate suffers significantly: in the worst-case scenario, when an accumulative storm of M2M requests attempts to access the network simultaneously, the task completion rate drops to 8%. However, using our “Coexistence of Heterogeneous traffic Analyzer and Network Architecture for Long term evolution” (CHANAL) solution, we can achieve a service completion rate of 96%.
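
A two-state MMPP is easy to simulate directly: a hidden Markov chain switches the Poisson arrival rate, e.g. between a calm H2H regime and an M2M burst regime. The rates and sojourn times below are illustrative, not calibrated to the paper's scenarios.

```python
# Two-state Markov-Modulated Poisson Process: the hidden state picks the rate.
import numpy as np

rng = np.random.default_rng(0)
rates = [2.0, 50.0]   # arrivals/sec in state 0 (calm) and state 1 (burst)
switch = [0.1, 0.5]   # state-exit rates: mean sojourns of 10 s and 2 s

t, horizon, state = 0.0, 100.0, 0
arrivals = []
while t < horizon:
    dwell = rng.exponential(1.0 / switch[state])  # time spent in this state
    seg_end = min(t + dwell, horizon)
    # Conditional on the sojourn, arrivals are Poisson at the state's rate,
    # placed uniformly within the segment.
    n = rng.poisson(rates[state] * (seg_end - t))
    arrivals.extend(np.sort(rng.uniform(t, seg_end, n)))
    t, state = seg_end, 1 - state

print(f"{len(arrivals)} arrivals over {horizon:.0f}s "
      f"(pure state-0 traffic would average {rates[0] * horizon:.0f})")
```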

26 pages, 7340 KiB  
Article
Versatile Video Coding-Post Processing Feature Fusion: A Post-Processing Convolutional Neural Network with Progressive Feature Fusion for Efficient Video Enhancement
by Tanni Das, Xilong Liang and Kiho Choi
Appl. Sci. 2024, 14(18), 8276; https://doi.org/10.3390/app14188276 - 13 Sep 2024
Viewed by 1638
Abstract
Advanced video codecs such as High Efficiency Video Coding/H.265 (HEVC) and Versatile Video Coding/H.266 (VVC) are vital for streaming high-quality online video content, as they compress and transmit data efficiently. However, these codecs can occasionally degrade video quality by adding undesirable artifacts such as blockiness, blurriness, and ringing, which can detract from the viewer’s experience. To ensure a seamless and engaging video experience, it is essential to remove these artifacts, which improves viewer comfort and engagement. In this paper, we propose a deep feature fusion-based convolutional neural network (CNN) post-processing architecture (VVC-PPFF) to further enhance the performance of VVC. The proposed network, VVC-PPFF, harnesses the power of CNNs to enhance decoded frames, significantly improving the coding efficiency of the state-of-the-art VVC video coding standard. By combining deep features from early and later convolution layers, the network learns to extract both low-level and high-level features, resulting in more generalized outputs that adapt to different quantization parameter (QP) values. The proposed VVC-PPFF network achieves outstanding performance, with Bjøntegaard Delta Rate (BD-Rate) improvements of 5.81% and 6.98% for the luma component in random access (RA) and low-delay (LD) configurations, respectively, while also boosting peak signal-to-noise ratio (PSNR).
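
The early/late feature-fusion idea can be sketched in a few lines of PyTorch: concatenate shallow and deep feature maps before a residual reconstruction head. Layer counts and widths below are arbitrary; this is not the VVC-PPFF network itself.

```python
# Toy post-filter with early/deep feature fusion and a residual output.
import torch
import torch.nn as nn

class FusionPostFilter(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.early = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.deep = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        # Fusion: early (low-level) and deep (high-level) features, concatenated.
        self.head = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, x):
        e = self.early(x)
        d = self.deep(e)
        residual = self.head(torch.cat([e, d], dim=1))
        return x + residual  # enhancement residual over the decoded frame

frame = torch.rand(1, 1, 64, 64)  # stand-in for a decoded luma block
print(FusionPostFilter()(frame).shape)  # torch.Size([1, 1, 64, 64])
```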

19 pages, 10886 KiB  
Article
Advancing Nighttime Object Detection through Image Enhancement and Domain Adaptation
by Chenyuan Zhang and Deokwoo Lee
Appl. Sci. 2024, 14(18), 8109; https://doi.org/10.3390/app14188109 - 10 Sep 2024
Viewed by 1804
Abstract
Due to the lack of annotations for nighttime low-light images, object detection in low-light images has always been a challenging problem, and achieving high-precision results at night remains difficult. We aim to use a single nighttime dataset to perform knowledge distillation while improving the detection accuracy of object detection models under nighttime low-light conditions and reducing their computational cost, especially for small targets and objects contaminated by peculiar nighttime lighting. This paper proposes a Nighttime Unsupervised Domain Adaptation Network (NUDN) based on knowledge distillation to address these issues. To improve detection accuracy on nighttime images, high-confidence bounding box predictions from the teacher and region proposals from the student are first fused, allowing the teacher to perform better in subsequent training and generating a combination of high-confidence and low-confidence pseudo-labels. This combined feature information is used to guide model training, enabling the model to extract feature information from nighttime low-light images similar to that of source images. Nighttime images and pseudo-labels undergo random size transformations before being used as input for the student, enhancing the model’s generalization across different scales. To address the scarcity of nighttime datasets, we propose a nighttime-specific augmentation pipeline called LightImg. This pipeline enhances nighttime features, transforming them into daytime-like features and reducing issues such as backlighting, uneven illumination, and dim nighttime light, enabling cross-domain research using existing nighttime datasets. Our experimental results show that NUDN can significantly improve nighttime low-light object detection accuracy on the SHIFT and ExDark datasets. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and efficiency of our work.
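
The pseudo-label fusion step can be caricatured as confidence-based splitting of teacher predictions; the thresholds and box format below are assumptions, and NUDN's actual fusion with student region proposals is more involved.

```python
# Split teacher detections into high-confidence "hard" and softer pseudo-labels.
def split_pseudo_labels(teacher_boxes, hi_thresh=0.8, lo_thresh=0.3):
    """teacher_boxes: list of (x1, y1, x2, y2, score, cls) tuples."""
    high = [b for b in teacher_boxes if b[4] >= hi_thresh]
    soft = [b for b in teacher_boxes if lo_thresh <= b[4] < hi_thresh]
    return high, soft  # boxes below lo_thresh are discarded as noise

boxes = [(10, 10, 50, 60, 0.92, 0), (80, 20, 120, 90, 0.55, 1),
         (5, 5, 15, 15, 0.12, 0)]
hard, soft = split_pseudo_labels(boxes)
print(len(hard), "hard labels,", len(soft), "soft labels")
```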

13 pages, 5330 KiB  
Article
ISAR Imaging Analysis of Complex Aerial Targets Based on Deep Learning
by Yifeng Wang, Jiaxing Hao, Sen Yang and Hongmin Gao
Appl. Sci. 2024, 14(17), 7708; https://doi.org/10.3390/app14177708 - 31 Aug 2024
Viewed by 1512
Abstract
Traditional range–instantaneous Doppler (RID) methods for maneuvering target imaging are hindered by low resolution and inadequate noise suppression. To address this, we propose a novel ISAR imaging method enhanced by deep learning, which incorporates the fundamental architecture of CapsNet along with two additional convolutional layers. Pre-training is conducted through the deep learning network to establish the reference mapping function. Subsequently, the trained network is integrated into the electromagnetic simulation software Feko 2019, utilizing a combination of geometric forms such as corner reflectors and Luneberg spheres for analysis. The results indicate that the derived ISAR imagery effectively characterizes complex aerial targets. A thorough analysis of the imaging results further corroborates the effectiveness and superiority of this approach. Both simulation and empirical data demonstrate that the method significantly enhances imaging resolution and noise suppression.

20 pages, 5395 KiB  
Article
Detection and Segmentation of Mouth Region in Stereo Stream Using YOLOv6 and DeepLab v3+ Models for Computer-Aided Speech Diagnosis in Children
by Agata Sage and Pawel Badura
Appl. Sci. 2024, 14(16), 7146; https://doi.org/10.3390/app14167146 - 14 Aug 2024
Cited by 2 | Viewed by 1368
Abstract
This paper describes a multistage framework for face image analysis in computer-aided speech diagnosis and therapy. Multimodal data processing frameworks have become a significant factor in supporting the treatment of speech disorders. Synchronous and asynchronous remote speech therapy approaches can use audio and video analysis of articulation to deliver robust indicators of disordered speech. Accurate segmentation of articulators in video frames is a vital step in this agenda. We use a dedicated data acquisition system to capture the stereovision stream during speech therapy examination in children. Our goal is to detect and accurately segment four objects in the mouth area (lips, teeth, tongue, and whole mouth) during relaxed speech and speech therapy exercises. Our database contains 17,913 frames from 76 preschool children. We apply a sequence of procedures employing artificial intelligence. For detection, we train the YOLOv6 (you only look once) model to detect each of the three objects under consideration. Then, we prepare the DeepLab v3+ segmentation model in a semi-supervised training mode. As preparing reliable expert annotations for video is labor-intensive, we first train the network using weak labels produced by an initial segmentation based on distance-regularized level set evolution over fuzzified images. Next, we fine-tune the model using a portion of manual ground-truth delineations. Each stage is thoroughly assessed using an independent test subset. The lips are detected almost perfectly (average precision and F1 score of 0.999), whereas the segmentation Dice index exceeds 0.83 for each articulator, with a top result of 0.95 for the whole mouth.
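
The Dice index quoted in the results is straightforward to compute on binary masks; a self-contained check with a synthetic mask pair:

```python
# Dice index = 2|A ∩ B| / (|A| + |B|) for boolean segmentation masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt = np.zeros((64, 64), dtype=bool)
gt[20:40, 10:50] = True       # "mouth" region ground truth
pred = np.zeros_like(gt)
pred[22:40, 12:50] = True     # slightly shifted prediction
print(f"Dice: {dice(pred, gt):.3f}")
```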

15 pages, 2056 KiB  
Article
Robust DOA Estimation Using Multi-Scale Fusion Network with Attention Mask
by Yuting Yan and Qinghua Huang
Appl. Sci. 2024, 14(11), 4488; https://doi.org/10.3390/app14114488 - 24 May 2024
Cited by 1 | Viewed by 1235
Abstract
To overcome the limitations of traditional methods in reverberant and noisy environments, a robust multi-scale fusion neural network with an attention mask is designed to improve direction-of-arrival (DOA) estimation accuracy for acoustic sources. It combines the benefits of deep learning and complex-valued operations to effectively deal with the interference of reverberation and noise in speech signals. The unique properties of complex-valued signals are exploited to fully capture inherent features, and rich information is preserved in the complex field. An attention mask module is designed to generate distinct masks for selectively focusing on and masking parts of the input. After that, the multi-scale fusion block efficiently captures multi-scale spatial features by stacking complex-valued convolutional layers with small kernels, and reduces module complexity through special branching operations. Experimental results demonstrate that the model achieves significant improvements over other methods for speaker localization in reverberant and noisy environments. It provides a new solution for DOA estimation for acoustic sources in different scenarios, which has significant theoretical and practical implications.
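
Complex-valued convolutions are commonly built from two real convolutions via (a+ib)(w_r+iw_i) = (a·w_r − b·w_i) + i(a·w_i + b·w_r); the PyTorch sketch below shows that construction. It illustrates the general technique only, not this paper's network.

```python
# Complex-valued 2D convolution assembled from two real Conv2d layers.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, real, imag):
        out_r = self.conv_r(real) - self.conv_i(imag)
        out_i = self.conv_i(real) + self.conv_r(imag)
        return out_r, out_i

# e.g. real/imaginary parts of a multi-channel STFT from a microphone array
real = torch.rand(1, 4, 129, 50)
imag = torch.rand(1, 4, 129, 50)
out_r, out_i = ComplexConv2d(4, 16)(real, imag)
print(out_r.shape, out_i.shape)  # torch.Size([1, 16, 129, 50]) twice
```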

Review


49 pages, 3154 KiB  
Review
An Investigation into the Utilisation of CNN with LSTM for Video Deepfake Detection
by Sarah Tipper, Hany F. Atlam and Harjinder Singh Lallie
Appl. Sci. 2024, 14(21), 9754; https://doi.org/10.3390/app14219754 - 25 Oct 2024
Cited by 5 | Viewed by 4377
Abstract
Video deepfake detection has emerged as a critical field within the broader domain of digital technologies, driven by the rapid proliferation of AI-generated media and the increasing threat of its misuse for deception and misinformation. The integration of Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) has proven to be a promising approach for improving video deepfake detection, achieving near-perfect accuracy. CNNs enable the effective extraction of spatial features from video frames, such as facial textures and lighting, while the LSTM analyses temporal patterns, detecting inconsistencies over time. This hybrid model enhances the ability to detect deepfakes by combining spatial and temporal analysis. However, existing research lacks systematic evaluations that comprehensively assess the effectiveness and optimal configurations of these models. Therefore, this paper provides a comprehensive review of video deepfake detection techniques utilising hybrid CNN-LSTM models. It systematically investigates state-of-the-art techniques, highlighting common feature extraction approaches and widely used datasets for training and testing. This paper also evaluates model performance across different datasets, identifies key factors influencing detection accuracy, and explores how CNN-LSTM models can be optimised. It further compares CNN-LSTM models with non-LSTM approaches, addresses implementation challenges, and proposes solutions for them. Lastly, open issues and future research directions for video deepfake detection using CNN-LSTM are discussed. This paper provides valuable insights for researchers and cyber security professionals by reviewing CNN-LSTM models for video deepfake detection, contributing to the advancement of robust and effective deepfake detection systems.
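
The hybrid architecture the review surveys follows a common skeleton: a CNN embeds each frame, an LSTM aggregates across time, and a linear head classifies. A generic PyTorch sketch with illustrative sizes:

```python
# Generic CNN-LSTM video classifier skeleton (illustrative sizes only).
import torch
import torch.nn as nn

class CnnLstmDetector(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # logit: fake vs. real

    def forward(self, video):              # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h, _) = self.lstm(feats)       # temporal aggregation
        return self.head(h[-1]).squeeze(-1)

clip = torch.rand(2, 8, 3, 112, 112)  # 2 clips of 8 frames each
print(CnnLstmDetector()(clip).shape)  # torch.Size([2])
```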
