Special Issue "Recent Advances in Multimedia Signal Processing and Communications"

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: closed (31 May 2021).

Special Issue Editor

Prof. Dr. Konstantin Markov
Guest Editor
Human Interface Laboratory, Department of Computer and Information Systems, Graduate School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Fukushima 965-8580, Japan
Interests: speech processing; music information retrieval; natural language processing; machine learning; deep learning

Special Issue Information

Dear Colleagues,

The recent rapid increase in computing power and networking speed, coupled with the availability of computer storage facilities, has led to explosive growth in multimedia signal processing and communications, with new services emerging continuously, such as video conferencing, 360° video, augmented and virtual reality, immersive gaming, and multimedia human–computer interfaces, to name a few. These new and growing services and applications require reliable data storage, easy access to multimedia content, and high-speed delivery, all of which place higher demands on various areas of research, such as audio signal processing, image/video processing and analysis, communication protocols, content search, and watermarking.

In this Special Issue, we are particularly focused on recent developments in multimedia content processing and delivery over heterogeneous networks, as well as emerging multimedia applications.

Prof. Dr. Konstantin Markov
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Speech/music/audio processing
  • Image/video processing
  • Multimedia communications and networking
  • Internet of Things (IoT)-based multimedia systems and applications
  • Deep learning for multimedia
  • Multimedia big data analytics
  • Multimedia processing for health care
  • Multimedia systems for emerging applications

Published Papers (8 papers)


Research


Article
Multi-Scale Feature Fusion with Adaptive Weighting for Diabetic Retinopathy Severity Classification
Electronics 2021, 10(12), 1369; https://doi.org/10.3390/electronics10121369 - 08 Jun 2021
Viewed by 690
Abstract
Diabetic retinopathy (DR) is the prime cause of blindness in people who suffer from diabetes. Automating DR diagnosis could help many patients avoid the risk of blindness by identifying the disease and making judgments at an early stage. The main focus of the present work is to propose a feasible scheme for DR severity level detection with a MobileNetV3 backbone network, based on multi-scale features of the retinal fundus image, and to improve the classification performance of the model. Firstly, a special residual attention module, RCAM, was designed for multi-scale feature extraction from different convolution layers. Then, feature fusion by an innovative adaptive weighting operation was carried out in each layer. The corresponding weight of each convolution block is updated automatically during model training, and global average pooling (GAP) and a division process are further applied to avoid over-fitting of the model and to remove non-critical features. In addition, Focal Loss is used as the loss function because of the class imbalance of DR images. Experimental results on the Kaggle APTOS 2019 contest dataset show that the proposed method for DR severity classification achieves an accuracy of 85.32%, a kappa statistic of 77.26%, and an AUC of 0.97. Comparisons also indicate that the model outperforms existing models on this dataset.
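As a rough illustration of why Focal Loss helps with class imbalance, the NumPy sketch below computes the standard multi-class form, which scales each example's cross-entropy term by (1 − p_t)^γ so that well-classified (often majority-class) examples contribute less. The abstract does not give γ or any per-class α weights; the default γ = 2 and the optional `alpha` argument are assumptions, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """Mean multi-class focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    logits: (N, K) unnormalized class scores; labels: (N,) integer class ids.
    alpha: optional (K,) per-class weights (assumption); None = no weighting.
    """
    probs = softmax(logits)
    p_t = probs[np.arange(len(labels)), labels]  # probability of the true class
    p_t = np.clip(p_t, 1e-12, 1.0)               # avoid log(0)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    if alpha is not None:
        loss = loss * np.asarray(alpha)[labels]
    return loss.mean()
```

With `gamma=0` and no `alpha`, the function reduces to ordinary cross-entropy, which is a convenient sanity check.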
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
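The adaptive-weighting fusion described in the diabetic retinopathy abstract above can be sketched, under several assumptions, as a weighted sum of globally pooled multi-scale features. The abstract does not say how the learnable weights are normalized or what the "division process" is; softmax normalization, pooling before fusion, and equal channel counts across scales are all assumptions here, RCAM itself is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def global_average_pool(feature_map):
    # (C, H, W) -> (C,): average each channel over all spatial positions.
    return feature_map.mean(axis=(1, 2))

def adaptive_weighted_fusion(feature_maps, raw_weights):
    """Fuse multi-scale feature maps with softmax-normalized scalar weights.

    feature_maps: list of (C, H_i, W_i) arrays sharing the channel count C;
    spatial sizes may differ because GAP is applied before fusion.
    raw_weights: one scalar per scale. In training these would be learned by
    backpropagation; here they are simply fixed inputs.
    """
    pooled = np.stack([global_average_pool(f) for f in feature_maps])  # (S, C)
    raw = np.asarray(raw_weights, dtype=float)
    w = np.exp(raw - raw.max())
    w = w / w.sum()  # non-negative weights that sum to 1
    fused = (w[:, None] * pooled).sum(axis=0)  # (C,)
    return fused, w
```

Normalizing the weights keeps the fused feature on the same scale as its inputs no matter how the raw weights drift during training.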

Article
A Comparative Study of Image Descriptors in Recognizing Human Faces Supported by Distributed Platforms
Electronics 2021, 10(8), 915; https://doi.org/10.3390/electronics10080915 - 12 Apr 2021
Cited by 1 | Viewed by 496
Abstract
Face recognition is an emerging technology that has been used in many applications. It is the process of labeling pictures, particularly those containing human faces. One of the critical applications of face recognition is security monitoring, where captured images are compared to thousands, or even millions, of stored images. A problem arises when the captured images are corrupted by different types of noise. This paper contributes to the body of knowledge by proposing an innovative framework for face recognition based on various descriptors, including the following: Color and Edge Directivity Descriptor (CEDD), Fuzzy Color and Texture Histogram Descriptor (FCTH), Color Histogram, Color Layout, Edge Histogram, Gabor, Hashing CEDD, Joint Composite Descriptor (JCD), Joint Histogram, Luminance Layout, Opponent Histogram, Pyramid of Gradient Histograms Descriptor (PHOG), and Tamura. The proposed framework covers image set indexing and retrieval phases with multi-feature descriptors. The examined dataset contains 23,707 images of people of different genders and ages, ranging from 1 to 116 years old. The framework is examined extensively under different image filters, such as random noise, rotation, cropping, glow, inversion, and grayscale. The indexer’s performance is measured in a distributed environment as a function of sample size and the number of processors and threads. Moreover, image retrieval performance is measured using three criteria: rank, score, and accuracy. The implemented framework recognized the manipulated images with a high accuracy rate using different descriptors. Based on the outcomes, the proposed framework shows that image descriptors can be efficient in face recognition even when noise is added to the images.
The main findings are as follows: (a) the Edge Histogram is best suited to glow, grayscale, and inverted images; (b) the FCTH, Color Histogram, Color Layout, and Joint Histogram are best suited to cropped images; and (c) the CEDD is best suited to images with random noise and rotated images.
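The descriptors compared in this paper (CEDD, FCTH, and the others listed) are considerably more elaborate than a plain histogram, but the index-then-retrieve idea can be illustrated with a minimal sketch: compute a normalized joint RGB color histogram per image, then rank gallery images by L1 distance to the query. The bin count, the L1 distance, and the function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Joint RGB histogram of an (H, W, 3) uint8 image, normalized to sum to 1."""
    # Quantize each 0..255 channel value into bins_per_channel levels.
    q = (image.astype(np.int64) * bins_per_channel) // 256
    # Combine the three quantized channels into one joint bin index.
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

def rank_gallery(query, gallery):
    """Return gallery indices from most to least similar, plus the distances."""
    q = color_histogram(query)
    dists = [np.abs(q - color_histogram(g)).sum() for g in gallery]  # L1 distance
    return np.argsort(dists, kind="stable"), dists
```

In a real indexer the gallery histograms would be computed once and stored, so that only the query descriptor is computed at retrieval time.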
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)

Article
Speech Processing for Language Learning: A Practical Approach to Computer-Assisted Pronunciation Teaching
Electronics 2021, 10(3), 235; https://doi.org/10.3390/electronics10030235 - 20 Jan 2021
Viewed by 700
Abstract
This article contributes to the discourse on how contemporary computer and information technology may help improve foreign language learning, not only by supporting better and more flexible workflows and digitizing study materials but also by creating completely new use cases made possible by technological improvements in signal processing algorithms. We discuss an approach and propose a holistic solution for teaching the phonological phenomena that are crucial for correct pronunciation, such as the phonemes; the energy and duration of syllables and pauses, which construct the phrasal rhythm; and the tone movement within an utterance, i.e., the phrasal intonation. The working prototype of the StudyIntonation Computer-Assisted Pronunciation Training (CAPT) system is a tool for mobile devices that offers a set of tasks based on a “listen and repeat” approach and gives audio-visual feedback in real time. The present work summarizes the efforts taken to enrich the current version of this CAPT tool with two new functions: the phonetic transcription and the rhythmic patterns of model and learner speech. Both are built on the third-party automatic speech recognition (ASR) library Kaldi, which was incorporated into the StudyIntonation signal processing software core. We also examine the scope of ASR applicability within the CAPT system workflow and evaluate the Levenshtein distance between the transcriptions made by human experts and those obtained automatically by our code. We developed an algorithm for rhythm reconstruction using acoustic and language ASR models. It is also shown that even when learners produce phonemes sufficiently correctly, they do not produce correct phrasal rhythm and intonation; therefore, joint training of sounds, rhythm, and intonation within a single learning environment is beneficial.
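The Levenshtein distance used above to compare expert and automatic transcriptions is the minimum number of insertions, deletions, and substitutions needed to turn one sequence into the other. A minimal dynamic-programming sketch follows; it works on any sequence, so phoneme lists can be compared symbol by symbol. The function name and the example phoneme symbols are illustrative only, not taken from StudyIntonation.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any sequences, e.g. lists of phoneme symbols instead of characters.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match for free
        prev = curr
    return prev[-1]
```

For example, comparing a hypothetical expert transcription `["DH", "AH", "K", "AE", "T"]` with an automatic one `["D", "AH", "K", "AE", "T"]` gives a distance of 1 (one substitution). Keeping only two rows makes memory use linear in the length of `b`.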
To mitigate recording imperfections, voice activity detection (VAD) is applied to all processed speech recordings. Trials showed that StudyIntonation can create transcriptions and process rhythmic patterns, but some specific problems with connected speech transcription were detected. The learner feedback for pronunciation assessment was also updated: a conventional mechanism based on dynamic time warping (DTW) was combined with a cross-recurrence quantification analysis (CRQA) approach, which resulted in better discriminating ability. The CRQA metrics combined with those of DTW were shown to improve the accuracy of learner performance estimation. The major implications for computer-assisted English pronunciation teaching are discussed.
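The DTW-based comparison mentioned above aligns a learner's contour with the model speaker's contour while allowing local stretching in time. The sketch below is the classic DTW recurrence on two 1-D sequences (for instance, pitch tracks) with an absolute-difference cost; the choice of feature, cost, and step pattern are assumptions, since the abstract does not specify them.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW between two 1-D contours using an absolute-difference cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Allowed steps: advance x only, advance y only, or advance both.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]
```

A contour that differs from the model only in tempo, such as `[1, 2, 3]` versus `[1, 2, 2, 3]`, gets a DTW distance of 0, which is why DTW suits speech produced at different speeds.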
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
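The CRQA approach in the StudyIntonation abstract above starts from a cross-recurrence matrix that marks which time points of the learner's and the model's contours lie within a threshold ε of each other; measures such as the recurrence rate and determinism are then read off that matrix. The sketch below assumes 1-D contours without time-delay embedding, a fixed ε, and a minimum diagonal length of 2; the abstract states none of these parameters or which CRQA measures were used, and the function names are hypothetical.

```python
import numpy as np

def cross_recurrence_matrix(x, y, eps):
    """CR[i, j] = 1 if |x_i - y_j| < eps (1-D series, no embedding)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return (np.abs(x - y) < eps).astype(int)

def recurrence_rate(cr):
    """Fraction of (i, j) pairs that are recurrent."""
    return cr.mean()

def determinism(cr, lmin=2):
    """Share of recurrent points lying on diagonal lines of length >= lmin."""
    n, m = cr.shape
    on_lines = 0
    for k in range(-(n - 1), m):  # every diagonal of the n x m matrix
        run = 0
        for v in list(np.diagonal(cr, offset=k)) + [0]:  # trailing 0 closes the last run
            if v:
                run += 1
            else:
                if run >= lmin:
                    on_lines += run
                run = 0
    total = cr.sum()
    return on_lines / total if total else 0.0
```

Long diagonal lines indicate stretches where the two contours evolve in the same way, which is information a single DTW distance does not expose.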

Article
A GAN-Based Video Intra Coding
Electronics 2021, 10(2), 132; https://doi.org/10.3390/electronics10020132 - 09 Jan 2021
Cited by 1 | Viewed by 638
Abstract
Intra prediction is a vital part of the image/video coding framework and is designed to remove spatial redundancy within a picture. Because it relies on a set of predefined linear combinations, traditional intra prediction cannot cope well with coding blocks that have irregular textures. To address this drawback, we propose a Generative Adversarial Network (GAN)-based intra prediction approach that improves intra prediction accuracy. Specifically, owing to its strong non-linear fitting ability, the trained GAN generator acts as a mapping from the adjacent reconstructed signals to the prediction unit and is implemented in both the encoder and the decoder. Simulation results show that, for the All-Intra configuration, the proposed algorithm achieves an average BD-rate reduction of 1.6% for the luminance component compared with the video coding reference software HM-16.15 and outperforms previous similar works.
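The BD-rate figure quoted above summarizes the average bit-rate difference between two codecs at equal quality. The sketch below follows the original cubic-polynomial form of the Bjøntegaard metric: fit log(rate) as a cubic in PSNR for each codec, integrate both fits over the overlapping PSNR range, and convert the mean log difference to a percentage. Some later tools use piecewise cubic interpolation instead, which can give slightly different numbers; the function name is hypothetical, and four rate/PSNR points per codec are assumed.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) of test vs. anchor; negative means bit savings."""
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)
    lo = max(pa.min(), pt.min())  # overlapping PSNR interval
    hi = min(pa.max(), pt.max())
    if hi <= lo:
        raise ValueError("PSNR ranges do not overlap")
    # Fit log-rate as a cubic in PSNR; shifting PSNR by lo only improves the
    # numerical conditioning of the fit and does not change the result.
    fit_a = np.polyfit(pa - lo, np.log(rate_anchor), 3)
    fit_t = np.polyfit(pt - lo, np.log(rate_test), 3)
    int_a, int_t = np.polyint(fit_a), np.polyint(fit_t)
    width = hi - lo
    avg_a = (np.polyval(int_a, width) - np.polyval(int_a, 0.0)) / width
    avg_t = (np.polyval(int_t, width) - np.polyval(int_t, 0.0)) / width
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0
```

For instance, a test codec that needs exactly 90% of the anchor's bit rate at every PSNR point yields a BD-rate of −10%.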
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)

Article
Employing Subjective Tests and Deep Learning for Discovering the Relationship between Personality Types and Preferred Music Genres
Electronics 2020, 9(12), 2016; https://doi.org/10.3390/electronics9122016 - 28 Nov 2020
Cited by 2 | Viewed by 885
Abstract
The purpose of this research is two-fold: (a) to explore the relationship between listeners’ personality traits, i.e., extraversion and introversion, and their preferred music genres, and (b) to predict the personality trait of potential listeners on the basis of a musical excerpt by employing several classification algorithms. We assume that this may help match songs to the listener’s personality in social music networks. First, an Internet survey was built in which respondents identified themselves as extraverts or introverts according to the given definitions. Their task was to listen to music excerpts belonging to several music genres and choose the ones they liked. Next, the music samples were parameterized. Two parameterization schemes were employed for that purpose: low-level MIRtoolbox parameters (MIRTbx) and parameters extracted automatically by a variational autoencoder neural network. Personality type was predicted using four baseline algorithms: support vector machine (SVM), k-nearest neighbors (k-NN), random forest (RF), and naïve Bayes (NB). The best results were obtained with the SVM classifier. These analyses led to the conclusion that musical excerpt features derived from the autoencoder were, in general, more likely to carry useful information associated with the listeners’ personality than the low-level parameters derived from signal analysis. We also found that training the autoencoders on sets of musical pieces containing genres other than those employed in the subjective tests did not affect the accuracy of the classifiers predicting the personalities of the survey participants.
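Of the four baseline classifiers listed above, k-NN is the most compact to show. The sketch below predicts a label for each query feature vector by majority vote among its k nearest training vectors under Euclidean distance. In the study, the vectors would be MIRtoolbox or autoencoder features of excerpts; the toy 2-D data in the usage note, k = 3, and the function name are assumptions, and in practice features should be standardized first so that no single parameter dominates the distance.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query_x, k=3):
    """Predict labels by majority vote among the k nearest training vectors.

    train_x: (N, D) feature vectors; train_y: N labels such as
    "extravert" / "introvert"; query_x: (M, D) vectors to classify.
    """
    train_x = np.asarray(train_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    preds = []
    for q in query_x:
        dists = np.linalg.norm(train_x - q, axis=1)  # Euclidean distance to each sample
        nearest = np.argsort(dists, kind="stable")[:k]
        votes = Counter(train_y[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds
```

With two well-separated toy clusters labeled "introvert" and "extravert", a query near each cluster receives that cluster's label.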
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)

Article
End-to-End Noisy Speech Recognition Using Fourier and Hilbert Spectrum Features
Electronics 2020, 9(7), 1157; https://doi.org/10.3390/electronics9071157 - 17 Jul 2020
Cited by 1 | Viewed by 1037
Abstract
Despite the progress of deep neural networks over the last decade, state-of-the-art speech recognizers are still far from satisfactory performance in noisy environments. Methods to improve noise robustness usually add components to the recognition system that often need optimization. For this reason, data augmentation of the input features derived from the Short-Time Fourier Transform (STFT) has become a popular approach. However, for many speech processing tasks, there is evidence that combining STFT-based and Hilbert–Huang transform (HHT)-based features improves the overall performance. The Hilbert spectrum can be obtained using adaptive mode decomposition (AMD) techniques, which are noise-robust and suitable for non-linear and non-stationary signal analysis. In this study, we developed a DeepSpeech2-based recognition system that uses a combination of STFT- and HHT-spectrum-based features. We propose several ways to combine these features at different levels of the neural network. All evaluations were performed using the WSJ and CHiME-4 databases. Experimental results show that combining STFT and HHT spectra leads to a 5–7% relative improvement in noisy speech recognition.
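In the Hilbert–Huang transform, a signal is first split into intrinsic mode functions by an adaptive decomposition (for example, empirical mode decomposition), and the Hilbert transform of each mode then gives its instantaneous amplitude and frequency, from which the Hilbert spectrum is built. The sketch below covers only that second step for a single mode, using the FFT-based analytic-signal construction (the same construction `scipy.signal.hilbert` uses); the decomposition step is omitted, and the function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal x + j*H{x}: keep DC, double positive frequencies."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0       # Nyquist bin is kept once
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)

def instantaneous_amplitude_frequency(mode, fs):
    """Instantaneous amplitude and frequency (Hz) of one mode, e.g. one IMF."""
    z = analytic_signal(mode)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    frequency = np.diff(phase) * fs / (2.0 * np.pi)  # one value per sample step
    return amplitude, frequency
```

For a pure 50 Hz cosine of amplitude 0.5 sampled at 1 kHz over a whole number of cycles, the sketch returns an amplitude of 0.5 and a frequency of 50 Hz at every sample.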
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
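The noisy speech recognition paper above combines STFT- and HHT-based features at several levels of the network. The simplest level, concatenating both feature streams per frame at the input, can be sketched as follows. The 400-sample frames with a 160-sample hop (25 ms / 10 ms at an assumed 16 kHz sampling rate), the Hann window, and the log-magnitude scaling are assumptions; the HHT feature matrix is treated as a given input with the same frame count, and the function names are hypothetical.

```python
import numpy as np

def log_stft_magnitude(x, frame_len=400, hop=160):
    """Log-magnitude STFT frames, shape (T, frame_len // 2 + 1), Hann window.

    Assumes len(x) >= frame_len; 400/160 samples = 25 ms / 10 ms at 16 kHz.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop:t * hop + frame_len] * window
                       for t in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)  # avoid log(0)

def concat_features(stft_feats, hht_feats):
    """Input-level fusion: place per-frame STFT and HHT features side by side."""
    if stft_feats.shape[0] != hht_feats.shape[0]:
        raise ValueError("feature streams must have the same number of frames")
    return np.concatenate([stft_feats, hht_feats], axis=1)
```

One second of 16 kHz audio gives 98 frames of 201 STFT bins each; fusing them with a 64-dimensional per-frame HHT stream would give a (98, 265) input matrix. Fusion at deeper layers would instead merge the two streams after separate processing branches.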

Article
Individual Violin Recognition Method Combining Tonal and Nontonal Features
Electronics 2020, 9(6), 950; https://doi.org/10.3390/electronics9060950 - 08 Jun 2020
Viewed by 778
Abstract
Individual recognition among instruments of the same type is a challenging problem that has rarely been investigated. In this study, the individual recognition of violins is explored. Based on the source–filter model, the spectrum can be divided into tonal content and nontonal content, which reflect the timbre from complementary aspects. Tonal and nontonal gammatone frequency cepstral coefficients (GFCC) are combined to describe the corresponding spectrum contents. In the recognition system, a Gaussian mixture model–universal background model (GMM–UBM) is employed to parameterize the distribution of the combined features. To evaluate individual violin recognition, a solo dataset covering 86 violins was developed. Compared with other features, the combined features perform better in both individual violin recognition and violin grade classification. Experimental results also show that the GMM–UBM outperforms a CNN, especially when training data are limited. Finally, the effect of the player on individual violin recognition is investigated.
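A GMM–UBM system typically derives each instrument's model from a shared background GMM by MAP adaptation, then scores a test recording by the average log-likelihood ratio between the adapted model and the UBM. The sketch below assumes an already-trained diagonal-covariance UBM, adapts only the means, and uses a relevance factor of 16 (a common choice in speaker verification). The abstract does not state these details, GFCC extraction is not shown, and the function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Per-component log N(x_t | mu_k, diag(var_k)); returns shape (T, K)."""
    diff = x[:, None, :] - means[None, :, :]  # (T, K, D)
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def log_likelihood(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x under a diagonal GMM."""
    comp = np.log(weights) + log_gauss_diag(x, means, variances)  # (T, K)
    peak = comp.max(axis=1, keepdims=True)                        # log-sum-exp trick
    return np.mean(peak[:, 0] + np.log(np.exp(comp - peak).sum(axis=1)))

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """MAP-adapt UBM means to enrollment frames x (weights, variances unchanged)."""
    comp = np.log(weights) + log_gauss_diag(x, means, variances)
    post = np.exp(comp - comp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities (T, K)
    n_k = post.sum(axis=0)                                       # soft frame counts
    ex_k = post.T @ x / np.maximum(n_k, 1e-10)[:, None]          # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]                   # data-dependent mixing
    return alpha * ex_k + (1 - alpha) * means

def llr_score(x, weights, adapted_means, ubm_means, variances):
    """Mean log-likelihood ratio of x: adapted instrument model vs. UBM."""
    return (log_likelihood(x, weights, adapted_means, variances)
            - log_likelihood(x, weights, ubm_means, variances))
```

Components that see few enrollment frames keep means close to the UBM, which is one reason this scheme behaves well when training data per instrument are limited.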
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)

Review


Review
The Impact of State-of-the-Art Techniques for Lossless Still Image Compression
Electronics 2021, 10(3), 360; https://doi.org/10.3390/electronics10030360 - 02 Feb 2021
Cited by 1 | Viewed by 880
Abstract
Owing to advances in telecommunication, a great deal of information is produced daily, and storing it on digital devices or transmitting it over the Internet is challenging. Data compression is essential for managing this information well. Therefore, data compression has become a topic of great interest to researchers, and the number of applications in this area is increasing. Over the last few decades, international organisations have developed many strategies for data compression, yet no single algorithm works well on all types of data. The compression ratio and the encoding and decoding times are the main criteria used to evaluate an algorithm for lossless image compression. However, while the compression ratio matters most for some applications, others may require higher encoding or decoding speed (or both), and in others all three parameters may be equally important. The main aim of this article is to analyse the most advanced lossless image compression algorithms from each of these points of view and to evaluate the strength of each algorithm for each kind of image. We develop a technique for evaluating an image compression algorithm based on more than one parameter. The findings presented in this paper may be helpful to new researchers and to users in this area.
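The three evaluation parameters named above (compression ratio, encoding time, and decoding time) can be measured for any lossless codec in the same way. The sketch below uses Python's built-in `zlib` (DEFLATE) purely as a stand-in: it is a general-purpose codec, not one of the image-specific algorithms the review analyses, and the function name is hypothetical. It also checks the round trip, since a lossless codec must reproduce the input exactly.

```python
import time
import zlib

def evaluate_codec(data, level=9):
    """Measure compression ratio and encode/decode times for zlib (DEFLATE)."""
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(compressed)
    t2 = time.perf_counter()
    if restored != data:
        raise RuntimeError("round trip failed: codec is not lossless")
    return {
        "ratio": len(data) / len(compressed),  # original size / compressed size
        "encode_s": t1 - t0,
        "decode_s": t2 - t1,
    }
```

A single timing run is noisy, so in practice each measurement would be repeated and averaged over many images of each kind before algorithms are compared on speed.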
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
