Article
Peer-Review Record

Deep Audio Features and Self-Supervised Learning for Early Diagnosis of Neonatal Diseases: Sepsis and Respiratory Distress Syndrome Classification from Infant Cry Signals

Electronics 2025, 14(2), 248; https://doi.org/10.3390/electronics14020248
by Somaye Valizade Shayegh * and Chakib Tadj
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 7 December 2024 / Revised: 24 December 2024 / Accepted: 6 January 2025 / Published: 9 January 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. Rephrase complex phrases to ensure the abstract is clear and accessible to a broader audience.

2. Clearly explain how the 2.5x increase in dataset size contributes to improved model robustness and reliability.

3. Provide a more detailed explanation of how balancing the number of newborn samples in each class affects the model's ability to generalize across unseen data.

4. Elaborate on the methodology for filtering out short-duration cry samples and justify why 40 ms was chosen as the threshold.

5. Offer a clear rationale for the model's difficulty in detecting Sepsis, considering potential factors such as sample duration, data imbalance, or subtle acoustic patterns.

6. Improve figure annotations, particularly in Figure 4, to make learning rate adjustments and their impact on model performance more intuitive for readers.

Author Response

Comment 1: Rephrase complex phrases to ensure the abstract is clear
and accessible to a broader audience.
Response 1:
We agree with the comment and have revised the abstract to improve
clarity and accessibility, as suggested. The updated text is located in the
Abstract section, page 1, lines 4–17 of the manuscript and has been
highlighted in blue for ease of reference.
Revised Abstract: Neonatal mortality remains a critical global chal-
lenge, particularly in resource-limited settings with restricted access to ad-
vanced diagnostic tools. Early detection of life-threatening conditions like
sepsis and respiratory distress syndrome (RDS), which significantly con-
tribute to neonatal deaths, is crucial for timely interventions and improved
survival rates. This study investigates the use of newborn cry sounds, specif-
ically the expiratory segments (the most informative parts of cry signals)
as non-invasive biomarkers for early disease diagnosis. We utilized an ex-
panded and balanced cry dataset, applying self-supervised learning (SSL)
models—wav2vec 2.0, WavLM, and HuBERT—to extract feature represen-
tations directly from raw cry audio signals. This eliminates the need for man-
ual feature extraction while effectively capturing complex patterns associated
with sepsis and RDS. A classifier consisting of a single fully connected layer
was placed on top of the SSL models to classify newborns into Healthy, Sepsis,
or RDS groups. We fine-tuned the SSL models and classifiers by optimizing
hyperparameters using two learning rate strategies: linear and annealing. Re-
sults demonstrate that the annealing strategy consistently outperformed the
linear strategy, with wav2vec 2.0 achieving the highest accuracy of approx-
imately 90% (89.76%). These findings highlight the potential of integrating
this method into Newborn Cry Diagnosis Systems (NCDSs). Such systems
could assist medical staff in identifying critically ill newborns, prioritizing
care, and improving neonatal outcomes through timely interventions.
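For illustration only, the short Python sketch below shows one common form of annealing learning-rate schedule (cosine annealing in PyTorch), as a generic example of the strategy mentioned in the abstract; it is not necessarily the exact schedule used in our experiments.

import torch

# Placeholder classifier head standing in for the SSL encoder + linear classifier.
model = torch.nn.Linear(768, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Cosine annealing: the learning rate decays smoothly from 1e-4 toward zero.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... one training epoch over the cry segments would run here ...
    optimizer.step()
    scheduler.step()  # anneal the learning rate after each epoch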
Comment 2: Clearly explain how the 2.5x increase in dataset size con-
tributes to improved model robustness and reliability.
Response 2:
We agree with the comment, and to address the reviewer's concern, we
have elaborated on this explanation in the manuscript. The updated text
is located in the Materials and Methods/Dataset Overview/Data
Utilization, Page 7, Lines 298-307 of the manuscript and has been high-
lighted in blue for ease of reference.
Revised Sections in the Manuscript: This sample count represents a
2.5-fold increase in sample size compared to previous works [8, 9, 13-15, 29,
30], offering a wider range of infant cries that capture more diverse acoustic
characteristics, noise levels, and clinical contexts. Such variety reduces the
risk of overfitting and encourages the model to learn robust, generalizable
representations, ultimately making it more adaptable to novel, real-world
inputs. In addition, by including equal numbers of newborns per class (RDS,
Healthy, and Sepsis), the dataset not only becomes larger but also more
balanced. This balanced composition reduces class bias, prompting the model
to learn features that are broadly applicable rather than relying on patterns
from a dominant category. Consequently, the model is better equipped to
handle diverse clinical presentations, ultimately improving its generalization
and ensuring more consistent, reliable performance in real-world neonatal
care scenarios.
Comment 3: Provide a more detailed explanation of how balancing
the number of newborn samples in each class affects the model’s ability to
generalize across unseen data.
Response 3:
We appreciate the reviewer’s feedback, and in response, we have expanded
our explanation in the manuscript. The revised text can be found in the
Materials and Methods/Dataset Overview/Data Utilization, page
7, lines 298–307 section and has been highlighted in blue for easy reference.
Revised Sections in the Manuscript: This sample count represents a
2.5-fold increase in sample size compared to previous works [8, 9, 13-15, 29,
30], offering a wider range of infant cries that capture more diverse acoustic
characteristics, noise levels, and clinical contexts. Such variety reduces the
risk of overfitting and encourages the model to learn robust, generalizable
representations, ultimately making it more adaptable to novel, real-world
inputs. In addition, by including equal numbers of newborns per class (RDS,
Healthy, and Sepsis), the dataset not only becomes larger but also more
balanced. This balanced composition reduces class bias, prompting the model
to learn features that are broadly applicable rather than relying on patterns
from a dominant category. Consequently, the model is better equipped to
handle diverse clinical presentations, ultimately improving its generalization
and ensuring more consistent, reliable performance in real-world neonatal
care scenarios.
Comment 4: Elaborate on the methodology for filtering out short-
duration cry samples and justify why 40ms was chosen as the threshold.
Response 4:
We thank the reviewer for this insightful comment. In our dataset, fewer
than five samples had durations shorter than 40 ms, and these were sys-
tematically excluded during preprocessing to ensure robustness and consis-
tency. Cry segments shorter than 40 ms (less than two frame lengths) may
lack meaningful information after being processed by the model’s convolu-
tional layers. The raw audio is downsampled by a factor of 320 through
these layers, resulting in a frame rate of 50 frames per second (20 ms per
frame). By excluding such short segments, we ensure that all inputs retain
sufficient information for effective learning after downsampling. Retaining
these segments, although minimal in number, could hinder the model’s abil-
ity to detect meaningful patterns. To address the reviewer’s concern, we have
clarified and elaborated on this explanation in the following sections of the
manuscript. The updated text is located in:
• Materials and Methods/Dataset Overview/Data Utilization,
Page 7, Lines 290-295
• Materials and Methods/Fine-Tuning Self-Supervised Learn-
ing Models/Fine-tuning Transformer-based encoder, Page 10-
11, Lines 455-470
These revisions are also highlighted in blue in the manuscript for ease of
reference.
Revised Sections in the Manuscript:
• 1. Materials and Methods/Dataset Overview/Data Utilization: To en-
sure robustness and consistency, cry samples shorter than 40 ms were
systematically excluded during preprocessing. In our dataset, fewer
than five samples fell below this threshold, making the impact of this
exclusion minimal. This threshold aligns with the model’s frame rate
of 50 frames per second (20 ms per frame), ensuring that all inputs
retain sufficient information for effective learning. Further details are
provided in the Fine-Tuning Process section.
• 2. Materials and Methods/Fine-Tuning Self-Supervised Learning Models/Fine-
tuning Transformer-based encoder: In the models, the raw audio wave-
form is heavily downsampled through multiple convolutional layers. For
example, in wav2vec 2.0, a 4.5-second cry segment sampled at 16 kHz
(72,000 samples) is first processed by a series of convolutional layers.
The initial layer reduces the input by a factor of 5, transforming the
single-channel audio signal into 512-dimensional feature vectors. Each
of the subsequent six convolutional layers further downsamples the sig-
nal by a factor of 2, maintaining the 512-dimensional representation at
each stage. This progressive downsampling results in an overall down-
sampling factor of 320, thereby reducing the 72,000 samples to 225
feature frames; we have a frame rate of 50 frames per second (20 ms
per frame). Thus, the number of feature frames can be determined
by considering both the downsampling factor and the duration of the
signal. Each of these 225 frames is a 512-dimensional vector that encap-
sulates high-level audio features extracted from the raw signal. EXP
segments shorter than 40 ms (less than two feature frame lengths) lack
sufficient information after downsampling, making it difficult for the
model to identify meaningful patterns. Although such segments were
rare (fewer than five in the dataset), their exclusion ensures that the
remaining inputs are sufficiently informative, enhancing the robustness
and reliability of the classification process.
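To make the frame-count arithmetic above concrete, the following short Python sketch reproduces it, assuming only the overall downsampling factor of 320 at a 16 kHz input rate; the exact frame count additionally depends on the convolutional kernel sizes, so this is an approximation.

# Approximate frame-count arithmetic for the convolutional feature extractor
# described above (overall downsampling factor 320 at 16 kHz, i.e., 20 ms per frame).

SAMPLE_RATE = 16_000        # Hz, after resampling
DOWNSAMPLING_FACTOR = 320   # product of the seven convolutional strides (5 * 2**6)

def num_feature_frames(duration_s: float) -> int:
    """Approximate number of 512-dimensional feature frames for a cry segment."""
    return int(duration_s * SAMPLE_RATE) // DOWNSAMPLING_FACTOR

print(num_feature_frames(4.500))  # 225 frames for a 4.5 s EXP segment
print(num_feature_frames(0.040))  # 2 frames -> the 40 ms inclusion threshold
print(num_feature_frames(0.030))  # 1 frame  -> too short to be retained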
Comment 5: Offer a clear rationale for the model’s difficulty in detecting
Sepsis, considering potential factors such as sample duration, data imbalance,
or subtle acoustic patterns.
Response 5: We appreciate the valuable comment. To address it, we re-
peated several experiments to provide a well-supported answer. The revised
text can be found in the Discussion, pages 20–21, lines 717–719, 730–732,
737–743, 762–763, and 764, and has been highlighted in blue for easy reference.
Revised Sections in the Manuscript: This study underscores the ef-
ficacy of the annealing learning rate strategy, which consistently surpassed
the linear approach across all three models—wav2vec 2.0, WavLM Base+,
and HuBERT—reaching a maximum accuracy of approximately 90% with
wav2vec 2.0. By incorporating dataset expansion, self-supervised learning
models, and the annealing LR strategy, the proposed approach shows strong
potential for practical applications in neonatal disease detection. Such ad-
vancements are particularly important for NCDS, where accurate and timely
detection is crucial for improving infant health outcomes. Despite the demon-
strated advances, a consistent challenge across most experiments, irrespec-
tive of the learning rate strategy, lies in the slightly lower recall for Sepsis
compared to Healthy and RDS. This underscores the persistent challenge
of accurately detecting Sepsis, driven by (1) its higher proportion of short-
duration samples and inherent class imbalance, and (2) the subtle, complex
patterns that set Sepsis apart from other classes. While the proposed ap-
proach improves overall performance, further research is necessary to address
these complexities in Sepsis classification, which remains notably more chal-
lenging than distinguishing either RDS or Healthy cases. In data processing,
as outlined in the Data Utilization section, we excluded samples under 40
ms to align with our approach’s frame rate of 50 frames per second (20 mil-
liseconds per frame). Notably, Sepsis samples include a significantly higher
proportion of short-duration segments, with 20 out of 2,799 samples falling
between 40-60 ms, compared to only 3 in Healthy and 1 in RDS. Although the
total duration for each class remains comparable—1982.90 seconds for Sep-
sis, 1983.77 seconds for RDS, and 1961.35 seconds for Healthy—the higher
frequency of short segments in Sepsis may limit the model’s capacity to iden-
tify intricate trends essential for reliable classification. Furthermore, Sepsis
presents inherently complex and subtle patterns, rendering it more challeng-
ing to distinguish than the other two classes.
To further understand these limitations, we conducted binary classifi-
cation experiments using our proposed framework. First, we focused on the
Sepsis and Healthy classes while testing the impact of raising the minimum
duration threshold above 40 ms. The results suggested that shorter segments
often lack the level of detail required for robust classification, a shortcoming
exacerbated by the class imbalance in Sepsis. Next, we ran two additional
binary scenarios in which we separated Sepsis and Healthy in one case, and
RDS from Healthy in another. After hyperparameter tuning, our approach
achieved slightly better performance when distinguishing RDS from Healthy
than when classifying Sepsis, underscoring the persistent challenge of accu-
rately detecting Sepsis. Ultimately, two main factors hinder Sepsis recogni-
tion: (1) the elevated proportion of short samples, and (2) the subtle, complex
nature of Sepsis itself. Our study expands upon previous works [33, 34] that
classified infant cries into Sepsis, RDS, and Healthy categories using smaller
subsets of the dataset with fewer EXP segments and an uneven distribution
of newborns across classes. These prior studies used 1,132 and 1,300 samples
per class, respectively, compared to our balanced and comprehensive dataset
of 2,799 samples per class, derived from 17 infants per category. Unlike
these studies, which relied on feature extraction and combination techniques
before classification, our approach processes the raw waveform of EXP seg-
ments without explicit feature extraction. Notably, [?] excluded all segments
shorter than 200 ms, arguing they lack sufficient information for cry analysis.
Similarly, [?] stated that
samples less than 17 s were excluded as they were noninformative
recordings that may have disturbed the training process.
By contrast, our inclusion of samples as short as 40 ms expands the dataset’s
size and variety, enhancing the model’s generalization to real-world scenarios
where infant cries may occur in brief bursts. This broader inclusion criterion
likely increases data variability, making the model more adaptable to diverse
practical applications.
While our approach demonstrates strong performance, particularly through
the use of self-supervised learning models, an annealing learning rate strat-
egy, and an expanded dataset, the reduced accuracy in detecting Sepsis
underscores the need for more in-depth research to address challenges posed
by shorter-duration samples, class imbalance, and the inherently complex
patterns of Sepsis.
Future efforts could explore advanced signal processing techniques to en-
hance features in short samples and complex phenomena, develop models
specifically optimized for imbalanced and variable-length data, and incor-
porate stratified k-fold cross-validation to ensure robust evaluation and fair
representation of all classes. By refining feature representation and tailoring
models to these specific challenges, future research can build on the strong
foundation established in this study to further enhance diagnostic accuracy.
In conclusion, this study demonstrates the potential of combining annealing
learning rates, self-supervised learning models, and diverse data inclusion
to advance infant cry-based disease classification. While Sepsis detection
remains more challenging compared to RDS and Healthy, the approach es-
tablishes a strong foundation for further advancements in neonatal healthcare
applications. By addressing current limitations, such as class imbalance and
the complexity of short-duration samples, future research can build on these
findings to enhance diagnostic accuracy and improve outcomes in neonatal
care.
Comment 6: Improve figure annotations, particularly in Figure 4, to
make learning rate adjustments and their impact on model performance more
intuitive for readers.
Response 6: We agree with the comment and have improved the annotations in Figures
4 and 6 to make them more intuitive for readers. These updates can be found
specifically in the Experimental Results, page 14, Figure 4 and the
Experimental Results, page 17, Figure 6 sections of the manuscript,
and they have been highlighted in blue for ease of reference.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript addresses an important problem of classifying pathologically induced cries. The three-class classification has been done using audio signals. Although it lacks novelty in methods, it makes good application of the models in a very important area. There are some improvements required in the manuscript before it can be recommended for publication. I have the following comments and suggestions:

  1. Firstly, dataset building has been listed as one of the two contributions (Line 193), which does not appear relevant, as the dataset is not made publicly available (line 40, line 729). Therefore, expansion of the dataset (line 194), which is not available for further use by the wider research community, cannot be considered a key contribution.
  2. (Section 3.1.1) The dataset collection in clinical settings is likely to be subject to ethical approval. In the data description, it is worth mentioning whether approval from the institutions was obtained and how parental consent was taken.
  3. (Section 3.1.1) Data acquisition lacks sufficient detail on the annotation process, and the data collection protocol is vaguely described. Were all the samples annotated based on the presence of the pathological conditions in the patients, or were they further examined by experts? How many experts were involved in the annotation, and how was consensus obtained? Were there any inter-rater variations?
  4. (Line 299) Although the overall dataset has been made perfectly balanced, how was the data balance ensured during the splits? Was each split (train, validation, test) also balanced? What was the splitting strategy, and were the experiments subject-dependent or subject-independent? Further information on the data split is required, as splitting on samples instead of subjects could potentially cause data leakage into the test set.
  5. The description of the overall architecture is vague; the flow in Figure 2 and Figure 3 requires improved explanation. For reproducibility, it would be great to distinguish between the feature extraction steps and the classification steps. As the architecture does not appear to be an end-to-end model with raw signal input, the overall architecture is not clear.
  6. Given the several methods in the literature review section, it would be great to see a comparison of results. Besides the per-class evaluation metrics, it would be great to see the overall F1 score.

Overall, the manuscript is good and the results demonstrate the selected models worked well on this dataset.

Author Response

Comment 1: Firstly, dataset building has been listed as one of the two
contributions (Line 193), which does not appear relevant, as the dataset is
not made publicly available (line 40, line 729). Therefore, expansion of the
dataset (line 194), which is not available for further use by the wider research
community, cannot be considered a key contribution.
Response 1:
Thank you for pointing out the concern about the dataset expansion being
labeled as a “contribution.” We have revised the manuscript to clarify that
the expanded dataset is not openly available, primarily due to institutional
and privacy restrictions. To address your feedback, we have labeled it as a
development. The updated text is located in the Literature Review, page
4, lines 183–184 of the manuscript and has been highlighted in blue for
ease of reference.
Revised Sections in the Manuscript: This paper presents one devel-
opment and one key contribution. The Development is twofold: (1) we have
expanded the cry audio samples by approximately 2.5 times, representing a
significant increase compared to earlier works in our lab [8, 9, 13-15, 29, 30],
thereby strengthening the robustness of our findings, and (2) we have ad-
dressed the issue of biased data in prior studies. Although previous research
in our lab used equal samples for each class, the distribution of newborns
across classes was uneven. For the first time, we have included cry signals
from 17 newborns in each class—RDS, Healthy, and Sepsis—ensuring a more
balanced and unbiased dataset.
Comment 2: (Section 3.1.1) The dataset collection in clinical settings
is likely to be subject to ethical approval. In the data description, it is worth
mentioning whether approval from the institutions was obtained and how
parental consent was taken.
Response 2:
We appreciate the reviewer’s comment, and in response, we have added
our explanation to the manuscript. The added text can be found in Mate-
rials and Methods/Dataset Overview/Data Collection Procedure,
Page 5, Lines 211-214. Moreover, a statement that the study was conducted
according to the guidelines of the Declaration of Helsinki and approved by
the Ethics Committee of École de Technologie Supérieure (protocol code
H20100401) has been added to the Institutional Review Board Statement.
Both parts have been highlighted in blue for easy reference.
Added explanation in the Manuscript: The study utilized data col-
laboratively established with Al-Raee and Al-Sahel hospitals in Lebanon,
as well as Saint Justine Hospital in Montreal, Canada. In line with the
granted ethical approval, informed consent was obtained from each newborn’s
guardian prior to recording, including details on the study’s aims, the nature
of the cry recordings, and their intended use. Cry audio signals (CASs) were
captured using a 2-channel WS-650M Olympus digital voice recorder at a
sampling frequency of 44.1 kHz and 16-bit resolution.
Comment 3: (Section 3.1.1) Data acquisition lacks sufficient detail on
the annotation process, and the data collection protocol is vaguely described.
Were all the samples annotated based on the presence of the pathological
conditions in the patients, or were they further examined by experts? How
many experts were involved in the annotation, and how was consensus ob-
tained? Were there any inter-rater variations?
Response 3:
Thank you for your feedback. Below, we provide clarifications to ad-
dress these points. Firstly, cry recordings were labeled by medical staff, and
explanations about this process can be found in the manuscript in Mate-
rials and Methods/Dataset Overview/Data Collection Procedure,
page 5, line 225. Second, each labeled cry recording is annotated
with 13 different labels, including EXP, INSV, and others. Explanations
about this process can be found in the manuscript in Materials and Meth-
ods/Dataset Overview/Data Pre-processing/Segmentation, page 5,
lines 264–274.
Comment 4: (Line 299) Although the overall dataset has been made
perfectly balanced, how was the data balance ensured during the splits?
Was each split (train, validation, test) also balanced? What was the splitting
strategy, and were the experiments subject-dependent or subject-independent?
Further information on the data split is required, as splitting on samples
instead of subjects could potentially cause data leakage into the test set.
Response 4:
We thank the reviewer for the comment, and to address the reviewer’s con-
cern, we have elaborated on this explanation in the manuscript. The updated
text is located in the Materials and Methods/Dataset Overview/Data
Utilization, page 7, lines 307–310 of the manuscript and has been high-
lighted in blue for ease of reference.
Revised Sections in the Manuscript: Finally, samples were selected
entirely at random, with no predetermined criteria, to create training, vali-
dation, and testing datasets in a 70%, 15%, and 15% split, ensuring the
proposed NCDS remains unbiased regarding race, reason for crying, origin,
age, and gender.
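For illustration only (this is not our exact implementation), a 70%/15%/15% random split of EXP segments could be produced as follows:

from sklearn.model_selection import train_test_split

def split_segments(segments, labels, seed=42):
    # Hold out 30% of the samples, then split that portion half-and-half
    # into validation and test sets (15% of the total each).
    x_train, x_rest, y_train, y_rest = train_test_split(
        segments, labels, test_size=0.30, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.50, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)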
Comment 5: The description of the overall architecture is vague; the
flow in Figure 2 and Figure 3 requires improved explanation. For reproducibil-
ity, it would be great to distinguish between the feature extraction steps and
the classification steps. As the architecture does not appear to be an end-to-end
model with raw signal input, the overall architecture is not clear.
Response 5:
Thank you for your insightful feedback. We have revised the description
of our model architecture to provide a clearer and more detailed explanation,
explicitly emphasizing the end-to-end nature of our approach. Additionally,
Figure 2 depicts the self-supervised training process, illustrating how raw
audio is used to pre-train the Transformer-based encoders. Figure 3 illus-
trates the fine-tuning process, highlighting how the pre-trained encoders are
integrated with the classifier to perform the classification task. We have also
enhanced the caption for Figure 3 to provide a more comprehensive expla-
nation. The newly added and revised text can be found in the Materials
and Methods/Fine-Tuning Self-Supervised Learning Models/Fine-
tuning Transformer-based encoder, pages 10-12, lines 455–504, and
has been highlighted in blue for easy reference.
Revised and Added explanation in the Manuscript: In the models,
the raw audio waveform is heavily downsampled through multiple convolu-
tional layers. For example, in wav2vec 2.0, a 4.5-second cry segment sampled
at 16 kHz (72,000 samples) is first processed by a series of convolutional lay-
ers. The initial layer reduces the input by a factor of 5, transforming the
single-channel audio signal into 512-dimensional feature vectors. Each of the
subsequent six convolutional layers further downsamples the signal by a fac-
tor of 2, maintaining the 512-dimensional representation at each stage. This
progressive downsampling results in an overall downsampling factor of 320,
thereby reducing the 72,000 samples to 225 feature frames; we have a frame
rate of 50 frames per second (20 ms per frame). Thus, the number of feature
frames can be determined by considering both the downsampling factor and
the duration of the signal. Each of these 225 frames is a 512-dimensional vec-
tor that encapsulates high-level audio features extracted from the raw signal.
EXP segments shorter than 40 ms (less than two feature frame lengths) lack
sufficient information after downsampling, making it difficult for the model
to identify meaningful patterns. Although such segments were rare (fewer
than five in the dataset), their exclusion ensures that the remaining inputs
are sufficiently informative, enhancing the robustness and reliability of the
classification process.
After convolutional downsampling, the output from the convolutional lay-
ers, consisting of a sequence of 512-dimensional vectors, is projected into a
768-dimensional space through a linear transformation. This projection is
crucial for aligning the feature dimensions with the input requirements of
the Transformer-based encoder. The encoder itself comprises 12 stacked
Transformer layers, each meticulously designed to capture and model com-
plex temporal dependencies and intricate patterns within the audio signal.
At the heart of each Transformer layer lies the multi-head self-attention
mechanism. This component enables the model to weigh the significance of
different frames relative to one another within the sequence. Specifically, the
768-dimensional input vectors are linearly projected into three distinct spaces
to generate queries, keys, and values. Multiple attention heads operate in
parallel, each focusing on different aspects of the input, thereby allowing the
model to capture diverse relationships and interactions across the temporal
sequence. The outputs from all attention heads are then concatenated and
linearly transformed to maintain the original dimensionality.
Following the self-attention sub-layer, each Transformer layer incorpo-
rates a two-layer fully connected feed-forward network (FFN). The FFN in-
troduces non-linearity and enhances the model’s capacity to learn complex
feature transformations. The first linear layer expands the dimensionality
from 768 to 3072 and applies the Gaussian Error Linear Unit (GELU) acti-
vation function to introduce non-linearity. The second linear layer projects
the dimensionality back to 768, ensuring consistency across layers. Stacking
12 such Transformer layers enables the encoder to progressively refine the
feature representations, capturing both local and global dependencies within
the audio signal. Upon passing through all Transformer layers, the encoder
produces a sequence of 768-dimensional vectors, each corresponding to a 20
ms frame of the input audio.
To transition from these frame-level representations to a comprehensive
summary of the entire audio segment, a mean pooling layer is applied across
the temporal dimension. This pooling operation averages all 768-dimensional
vectors within each EXP segment, producing a single 768-dimensional vector
that encapsulates the overall characteristics of the input segment, regardless
of its duration. The aggregated vector is then passed through a one-layer fully
connected classifier, consisting of a single linear layer, which maps the 768-
dimensional input to three output logits corresponding to the pathological
classes: Healthy, Sepsis, and RDS. A softmax activation function is applied
to these logits to produce class probabilities, enabling the model to assign
the input audio segment to the most probable pathological category.
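To further aid reproducibility, the following simplified sketch restates the described pipeline in code; it is illustrative rather than our exact implementation and assumes the Hugging Face transformers release of wav2vec 2.0 (the WavLM and HuBERT variants follow the same pattern).

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class CryClassifier(nn.Module):
    """Pre-trained SSL encoder + mean pooling + single linear classification layer."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)  # 768 -> 3

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), raw 16 kHz audio of EXP segments
        frames = self.encoder(waveform).last_hidden_state  # (batch, n_frames, 768)
        pooled = frames.mean(dim=1)                        # mean pooling over time
        return self.classifier(pooled)                     # logits for Healthy, Sepsis, RDS

logits = CryClassifier()(torch.randn(2, 72_000))  # two 4.5 s segments of dummy audio
probs = torch.softmax(logits, dim=-1)             # class probabilities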
Comment 6: Given the several methods in the literature review section,
it would be great to see a comparison of results. Besides the per-class evalu-
ation metrics, it would be great to see the overall F1 score.
Response 6:
We appreciate the comment. To address this concern, we have added
Table 8, which compares the properties of our approach with two other methods that
classify the same categories using different proportions of our dataset and
methodologies. The newly added text can be found in the Experimental
Results/page 19-20, lines 703-706, and has been highlighted in blue for
easy reference.
Added Sections in the Manuscript: We compared our proposed
method with two previous studies, [29, 30], which classified infant cries into
Sepsis, RDS, and Healthy categories. Table 8 summarizes key differences and
results, including the number of samples per class, the number of newborns,
minimum duration filters, input features, proposed methods, and overall F1
scores.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors proposed deep learning-based detection of neonatal diseases from acoustic crying input, which is an interesting work. There are some technical comments that should be addressed clearly in the manuscript.

 

1). All the data used for training and testing should be tabulated in terms of sampling frequency, acoustic segmentation duration, and number of time series.

2). More details of the Transformer-based encoder and cry classifier should be provided for the reproduction of the proposed network.

3). The method should consider any environmental noise.

4). Room impulse responses should be considered.

5). The transformer was used to extract nonlinear temporal acoustic features for disease classification. There are two recent advanced deep learning methods for extracting such nonlinear and nonstationary temporal sound features under various noise conditions: "DNoiseNet: Deep learning-based feedback active noise control in various noisy environments" and "Deep learning-based active noise control on construction sites". These should be discussed and, if possible, comparative studies should also be done.

Author Response

Comment 1: All the data used for training and testing should be tab-
ulated in terms of sampling frequency, acoustic segmentation duration, and
number of time series.
Response 1:
We thank the reviewer for this comment. In our work, “segments” and
“samples” are used interchangeably to refer to the same units of data. We in-
terpret “number of time series” as the count of these data instances, which are
used in training, validation, and testing. These EXP segments (or samples),
derived from audio signals, likely represent the distribution across dataset
partitions or categories (e.g., Healthy, Sepsis, RDS). Additionally, the acous-
tic segmentation duration varies across samples rather than being fixed. To
address this, we have included a detailed table in the manuscript that outlines
the sampling frequency, segmentation duration range, and segment counts for
each dataset partition.
We would like to point out that the original data in our dataset has a sam-
pling frequency of 44.1 kHz. However, since the models’ filters and learned
representations are designed for 16 kHz input, this discrepancy could result in
suboptimal feature extraction and reduced performance. To address this, the
sampling frequency is changed to 16 kHz during preprocessing. This adjust-
ment is discussed in detail in the methodology section to ensure compatibility
with the models and optimize performance.
We have updated Table 2, located in the Materials and Methods/Dataset
Overview/Data Utilization, page 7, Table 2 of the manuscript, and it
has been highlighted in blue for ease of reference.
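For illustration (not our exact preprocessing code), the 44.1 kHz recordings can be resampled to the 16 kHz rate expected by the SSL models as follows; the file name is hypothetical.

import torchaudio

# Load an EXP segment (hypothetical file name) recorded at 44.1 kHz.
waveform, orig_sr = torchaudio.load("exp_segment.wav")
# Resample to the 16 kHz input rate expected by wav2vec 2.0, WavLM, and HuBERT.
resampler = torchaudio.transforms.Resample(orig_freq=orig_sr, new_freq=16_000)
waveform_16k = resampler(waveform)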
Comment 2: More details of the Transformer-based encoder and cry
classifier should be provided for the reproduction of the proposed network.
Response 2:
Thank you for your insightful feedback. We have revised the description
of our model architecture to provide a clearer and more detailed explana-
tion. The newly added and revised text can be found in the Materials
and Methods/Fine-Tuning Self-Supervised Learning Models/Fine-
tuning Transformer-based encoder, pages 10-12, lines 455–504, and
has been highlighted in blue for easy reference.
Revised and Added explanation in the Manuscript: In the models,
the raw audio waveform is heavily downsampled through multiple convolu-
tional layers. For example, in wav2vec 2.0, a 4.5-second cry segment sampled
at 16 kHz (72,000 samples) is first processed by a series of convolutional lay-
ers. The initial layer reduces the input by a factor of 5, transforming the
single-channel audio signal into 512-dimensional feature vectors. Each of the
subsequent six convolutional layers further downsamples the signal by a fac-
tor of 2, maintaining the 512-dimensional representation at each stage. This
progressive downsampling results in an overall downsampling factor of 320,
thereby reducing the 72,000 samples to 225 feature frames; we have a frame
rate of 50 frames per second (20 ms per frame). Thus, the number of feature
frames can be determined by considering both the downsampling factor and
the duration of the signal. Each of these 225 frames is a 512-dimensional vec-
tor that encapsulates high-level audio features extracted from the raw signal.
EXP segments shorter than 40 ms (less than two feature frame lengths) lack
sufficient information after downsampling, making it difficult for the model
to identify meaningful patterns. Although such segments were rare (fewer
than five in the dataset), their exclusion ensures that the remaining inputs
are sufficiently informative, enhancing the robustness and reliability of the
classification process.
After convolutional downsampling, the output from the convolutional lay-
ers, consisting of a sequence of 512-dimensional vectors, is projected into a
768-dimensional space through a linear transformation. This projection is
crucial for aligning the feature dimensions with the input requirements of
the Transformer-based encoder. The encoder itself comprises 12 stacked
Transformer layers, each meticulously designed to capture and model com-
plex temporal dependencies and intricate patterns within the audio signal.
At the heart of each Transformer layer lies the multi-head self-attention
mechanism. This component enables the model to weigh the significance of
different frames relative to one another within the sequence. Specifically, the
768-dimensional input vectors are linearly projected into three distinct spaces
to generate queries, keys, and values. Multiple attention heads operate in
parallel, each focusing on different aspects of the input, thereby allowing the
model to capture diverse relationships and interactions across the temporal
sequence. The outputs from all attention heads are then concatenated and
linearly transformed to maintain the original dimensionality.
Following the self-attention sub-layer, each Transformer layer incorpo-
rates a two-layer fully connected feed-forward network (FFN). The FFN in-
troduces non-linearity and enhances the model’s capacity to learn complex
feature transformations. The first linear layer expands the dimensionality
from 768 to 3072 and applies the Gaussian Error Linear Unit (GELU) acti-
vation function to introduce non-linearity. The second linear layer projects
the dimensionality back to 768, ensuring consistency across layers. Stacking
12 such Transformer layers enables the encoder to progressively refine the
feature representations, capturing both local and global dependencies within
the audio signal. Upon passing through all Transformer layers, the encoder
produces a sequence of 768-dimensional vectors, each corresponding to a 20
ms frame of the input audio.
To transition from these frame-level representations to a comprehensive
summary of the entire audio segment, a mean pooling layer is applied across
the temporal dimension. This pooling operation averages all 768-dimensional
vectors within each EXP segment, producing a single 768-dimensional vector
that encapsulates the overall characteristics of the input segment, regardless
of its duration. The aggregated vector is then passed through a one-layer fully
connected classifier, consisting of a single linear layer, which maps the 768-
dimensional input to three output logits corresponding to the pathological
classes: Healthy, Sepsis, and RDS. A softmax activation function is applied
to these logits to produce class probabilities, enabling the model to assign
the input audio segment to the most probable pathological category.
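As a further reproducibility aid, the following minimal sketch (illustrative only, not our exact training code) shows a single fine-tuning step of the encoder and classifier described above, using cross-entropy loss and placeholder data.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
head = nn.Linear(encoder.config.hidden_size, 3)  # 768 -> Healthy, Sepsis, RDS
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-5)
criterion = nn.CrossEntropyLoss()  # applies softmax internally

waveforms = torch.randn(4, 72_000)   # placeholder batch of 4.5 s EXP segments
labels = torch.tensor([0, 1, 2, 0])  # placeholder class indices

logits = head(encoder(waveforms).last_hidden_state.mean(dim=1))
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()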
Comment 3: The method should consider any environmental noise.
Response 3:
We agree with the comment, and to address the reviewer's concern, we
have elaborated on the presence and consideration of environmental noise in
the manuscript. The updated text is located in the Materials and Meth-
ods/Dataset Overview/Data Collection Procedure, page 5, lines
216–223 of the manuscript and has been highlighted in blue for ease of
reference.
Revised Sections in the Manuscript: Recording took place directly
in authentic clinical environments—including maternity rooms and Neonatal
Intensive Care Units (NICUs)—without imposing strict procedural controls.
As a result, these recordings inherently included ambient hospital noises such
as staff conversations, medical equipment alarms, and cries from other in-
fants. Rather than excluding these acoustic elements, we preserved them to
maintain the ecological validity of the dataset, ensuring it accurately reflects
the complexity and variability of real-world neonatal care settings. This ap-
proach is intended to train a model robust enough to handle unpredictable
and noisy conditions often found in practical scenarios. The recorder was
positioned within a range of 10 to 30 cm from the newborn’s mouth during
data acquisition. The health condition of the newborns was assessed through
post-birth screenings, and their cries were categorized as healthy or linked to
specific pathologies based on medical reports.
Comment 4: Room impulse responses should be considered.
Response 4:
We appreciate the reviewer’s comment regarding room impulse responses
(RIRs). Because the dataset was originally collected in 2013 under natural-
istic clinical conditions, no controlled acoustic measurements were taken to
capture RIRs. Although characterizing RIRs could enhance acoustic model-
ing and mitigate reverberation, our current goal is to develop a method that
remains effective in existing real-world conditions. In future work, we plan
to explore techniques to estimate or incorporate RIR-related features as re-
sources allow. We may also consider additional acoustic preprocessing meth-
ods, such as dereverberation, to further improve the robustness and accuracy
of our classification approach under various acoustic profiles. We have added
an explanation in Materials and Methods/Dataset Overview/Data
Collection Procedure, page 5, lines 227–231 of the manuscript, and it
has been highlighted in blue for ease of reference.
Added explanation in the Manuscript: Although this study did not
explicitly estimate or compensate for room impulse responses (RIRs) due to
practical constraints and the historical nature of the dataset, we acknowl-
edge that such characterization could improve acoustic modeling. Future
work may involve techniques to approximate or incorporate RIRs to miti-
gate reverberation effects, thereby enhancing the model’s robustness to di-
verse clinical recording environments.
Comment 5: The transformer was used to extract nonlinear temporal
acoustic features for disease classification. There are two recent advanced
deep learning methods for extracting such nonlinear and nonstationary tem-
poral sound features under various noise conditions: "DNoiseNet: Deep
learning-based feedback active noise control in various noisy environments"
and "Deep learning-based active noise control on construction sites". These
should be discussed and, if possible, comparative studies should also be done.
Response 5:
We thank the reviewer for their comments and suggestions. We have
carefully considered the recommendation to discuss and conduct comparative
studies involving DNoiseNet and deep learning-based active noise control on
construction sites. However, we respectfully disagree with the feasibility and
direct relevance of conducting such comparative studies within the current
scope of our work. This disagreement stems from several key factors:
Firstly, there are fundamental differences in objectives between our classification-
focused approach and the regression-based noise control methods introduced
by the reviewer. Secondly, the input feature types employed by these meth-
ods are divergent. Additionally, there is an incompatibility of loss functions
between the two approaches. Modifying the loss functions of regression-
based models to suit classification tasks would require significant architec-
tural changes and extensive retraining, which is beyond the scope of this
revision.
Furthermore, conducting comparative studies between DNoiseNet and
deep learning–based active noise control on construction sites would require
significant modifications to our current framework. This entails changing in-
put feature types, integrating alternative network architectures, performing
extensive hyperparameter optimization, and utilizing the NARVAL server
for additional computational resources. Given the tight seven-day revision
deadline and the existing computational queue on the NARVAL server, un-
dertaking such comparative experiments is unfeasible. These tasks demand
a longer timeframe to ensure thorough and accurate evaluations.
In summary, while we acknowledge the importance of exploring advanced
noise control methods, the inherent differences in objectives, input features,
and loss functions between ANC-based approaches and our classification-
focused methodology render direct comparative studies impractical within
the current scope. We have also outlined plans to explore such comparative
analyses in future research endeavors.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Accept.

Reviewer 2 Report

Comments and Suggestions for Authors

The revised manuscript has addressed all the issues noted in the previous revision. There are no further comments.
