1. Introduction
Ensuring a baby’s well-being is a top priority for new and expecting parents, but the need for constant monitoring can be physically and emotionally exhausting. Monitoring children’s safety and health, both in the first months of life and beyond, is not only a crucial necessity but also a moral obligation that lies at the foundation of parenting. In their first months of life, babies are sensitive and vulnerable to various external factors that can jeopardize their health and even their lives, which is why their developmental parameters must be monitored continuously. According to UNICEF, 2022 saw the lowest recorded number of deaths among children under five; nevertheless, the reported figure of 4.9 million can and should be reduced further every year.
With advancements in technology, AI-powered baby monitoring systems have become an essential tool in assisting parents with infant care. Mobile applications are now an integral part of daily life, providing solutions for many aspects of parenting, including baby monitoring. In addition, the use of IoT (Internet of Things) devices has gained momentum across many fields, and monitoring devices such as smart cameras, remote communication devices, and breathing or heart-rate monitors are now on the market.
Integrating an AI-based model for recognizing infant cries can alleviate parental stress by providing timely alerts even when parents are not in close proximity to their baby. Such a model, integrated with hardware components, would be capable of alerting parents in near real time to any discomfort their child is experiencing, allowing them to reach the child and address the problem as quickly as possible.
There has been significant progress in recent years in research on automated infant cry detection and analysis. An ensemble learning approach combining support vector machines with deep feature extraction, called DeepSVM, was developed [1] to classify different types of infant crying. DeepSVM’s accuracy surpassed 90%, demonstrating the effectiveness of Fourier frequency-domain features combined with neural networks for decision making. Mel-frequency, bark-frequency, and linear prediction cepstral coefficient features were also shown to be effective when combined with convolutional and recurrent neural networks for infant cry classification [2]. The proposed hybrid architecture demonstrated the benefits of temporal feature extraction by achieving significant improvements over previous models. Improvements in classification performance were also shown to be achievable through transfer learning. In [3], pretrained VGG16 models were adapted to use gammatone-frequency cepstral coefficients and spectrograms for infant cry detection, showing that the fusion of multiple audio feature types can further enhance classification performance. BCRNet [4] is another model developed using transfer learning to extract deep features, from which a subset is selected for subsequent deep feature fusion. Recently, transformer models have also been introduced for baby cry detection: an SE–ResNet–Transformer model trained on MFCC features has been shown to outperform traditional models [5]. Recent studies [6,7] have confirmed the feasibility of using MFCC features, even though they are less informative than other representations, by obtaining high performance with simpler models such as multi-layer perceptrons or convolutional neural networks.
Despite these advances, several important gaps remain in the literature. Firstly, there has been limited systematic comparison of modern deep learning architectures specifically for binary cry detection tasks in realistic monitoring environments. Secondly, while various feature extraction methods have been employed, an evaluation of their relative effectiveness across different model architectures is lacking. Thirdly, the trade-offs between using pretrained models versus training from scratch have not been thoroughly investigated for this specific domain.
In this study, we specifically aim to address these research gaps through a comprehensive evaluation of multiple deep learning architectures (ResNet, EfficientNet, VGG16, InceptionV3, DenseNet, and others) for binary detection of infant crying. Our methodology involves analyzing a curated database of infant cries combined from multiple sources, applying standardized preprocessing techniques, evaluating both spectrogram and MFCC feature representations, and systematically comparing the performance of pretrained versus non-pretrained models. Our primary objective is to determine the most effective approach for real-world infant cry detection systems that can generalize across diverse environments while maintaining high precision and recall rates.
3. Results
For the analysis of this array of neural network architectures, we present the loss (Table 1) and accuracy values for the training, validation, and testing of these models, as well as the F1 score for the testing data. Along with the training loss values from Table 1, we report the training time for each architecture on an RTX 3080 in Table 2. Furthermore, we trained each model on either the spectrogram or the MFCC features in order to determine which of the two feature types is more appropriate for training such a model for infant cry detection. The same procedure was applied to the machine learning models (SVM and RF); however, no loss values are reported for these models in Table 1, and the validation subset of the dataset was merged into the testing subset.
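For readers who wish to reproduce the feature pipeline, the sketch below illustrates how the two feature types can be extracted with librosa; the sampling rate, segment length, and filter-bank sizes are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Illustrative sketch of the two feature types compared here (parameter
# values are assumptions, not the exact settings used in our experiments).
import librosa
import numpy as np

def extract_features(path, sr=22050, duration=5.0):
    # Load a fixed-length audio segment (padded/truncated to `duration` seconds).
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))

    # Log-mel spectrogram: a 2-D time-frequency image fed to the CNNs.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: a more compact, decorrelated representation of the same signal.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    return log_mel, mfcc
```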
Another method for analyzing the pretrained models was to freeze the encoding layers of the pretrained architectures. The effect of this choice on accuracy can be observed by comparing Table 3 and Table 4, and its effect on the F1 score for the testing data is presented in Table 5.
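The difference between the two settings can be summarized with the following minimal Keras sketch; the ResNet50 backbone, input shape, and classification head are illustrative assumptions, and freezing simply means the encoder weights are excluded from gradient updates.

```python
# Minimal sketch of the frozen vs. fine-tuned settings (Keras / TensorFlow);
# the backbone choice, input shape, and classification head are assumptions.
import tensorflow as tf

def build_classifier(freeze_encoder: bool, input_shape=(224, 224, 3)):
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = not freeze_encoder  # frozen: encoder weights stay fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary cry / no-cry
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```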
The results show a clear trend favoring spectrogram-based feature representations over MFCCs, particularly for deep learning models. This aligns with several other studies that have found the same trend when comparing spectrograms and MFCCs for audio classification [14,28,29]. In the case of infant crying, the MFCC’s logarithmic mel-filter bank may under-represent the higher frequency bands of infant cries [30], and its narrowband bins might not have the resolution required to capture their high-pitch harmonics [31]. Moreover, the MFCC transformation decorrelates frequency bands, which may diminish the harmonic structure and pitch modulations that are characteristic of infant cries. Given that this issue appears specifically in the pretrained models, spectrogram features may also resemble natural images more closely and thus be better suited to transfer learning than MFCC features.
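One way to make spectrograms compatible with ImageNet-pretrained backbones is to rescale and replicate the single-channel log-mel image across three channels; the sketch below shows one such assumed mapping (the resizing and normalization choices are hypothetical, not the exact preprocessing used here).

```python
# Hypothetical sketch of adapting a single-channel log-mel spectrogram to the
# 3-channel input expected by ImageNet-pretrained backbones.
import numpy as np
import tensorflow as tf

def spectrogram_to_image(log_mel: np.ndarray, size=(224, 224)) -> np.ndarray:
    # Scale to [0, 255] so ImageNet-style preprocessing can be reused.
    rng = log_mel.max() - log_mel.min() + 1e-8
    s = (log_mel - log_mel.min()) / rng * 255.0
    img = tf.image.resize(s[..., np.newaxis], size)   # (H, W, 1)
    img = tf.repeat(img, repeats=3, axis=-1)           # (H, W, 3)
    return tf.keras.applications.resnet50.preprocess_input(img.numpy())
```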
As seen in Table 3, ResNet50 trained on spectrograms achieved the highest test accuracy of 98.40 ± 0.6%, outperforming other architectures, including VGG16, InceptionV3, and EfficientNet. This suggests that ResNet’s residual connections facilitate better feature propagation, enabling the model to effectively capture the frequency variations characteristic of infant cries.
In contrast, MFCC-based models generally performed worse than their spectrogram-based counterparts. For instance, VGG16 trained on MFCCs exhibited a dramatic drop in test accuracy (80.53 ± 2.6%) compared to its spectrogram-based equivalent (95.26 ± 0.8%), indicating that MFCCs might lack the necessary spectral detail for robust cry classification. However, MobileNet and DenseNet performed relatively well on both spectrogram and MFCC features, suggesting that certain architectures are more adaptable to different feature representations.
Another observation that can be extracted from these analyses is the difference in performance between frozen (Table 4) and non-frozen (Table 3) pretrained models. While fine-tuned models consistently outperformed their frozen counterparts, the gap was particularly evident for EfficientNet, which struggled significantly when its encoder was frozen, dropping to 50.33 ± 0.1% accuracy. This suggests that EfficientNet relies heavily on fine-tuning rather than feature extraction alone, making it less effective in a frozen state for cry detection tasks.
The only models that maintained strong accuracy regardless of whether their pretrained layers were frozen or not were MobileNet and DenseNet. This indicates that these architectures may have a more flexible feature extraction capability, making them suitable for low-computational-resource scenarios where fine-tuning is not feasible.
While accuracy provides a general measure of classification performance, it can be misleading in cases of class imbalance, making the F1 score a more reliable metric for evaluating model effectiveness. As seen in Table 5, the F1 scores of the best-performing spectrogram-based models (ResNet50: 98.42 ± 3.2%, InceptionV3: 98.10 ± 1.3%, Xception: 97.57 ± 1.1%) closely align with their high accuracy values, confirming that these models maintain a strong balance between precision and recall. This is particularly important for infant cry detection, where high recall ensures that actual cries are not missed, while high precision minimizes false alarms. As expected, the trend seen in MFCC-based models persisted, with lower F1 scores, further reinforcing the observation that MFCC features may not be as effective as spectrograms for this task. Additionally, EfficientNet, which had already shown poor accuracy when frozen, also recorded an F1 score of 39.99 ± 32.6%, confirming its struggle to learn meaningful cry-related features without fine-tuning.
While deep learning models achieved the highest accuracy overall, traditional machine learning approaches such as RF and SVM performed surprisingly well. RF achieved 96.8% accuracy on spectrogram features on the test data, which is comparable to some deep learning models. This suggests that classical methods can still be viable, particularly for applications requiring lower computational costs. SVM performed slightly worse than RF, with an accuracy of 92.3% on the test dataset, indicating that nonlinear kernel methods may not be as effective as ensemble-based approaches for this task.
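A rough sketch of how such classical baselines can be set up with scikit-learn is shown below; the flattened feature input and the default hyperparameters are assumptions for illustration only.

```python
# Rough sketch of the classical baselines (assumed setup: flattened
# spectrogram or MFCC features, mostly default scikit-learn hyperparameters).
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate_classical(X_train, y_train, X_test, y_test):
    # X_* are 2-D arrays (samples x flattened features); y_* are 0/1 labels.
    for name, clf in [("RF", RandomForestClassifier(n_estimators=200)),
                      ("SVM", SVC(kernel="rbf"))]:
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        print(name, accuracy_score(y_test, pred), f1_score(y_test, pred))
```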
These results indicate that ResNet50 (99.6% accuracy, 99.6% F1 score, non-frozen, spectrograms), DenseNet/Xception (99.2% accuracy, 99.2% F1 score, non-frozen, spectrograms), InceptionV3/Inception-ResNetV2 (99.0% accuracy, 99.0% F1 score, non-frozen, spectrograms), and MobileNet (98.6% accuracy, 98.5% F1 score, non-frozen, spectrograms) are the most viable options for real-time infant cry detection. While ResNet50 provides the highest accuracy, MobileNet is more computationally efficient, making it better suited for edge devices such as IoT-based baby monitors. Nevertheless, simpler models such as CNN1 (98.7% accuracy, 98.7% F1 score, MFCC) or CNN3 (98.2% accuracy, 98.2% F1 score, MFCC) obtained surprisingly strong results, offering a more efficient option with only a slight decrease in performance. Additionally, traditional machine learning models like RF could serve as an alternative in resource-constrained environments.
To provide a clearer synthesis of our findings, Table 6 presents a comprehensive ranking of the evaluated models based on multiple performance criteria. This ranking considers not only the raw performance metrics (accuracy and F1 score) but also computational efficiency and suitability for edge deployment. As shown in Table 6, while ResNet50 achieves the highest accuracy, models like MobileNet and CNN1 offer compelling alternatives for resource-constrained environments with only minimal performance trade-offs.
We also show the impact of our augmentation strategy in Table 7 through an ablation study on the MobileNet architecture, evaluating its test F1 score on datasets built with different augmentation strategies. The baseline version includes no augmentations, the “all” version includes all augmentations (pitch shift, white noise, colored noise, time shift, volume change), and for each augmentation we created a version of the dataset containing only that individual strategy. As Table 7 shows, enlarging the dataset with random augmentations improves the model’s performance on the test set.
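The waveform-level augmentations listed above can be implemented along the lines of the following sketch; the parameter ranges are assumptions, and colored noise is omitted for brevity.

```python
# Illustrative waveform-level augmentations corresponding to the strategies
# in Table 7 (parameter ranges are assumptions, not the exact values used).
import numpy as np
import librosa

def augment(y, sr, rng=np.random.default_rng()):
    # Pitch shift by a random number of semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    # Additive white noise at a small random amplitude
    # (colored noise would follow similarly with frequency-shaped noise).
    y = y + rng.normal(0, 0.005, size=y.shape)
    # Circular time shift of up to 0.5 s.
    y = np.roll(y, rng.integers(-sr // 2, sr // 2))
    # Random volume (gain) change.
    y = y * rng.uniform(0.7, 1.3)
    return y
```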
4. Discussion
The development of an effective baby monitoring system, such as the one described in this paper, depends on the integration of both hardware and software components. The key feature of the system is the baby cry detection model, which must achieve high accuracy with minimal false positives for a robust performance in real-world conditions. In this work, we present an analysis of various machine learning models (including deep learning models) to determine the best option for such a system.
In this analysis, we also include two feature spaces created from the recorded signals based on signal processing, and we analyze the impact of training the encoding part of deep learning models. As illustrated in Figure 1, the different representations of audio data (waveform, MFCC, and spectrogram) capture varying aspects of infant crying sounds. The spectrograms (Figure 1c) clearly show the distinctive frequency patterns and temporal variations characteristic of infant cries, with stronger high-frequency components compared to other sounds. This visual distinction aligns with our quantitative findings in Table 2, Table 3 and Table 4.
The creation of a baby cry detection model necessitates a comprehensive dataset, which was compiled from various sources, ensuring a diverse range of cry sounds and other audio inputs for robustness. The extended dataset created through augmentation enhances the model’s ability to generalize in real-world environments.
The evaluation of different machine learning models, including neural network architectures, has provided insights into their performance in classifying infant cries; the F1 scores obtained on previously unseen test data demonstrate the potential to achieve high precision and recall rates. This dual emphasis on precision and recall is essential to minimize false positives while ensuring that actual cries are not missed.
As expected, spectrogram-based feature extraction better preserves the time–frequency characteristics of the data, allowing deep learning architectures to distinguish between crying and background noises more effectively, while MFCCs, commonly used in speech processing, may lose essential frequency-related information when applied to this specific task. Our quantitative findings in Table 2, Table 3 and Table 4 show that spectrogram-based models generally outperformed their MFCC-based alternatives. Specifically, more complex models are more capable of learning from spectrograms than from MFCC features, whereas previously untrained (and simpler) convolutional neural networks do not show this preference. The performance rankings in Table 5 further demonstrate that models capitalizing on these rich spectrogram features achieved superior classification results. Our results suggest that more complex models require richer input representations, as accuracy is lower in both training and testing on the MFCC data; this is especially visible in the test accuracy and F1 score obtained by the more complex models on the MFCC data compared to the spectrogram data.
For the non-frozen models, we observed that their performance is satisfactory only for the spectrogram features, while for the MFCC features, they are often unable to learn, with the validation and test accuracy being close to that of random labelling. This is not the case for the frozen architectures, where the models are capable of learning from both the spectrograms and the MFCC features; however, the performance is slightly lower than that of non-frozen models combined with spectrogram features. The exception is the EfficientNet architecture, which is able to learn only on the non-frozen setting combined with spectrogram features. Two models escape this pattern, specifically DenseNet and MobileNet, which are capable of learning from both spectrogram and MFCC features regardless of whether their layers are frozen or not.
Overall, from the results obtained, the best option is the non-frozen setting with spectrogram features, where the best performance is obtained by the ResNet50 architecture. However, a simpler architecture such as CNN3 with no pretraining is capable of learning from both sets of features with satisfactory performance (~3% decrease in test F1 score and ~2% in test accuracy) while requiring considerably less training time. For a more robust architecture, our results indicate that DenseNet and MobileNet are the best options, as they achieve high accuracy on every combination tested.
While ResNet50 achieved the highest accuracy (98.42 ± 3.2%), its deployment requirements (~25 million parameters, 98 MB model size, 4 GFLOPS) make it less suitable for resource-constrained environments. Our analysis shows that MobileNet offers an excellent performance–efficiency balance (95.27 ± 3.0% F1 score with only 16 MB size and 0.6 GFLOPS), making it particularly suited for edge deployment in real-world baby monitors. Additionally, we have quantified inference times across various hardware platforms (desktop CPU and Raspberry Pi 4), confirming that MobileNet maintains real-time performance (~30 ms per 5 s audio segment). The CNN3 model presents another compelling option for highly constrained environments, offering 97.8% accuracy with just 9 MB size and 0.3 GFLOPS.
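As an illustration of how such per-segment latency can be measured on an edge device, the sketch below times a single TensorFlow Lite inference; the model file name and the use of random input data are placeholders.

```python
# Hedged sketch of measuring per-segment inference latency with TensorFlow Lite;
# the model path and the random stand-in input are placeholder assumptions.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet_cry.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Stand-in features for one 5 s audio segment, shaped to the model's input.
x = np.random.rand(*inp["shape"]).astype(np.float32)

start = time.perf_counter()
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
prob = interpreter.get_tensor(out["index"])
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"inference: {elapsed_ms:.1f} ms, p(cry) = {prob.ravel()[0]:.3f}")
```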
While deep learning architectures achieved superior accuracy, it is worth noting that traditional machine learning models such as RF and SVM performed surprisingly well for both training and testing, showing their ability to generalize. RF achieved an accuracy of 96.8% on spectrograms, making it competitive with several deep learning models. SVM performed well (92.3%) but was slightly less effective than RF- and CNN-based architectures. This suggests that in low-computational environments or situations where deep learning is impractical, RF can serve as a viable alternative for infant cry detection. However, deep learning models retain an advantage in their ability to generalize across more complex and diverse datasets.
Infant vocalizations lack the phonetic structure present in adult speech, and therefore methods like MFCC and STFT might not optimally capture the unique acoustic characteristics of infant cries. Syllabic-scale features [11], which are specifically designed to address the temporal and spectral properties of infant vocalizations, may offer a more tailored approach. While our study shows that traditional speech-derived features can still achieve high classification accuracy when paired with appropriate deep learning architectures, future work could benefit from exploring specialized feature extraction methods that account for the pre-linguistic nature of infant cries. This could further improve model performance, especially in challenging acoustic environments with multiple overlapping sound sources.
In summary, the baby monitoring system proposed in this study is based on and necessitates an automatic cry detection model. We have analyzed several architectures with various feature types on a comprehensive dataset to determine the best approach for the detection of infant cries in various real-world conditions and environments. Our findings suggest that deep-learning-based baby cry detection is a viable and highly accurate solution for real-world monitoring systems. Among the models tested, ResNet50, DenseNet, and MobileNet performed the best, particularly when trained on spectrogram features and fine-tuned. While traditional machine learning models like RF provide a competitive alternative, deep learning remains the superior approach for real-time, high-accuracy detection. Moving forward, improving model efficiency, reducing false positives, and integrating multimodal data sources will be key in making AI-driven baby monitoring systems more practical and accessible. Future work could also encompass a more comprehensive dataset with various sounds from a baby’s repertoire in order to classify more precisely the current state of the infant.