Peer-Review Record

A New Method for Detecting Onset and Offset for Singing in Real-Time and Offline Environments

Appl. Sci. 2022, 12(15), 7391; https://doi.org/10.3390/app12157391

by Behnam Faghih *, Sutirtha Chakraborty, Azeema Yaseen and Joseph Timoney

Reviewer 1: Arthur Paté

Reviewer 2: Maximos Kaliakatsos-Papakostas

Submission received: 27 May 2022 / Revised: 1 July 2022 / Accepted: 18 July 2022 / Published: 22 July 2022

(This article belongs to the Special Issue Processing Techniques Applied to Audio, Image and Brain Signals)

Round 1

Reviewer 1 Report

This article presents a relevant report on the design and development of a new, simple yet efficient algorithm for estimating onsets in an audio signal, as well as offsets and transition phases between notes (in particular for sung audio signals). I believe this article is worthy of publication, but it first needs a number of corrections and improvements to gain clarity (which it lacks at crucial moments) and get the recognition it deserves. Please see my detailed comments below.

# Section 1
- L60: the "precision", "recall", and "F-measure" should be defined. A very brief definition will do, as the F-measure is introduced in more detail later in the article.
- L70: the "score informed method" doesn't seem to me like a standard/classical method: please briefly define.
- L73 & 76: please check the use of uppercase letters in "No-dense" and "Inter-dataset"
- L83-84: "peaks) in an onset detection function": I feel that a sentence would be worth adding in the previous paragraphs to introduce the concepts of "peak picking" and "detection function". Maybe a full paragraph should be included where the tutorial by Bello et al is cited, recalling the main classical methods before the machine learning era: detection function, peak picking in the time domain or in the frequency domain, zero-crossing rate, difference between different frames in the STFT...
- L98-99: please explain how the algorithms "can fail": is it by having a higher false positive rate? Please mention this, as this will help the reader understand the challenges at stake in your article.
- L102: "F-measure" should be explained
- L103: "much lower": please give the numbers so that the reader can make the comparison
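
To make the definitions requested for L60 and L102 concrete, here is a minimal sketch of how precision, recall, and F-measure are commonly computed for onset detection. It assumes the standard definitions, a greedy one-to-one matching of sorted onset times, and a tolerance of 50 ms. The function name `onset_scores` and the tolerance value are illustrative assumptions, not the evaluation code used in the manuscript.

```python
def onset_scores(reference, detected, tolerance=0.05):
    """Return (precision, recall, f_measure) for onset times in seconds.

    A detected onset counts as a true positive when it lies within
    `tolerance` seconds of a reference onset that is not yet matched.
    The 0.05 s default is an assumption made for this sketch.
    """
    reference = sorted(reference)
    detected = sorted(detected)
    matched = 0
    i = j = 0
    # Greedy one-to-one matching over the two sorted lists.
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # detection precedes the reference: false positive
        else:
            i += 1  # reference has no nearby detection: false negative
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall, in [0, 1].
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical onset times, for illustration only.
    print(onset_scores([0.50, 1.00, 1.50, 2.00], [0.52, 1.20, 1.49]))
```

In this formulation a higher false-positive rate lowers precision and a higher miss rate lowers recall, so reporting all three numbers shows in which way an algorithm "can fail".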

# Section 2
- L158: please define the "spectral flux"
- L176 & 182: Aubio treats 2048-point-long windows (and Essentia deals with 1024 and 512 points). Is this the case regardless of the sampling frequency? Also, please give the corresponding duration in seconds or milliseconds, in order to allow the reader to compare with other values, e.g. L197.
- L193: please state whether algorithms exist for the detection of offsets and transition phases. If yes, please explain why your algorithm is not compared with these other algorithms on the basis of offset detection (and transitions).
- L197: it is stated that the window size is a factor in the study. But previous paragraphs seem to state that the state-of-the-art algorithms are limited to, e.g. 2048 points for Aubio. This seems contradictory, please clarify.
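
The two requests above (a definition of spectral flux, and window lengths expressed as durations) can be illustrated with a short sketch. The 44.1 kHz sampling rate is an assumption, since the comments do not state one, and the half-wave rectified sum shown here is only one common formulation of spectral flux. Both function names are hypothetical.

```python
def window_duration_ms(n_samples, sample_rate_hz):
    """Duration of an analysis window in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


def spectral_flux(magnitude_frames):
    """Half-wave rectified spectral flux between consecutive frames.

    `magnitude_frames` is a list of magnitude spectra (one list of bin
    magnitudes per STFT frame). Only increases in magnitude contribute,
    which is one common variant; other norms are also in use.
    """
    flux = []
    for prev, curr in zip(magnitude_frames, magnitude_frames[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, curr)))
    return flux


if __name__ == "__main__":
    # Window sizes mentioned in the comments; 44100 Hz is an assumption.
    for n in (2048, 1024, 512):
        print(n, "samples ->", round(window_duration_ms(n, 44100), 2), "ms")
```

Because the duration scales inversely with the sampling rate, the same 2048-point window covers twice as much time at 22.05 kHz as at 44.1 kHz, which is why the sampling frequency matters for the comparison.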

# Section 3
- L226-227: it is stated that an offline algorithm is chosen for the detection of f0. It is not clear in the manuscript, but it seems to me that choosing an offline, not real-time algorithm for the first step will definitely make the whole algorithm *not real-time*... Please elaborate on this.
- L237: "same pitch frequency range". This step of the processing is very interesting, and the rationale is well explained. I wonder, however, whether the authors' solution is more efficient than, e.g., scaling down all f0s to one specific octave and then using a log frequency axis. This too should help a lot in regularizing the slopes and making them comparable.
- Figure 2: please consider adding a time scale (with ticks, labels, and numbers) on the x-axis. This is not necessary here, but is consistent with the following figures.
- L251: it would be very interesting to have figure 3 present the further processing of the signals shown in figure 2.
- L251: "upper" should be "left-hand" (or similar), and (L252) "lower" should be "right-hand" (or similar), since the figure aligns panels horizontally.
- L252: "an estimated pitch contour". How is it estimated? I guess it is estimated from what is presented in figure 2b1 and 2b2, and it looks like some sort of moving average, but the manuscript should state this.
- Figure 3: please include a unit for the frequency in y-axis of panel a
- Figure 3: please include a unit for the y-axis of panel b
- Figure 3: please include labels, ticks, etc. for the x-axis (time) of panel b
- Figure 3: caption: I suggest to add some vertical dashed lines to indicate the approximate/expected onset & offset positions. It would be of much help to the reader.
- L259 (and afterwards, e.g. 264, 267): it is not clear what "following" means in this context. Therefore it makes this paragraph (3.4) unclear. It should be made clearer, maybe with another word? Is it a slope in itself? Is it the slope of the following point/sample?
- Figure 4, time axis: scale and numbers are hard to read, please consider rotating the labels
- L262-263 (caption of figure 4): "F4, F4, and E4": please give the average fundamental frequencies.
- Figure 5: please clearly state what "i" (as a subscript) refers to: is it the current point? the current sample?
- L268-269 "the cumulative sum of the consecutive points": is it their "amplitude" (= value on y-axis in figure 3b) that is summed? Please make this explicit.
- End of section 3.4. Can it be said that the algorithm, at this point, detects when the slope changes sign? I feel this can/should be stated very clearly as the summary of this section.
- Caption of figure 6: A classical box and whisker plot shows the median and quartiles, not the mean value. Or is it the cross that indicates the mean? If so, please state this in the caption, and make the cross bigger and with a different color, so that it is easier to read.
- L307 onwards: why are some portions of the text highlighted?
- L322: "across" is unclear here
- L323: please give typical values for t, depending on singing style.
- L332-333: what does "immediately" mean? Is it the following sample? Please make it clear. If so, it should appear more clearly in figures 4 and 8, that e.g. EndTransition and Onset follow each other by one sample only, maybe by, in addition to solid lines, using individual markers for each sample in the time series that are shown.
- L340: It is unclear when and how F0i is made to reach 0 in silent phases. Please explain how the algorithms are constrained to this.
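
The reading proposed above for the end of section 3.4, namely that the algorithm detects when the slope changes sign, can be sketched as follows. This illustrates that reading only and is not the authors' algorithm. The function name, the use of a first difference as the slope, and the treatment of flat segments are all assumptions.

```python
def slope_sign_changes(contour):
    """Return indices k where the slope of `contour` reverses direction.

    The slope at k is taken as contour[k + 1] - contour[k]. Flat segments
    (zero slope) are skipped, so a rise, a plateau, and then a fall is
    reported once, at the last sample of the plateau.
    """
    slopes = [b - a for a, b in zip(contour, contour[1:])]
    changes = []
    prev_sign = 0
    for k, s in enumerate(slopes):
        sign = (s > 0) - (s < 0)  # +1 rising, -1 falling, 0 flat
        if sign != 0:
            if prev_sign != 0 and sign != prev_sign:
                changes.append(k)  # new slope direction starts at contour[k]
            prev_sign = sign
    return changes


if __name__ == "__main__":
    # Hypothetical contour values, for illustration only.
    print(slope_sign_changes([1, 2, 3, 2, 1, 2]))
```

Marking each sample individually, as suggested for figures 4 and 8, would make it visible whether two reported events are separated by exactly one sample.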

# Section 4
- L357: it would be clearer to write "between the start time of the transition and the onset", as the start time of the transition appears earlier, and the onset later.
- L357: Why the second type of onset times? In figure 8, it seems to fit well the ground truth... but why?
- Figure 8: please show both pitch frequency (including its unit in the title of y-axis) and slope, in order to allow the reader to better understand and check your algorithm.
- L367: "F-score measures the similarity", here the F-score really must be explained. How does it measure the similarity? What is its range of variation? Between 0 and 1?
- L377: "that they are" may be replaced by "that is"
- L384: how do the 3.4% in accuracy translate in terms of F-score? It would be clearer to keep on using the same measures, for comparison purposes (likewise L386).
- L392-394: It is unclear what the ANOVA is run on. Is it on the individual values that are averaged to give the values in tables 1 and 2, with groupings done according to the factors "window size" and "algorithm"? This should be stated in the manuscript.
- Captions of table 1 and table 2: The meaning of the numbers in the title row (10, 50, 100, ...) must be indicated in the caption.
- tables 1 & 2: why are some scores in bold font?
- tables 1 & 2: is there an explanation for the fact that increasing the window size increases the F-score? If yes, this should be included in the manuscript. Also, what would the F-score become if the window size were increased above 250? Does it stay at the plateau it seems to have reached? Does it decrease? Or does it keep increasing?
- L402 (and also in the caption of table 3): the concept of "acceptable duration" is unclear: why is it acceptable (which sounds quite like a value judgement) and not, say, "typical"? At this point it is unclear, but the "acceptable delay" (L410) is better argued and explained, and is in itself (pun aside) acceptable!
- L437-438: Would a smaller buffer size imply the use of shorter windows... and therefore lead to decreasing accuracy?

# References
- Refs. 1, 22 and 25 should include a publisher, DOI, or website address
- Refs. 6, 20, and 36 are incomplete
- Is Ref. 28 a PhD dissertation? If so, I think it should be stated more clearly, including the name of the university.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Summary: This paper presents a method for real-time onset, offset and transition time estimation on a cappella singing voice. The presented implementation does consider an offline element, i.e. pitch tracking, but this component can be substituted by a (less accurate, according to the current state of the art) real-time pitch tracking method. Results are presented on two datasets and comparisons are reported with 7 other onset detection methods (most of which operate offline).

General comments:

- The paper is well written and all aspects of the method, data and results are very clear (except for some minor adjustments that need to be made, in my opinion).
- The presented method involves several parameters that appear to offer fine-tuning capabilities on datasets, which is both good (allows flexibility) and bad (requires multiple trials, while optimal results are not guaranteed).
- The presented results do not draw a clear picture for a decisive overall superiority of the proposed method in small windows of error (which is crucial for practical real-time applications).
- The real-time aspect, as presented, implies that trajectory changes are identified when they start, while the presented algorithm appears to be able to identify trajectory changes when they finish.
- The comparison is extensive, even though more datasets could have been tested. Datasets with annotations are limited, but results could be presented on additional datasets, even with fewer annotations (e.g. QBSH dataset), for showing the applicability potential of the method more widely and for examining the inner parameters of the implementation in new cases.

What needs to be improved:

- Please mention in the introduction that the current implementation does not consider real-time restrictions (because of the method used for pitch tracking).

- In the conclusions, please provide your ideas about other datasets that need to be involved and about possible steps for alleviating the parameter adjustment problem, based on the dataset of application.

Other comments:

Section 2.2.3: It is not clear if Aubio onset detection works in real-time, since it includes peak detection. Please clarify at this point that Aubio works in real-time (as you do with all other methods).

Section 3.1. pYin, an offline F0 detection method, is employed, but the method is described as real-time up to this point. Please make sure that it is made clear from the beginning of the paper that the current implementation and results do not involve solely real-time components.

Section 3.2. Why not use a logarithmic stretching function, which would “normalize” octave-related information?

Section 3.2. Equation 1. Different typesetting of “max” on the right- and left-hand sides of the equation leaves doubts, on the first read, about whether this variable is the same on both sides.
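
As an illustration of the logarithmic mapping raised for Section 3.2, the sketch below converts frequencies to cents relative to a reference and folds them into a single octave. The 440 Hz reference and both function names are assumptions chosen for the example. This is not the normalization used in the manuscript.

```python
import math


def hz_to_cents(f_hz, ref_hz=440.0):
    """Map a frequency in Hz to cents relative to `ref_hz` (assumed 440 Hz)."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def fold_to_octave(f_hz, ref_hz=440.0):
    """Position within one octave, in cents in the range [0, 1200)."""
    return hz_to_cents(f_hz, ref_hz) % 1200.0


if __name__ == "__main__":
    # The same musical interval gives the same distance in cents in any
    # octave, whereas the distance in Hz doubles with each octave.
    low = hz_to_cents(466.16) - hz_to_cents(440.0)
    high = hz_to_cents(932.32) - hz_to_cents(880.0)
    print(round(low, 2), round(high, 2))
```

On such a scale, a pitch slope expressed in cents per second is directly comparable between low and high voices, which is the "normalizing" effect the comment refers to.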

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report
