Music Similarity Detection Through Comparative Imagery Data
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In their manuscript, the authors address the problem of similarity detection in music recordings. In the long run, their efforts may contribute to plagiarism detection and to the problem of recognizing music created by artificial intelligence. The manuscript is well structured. The main problems, aims, and means are clearly stated.
First, a few minor observations, which I see only as suggestions and recommendations, not as obstacles to publication:
- The question is to what extent the label “imagery” is justified. Strictly speaking, this is not image processing, only the use of image-processing methods. However, this does not change the message of the article.
- There are hints in the text, but perhaps it should be made explicit that the very definition of musical similarity is vague, to say the least (in the musicological context as well). Similarly, the legal issues are very ambiguous. I think the authors could have afforded a clearer acknowledgement that even the best machine learning cannot resolve this underlying issue, which for now rules out any prospect of fully automating plagiarism detection in the near future.
- Likewise, it should perhaps be noted that the proposed workflow (however advanced) is clearly designed primarily for song-type genres. I would be skeptical of applying the results of this paper to long, instrumentally complex sections such as those relevant to plagiarism in film music.
Aside from these minor points, I have one major objection: I am afraid that the training and validation datasets are very small given the size of the model. I completely agree with the authors that “music-informed” preprocessing can reduce the data requirements, but six simple songs is really not much. I am aware that the songs have been rearranged, which increases their number to some extent, but I would regard that more as “part of the game” than as fresh new input.
I would be happy to recommend the paper for publication, but please ensure that the 80%:20% dataset split provides at least several whole “validation songs” (one possible song-level split is sketched below).
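For reference, one way to guarantee that the 80%:20% split yields whole “validation songs” rather than rearrangements of training songs is to split by source song instead of by individual example. The following is only a minimal sketch under that assumption; the `groups` array is a hypothetical stand-in for the authors' actual song metadata, not part of their pipeline.

```python
# Minimal sketch: song-level 80/20 split so that every example derived from a
# given source song (including its rearrangements) lands entirely in either
# the training set or the validation set.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((60, 128))             # placeholder feature matrix
y = rng.integers(0, 2, size=60)       # placeholder similarity labels
groups = np.repeat(np.arange(6), 10)  # hypothetical: 6 source songs, 10 variants each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

# No source song appears on both sides of the split.
assert set(groups[train_idx]).isdisjoint(set(groups[val_idx]))
print("validation songs:", sorted(set(groups[val_idx])))
```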
Author Response
Please see the attached PDF file.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper presented a technique for music similarity detection that is based on feature-based analysis and data visualization. Feature-based sub-models were trained using imagery data, and an ensemble model for combining the predictions of the sub-models was proposed. Using visual imagery facilitates explainable AI.
The paper is well-written. The experimental section is comprehensive. The main conclusions of the paper are supported by the experimental results.
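For reference, the combination of feature-based sub-model outputs described above could, in the simplest case, be a weighted average of their similarity scores. The sketch below is only an assumed illustration of such an ensemble step, not the authors' actual model; the feature names, weights, and threshold are hypothetical.

```python
# Hedged sketch of an ensemble step: combine the similarity scores produced by
# independent feature-based sub-models into a single decision. The sub-model
# names, weights, and threshold are illustrative assumptions only.
from typing import Dict, Tuple

def ensemble_similarity(sub_scores: Dict[str, float],
                        weights: Dict[str, float],
                        threshold: float = 0.5) -> Tuple[float, bool]:
    """Weighted average of sub-model similarity scores in [0, 1]."""
    total_weight = sum(weights[name] for name in sub_scores)
    combined = sum(weights[name] * score for name, score in sub_scores.items()) / total_weight
    return combined, combined >= threshold

# Hypothetical per-feature scores for one pair of songs.
scores = {"melody": 0.82, "rhythm": 0.64, "harmony": 0.71}
weights = {"melody": 0.5, "rhythm": 0.2, "harmony": 0.3}
print(ensemble_similarity(scores, weights))   # (0.751, True)
```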
The following are the main recommendations to further improve the paper:
- Add a paragraph at the end of Section 1 presenting the main outline of the paper.
- C3 is not mentioned in the paragraph at lines 99-104.
- Add a link for the dataset used for training and validation (Line 91).
- Figure 4 and its discussion need to be improved. For example, why does “(bottom)” appear at the end of the figure caption? Should this figure present the overall architecture?
- Review Lines 443-446. For example, the statement “this …performance does not mean that …their performance in independent testing is not assured since …” is not trivial to understand. I recommend re-writing this part using simpler statements.
- Some abbreviations do not appear in the Abbreviations list, e.g., CPI, V, and T.
- Cite more references. Also, Reference #21 does not have any link.
The authors are recommended to fix the following typos:
- Cite a reference in Line 30 for “imitation is the sincerest form of flattery”.
- “showed that the former outperformed by latter” in Line 45. It should be “the former was outperformed by the latter”.
- Line 51: why use the past tense in “Text-based approaches focused…”?
- Line 68: “As there are numerous ways for specifying a feature in music.” is not a full sentence.
- Line 70: “to find most effective” should be “to find the most effective”.
- Line 175: Tables 1-5 instead of 1~5.
- Line 184: “were used to training” should be “were used for training”.
- Line 330: ‘compare’ is repeated twice.
- Line 353: “did not result heatmaps as promising” is not correct English.
- Line 510: “using , which” using what?
- Line 534: “is not considering” should be “is not considered”
- In the caption of Figure 14, I recommend defining the acronyms used such as T, V, and CPI.
Author Response
Please see the attached PDF file.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors use visualization (such as correlation heatmaps) to present relationships between pieces of music. Some types of music segments are also presented visually with histograms. The authors use machine learning to train the system to recognize similarity between music segments, in order to enable plagiarism detection.
The authors should consider changing the following:
1. Title - it is not complete, since it only presents the problem (music similarity detection) and part of the solution (imagery data)... but the contributing method, the tools, and the visualization with correlation heatmaps are not mentioned.
2. Method - it is not clear how the correlation heatmap is created for music comparison: how the square segments get their colors, what one colored square relates to (which part of the music, and of what duration?), and which music segments were used in the heatmap (obviously not the whole composition is presented, but only a segment). (An assumed construction is sketched after this list for reference.)
3. In the second contribution announcement (line 110), the authors state that they will deal with predictions and that the proposed approach will “improve the interpretability of model predictions”. However, there are no clear results about any predictions, nor any evaluation of the performance of the proposed system based on performance metrics.
4. The proposed method is presented in Figure 1. It is a bit confusing: why do the authors mention other machine- and/or human-centric processes to be included for other decisions (ensemble decisions)? It is very important to make a clear distinction between what is automated and what is human-based in the proposed method.
5. Figure 2 presents surprising results in the comparison of different songs, such as “Under Pressure” with “Ice Ice Baby” in (b) and with “Bitter Sweet Symphony” in (c). It is surprising that, for different songs, the correlation heatmaps show some periodically similar segments; irregularity was expected in the heatmaps, not this much similarity. How is this possible?
6. What does 0-100 mean in Figure 2? It is said that it represents “temporal steps along axes of 1 bar song segments”, but it is not clear what a “1 bar song segment” means.
7. A single song has many coexisting channels that together create the overall impression of the song. How is this aspect included in the approach? Are the channels separated before segmentation and visualization?
8. Figure 8 presents the overall architecture, but it includes a particular tool, jSymbolic... Usually, diagrams presenting an overall architecture consist of abstract elements, not particular tools combined with abstract elements.
9. Caption of Figure 4 - why is “(bottom)” placed at the end of the figure caption? It simply should not be there.
10. How were the songs and classical music pieces selected for this experiment? Why were these particular songs and classical pieces selected for this research? Are there any results for datasets with multiple items? What are the criteria for item and dataset selection, and what are the sources and the number of items in the dataset collection used for the experiment?
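Regarding points 2 and 6 above, a bar-wise correlation heatmap of the kind the paper appears to use is commonly built by cutting each piece into one-bar segments, summarizing each bar as a feature vector, and coloring cell (i, j) with the correlation between bar i of one piece and bar j of the other. The following is only an assumed reconstruction using bar-wise chroma vectors, not the authors' exact pipeline.

```python
# Assumed reconstruction of a bar-wise correlation heatmap between two pieces.
# Each one-bar segment is summarized as a chroma (pitch-class) vector, and cell
# (i, j) holds the Pearson correlation between bar i of piece A and bar j of
# piece B, so repeated material produces periodic bright stripes even between
# different songs that share similar chord loops.
import numpy as np
import matplotlib.pyplot as plt

def bar_correlation_heatmap(bars_a: np.ndarray, bars_b: np.ndarray) -> np.ndarray:
    """bars_a: (n_bars_a, 12) and bars_b: (n_bars_b, 12) chroma vectors per bar."""
    heatmap = np.empty((bars_a.shape[0], bars_b.shape[0]))
    for i, a in enumerate(bars_a):
        for j, b in enumerate(bars_b):
            heatmap[i, j] = np.corrcoef(a, b)[0, 1]
    return heatmap

# Toy data: two pieces whose bars cycle through a similar 4-bar loop, which
# yields the periodic pattern observed in Figure 2.
rng = np.random.default_rng(1)
loop = rng.random((4, 12))
piece_a = np.tile(loop, (25, 1)) + 0.05 * rng.random((100, 12))
piece_b = np.tile(loop, (25, 1)) + 0.05 * rng.random((100, 12))

hm = bar_correlation_heatmap(piece_a, piece_b)
plt.imshow(hm, origin="lower", cmap="viridis")
plt.xlabel("bar index, piece B")
plt.ylabel("bar index, piece A")
plt.colorbar(label="Pearson correlation")
plt.show()
```

Under this reading, 0-100 on the Figure 2 axes would simply index the one-bar segments, and the periodic similarity between different songs would reflect repeated chord loops; whether this matches the authors' actual construction remains for them to clarify.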
Author Response
Please see the attached PDF file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
In my opinion, the manuscript is ready to be published.