A Multi-Source Pipeline for Extracting Traditional-Style Chinese Melody Data from Symbolic Files and Score Images
Abstract
1. Introduction
- 1.
- A multi-source ingestion architecture whose symbolic branch (MIDI, MusicXML) and image branch (Jianpu, Staff) produce a single event-level schema. The image branch is experimentally validated in this paper; the MIDI/MusicXML branch is wired in but not separately benchmarked in the present submission and is, therefore, reported as an architectural slot rather than as quantitatively validated evidence.
- 2.
- A VLM-driven Jianpu transcription strategy that uses a general-purpose vision language model to decode numbered notation without handcrafted symbol segmentation or task-specific fine-tuning, together with an explicit content-level accuracy measurement against a manually annotated ground-truth subset and an alt-prompt ablation that isolates how much of that accuracy is attributable to the production prompt.
- 3.
- A unified event representation and a structural QC layer that normalize heterogeneous outputs to a single downstream schema. The QC layer is explicitly structural—schema plus at least one melodic track with at least one musical event—rather than a content-level correctness gate; content-level correctness is reported separately in Section 4.7.
- 4.
- A quantitative accuracy and consistency audit of the Jianpu branch, comprising a 50-page manually annotated ground-truth benchmark (key/time-signature/BPM/first-16-note pitch F1 and SER; Section 4.7) and a 10-page self-consistency audit (Table 4), so that the reliability of the Jianpu branch is measured against an external reference rather than inferred from the pipeline’s own acceptance rate.
2. Related Work
2.1. Symbolic Music Datasets
2.2. Optical Music Recognition
2.3. Automated Dataset Construction Pipelines
3. Materials and Methods
3.1. Data Sources
3.2. Incremental Collection Protocol
3.3. System Architecture
3.4. Score-Type Classification
“Classify the following sheet music image into exactly one category: ‘jianpu’ (numbered notation using digits 1–7), ‘staff’ (standard five-line notation), ‘mixed’ (both systems on the same page), or an invalid category such as lyrics-only or unreadable noise. Return a JSON object with a single field ‘type’.”
3.5. Optical Music Recognition Strategy
3.5.1. Staff Notation
3.5.2. Jianpu Notation
3.5.3. Error Handling and Retry Logic
3.6. Unified Event Schema
3.7. Structural Quality Control
3.8. Implementation Details
4. Results
4.1. Prototype Workflow Validation
4.2. Collection Dynamics and Processing Yield
4.3. Pipeline Retention, Failure, and Runtime
4.4. QC Ablation Study
4.5. Qualitative Case Studies
4.6. Musicological Validation
4.7. Ground-Truth Benchmark on a Manually Annotated Jianpu Subset
4.8. Alternative-Prompting Baseline on the Note-Level Subset
- Original: the full schema-anchored prompt of Appendix A, which fixes the JSON field names (key, time_signature, bpm, parts→measures→note/duration) and the Jianpu conventions (octave dots written as 1^, 6_; lengths anchored to a quarter note).
- Minimal: a one-sentence Chinese instruction that says “this is a Jianpu image; return a JSON with fields key, time_signature, bpm, parts (each with role and measures; each note has note and duration)”. No Jianpu conventions, no octave-dot syntax, no example.
- Chain-of-thought (CoT): a two-step prompt that first asks the VLM to describe the page in natural language (key, meter, tempo, then note-by-note the first line of the melody), and only then asks it to emit the JSON.
5. Discussion
Further Systems-Level Limitations
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Model Configuration and Prompt Templates
Appendix A.1. Model Access and Inference Parameters
| Parameter | Value Used | Nearest Gemini GenerationConfig Field |
|---|---|---|
| model | gemini-2.5-flash | model |
| system_message | per-task system prompt (see Appendix A.2/Appendix A.3); played the role of Gemini’s system_instruction | system_instruction |
| tools | none (no function calls; no Google Search grounding) | tools |
| temperature | 0.7 | temperature |
| top_p | not set; provider default applied | top_p |
| top_k | not exposed by the OpenAI-compatible wrapper | top_k |
| max_tokens | 65,536 | max_output_tokens |
| response_format | strict JSON required by the prompt; malformed output is discarded, see Section 3.5 | response_mime_type/response_schema |
| stop | not set | stop_sequences |
| safety_settings | provider default (not overridden by the client) | safety_settings |
| random_seed | not set at the API level | seed (where supported) |
| timeout | 180 s per request | (transport) |
| retries/retry_delay | 10/20 s | (client) |
| image input | single-turn; image embedded as a base64 data URL alongside the text prompt | vision-input part |
Appendix A.2. Score-Type Classifier Prompt
You are a score-classification assistant.
Analyze the input score image and return JSON only:
{
"type": "staff" | "jianpu" | "mixed" | "lyrics_only" | "junk",
"layout": "single_staff" | "grand_staff" | "score" | "unknown",
"quality": "high" | "low" | "incomplete",
"content": {
"has_lyrics": boolean,
"has_chords": boolean
}
}
Appendix A.3. Jianpu Transcription Prompt
You are a Jianpu transcription expert.
Convert the input score image into structured JSON data.
Rules:
- 1.
-
Identify the key declaration (for example 1=D) and time signature.
- 2.
-
Read notes in musical order and encode pitch with digits 1-7.
-
Use 0 for rests.
- 3.
-
Use ^ for upper-octave dots and _ for lower-octave dots.
- 4.
-
Encode duration in beats:
- -
-
bare digit = quarter note
- -
-
underline = halve the duration
- -
-
dash = extend by one beat
- -
-
dot = extend by one half
- 5.
-
Distinguish melody from lyrics or accompaniment when possible.
Return JSON in the following form:
{
"key": "D",
"time_signature": "4/4",
"bpm": 90,
"parts": [
{
"role": "melody",
"measures": [
[ {"note": "1", "duration": 1.0},
{"note": "2", "duration": 1.0},
{"note": "3", "duration": 2.0} ]
]
}
]
}
References
- Deng, J.; Tang, Y. Music Information Retrieval in the Deep Learning Era: A Comprehensive Review. Expert Syst. Appl. 2024, 240, 122565. [Google Scholar] [CrossRef]
- Ji, S.; Luo, J.; Yang, X. A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions. arXiv 2020, arXiv:2011.06801. [Google Scholar] [CrossRef]
- Ma, Y.; Øland, A.; Ragni, A.; Del Sette, B.M.; Saitis, C.; Donahue, C.; Lin, C.; Plachouras, C.; Benetos, E.; Shatri, E.; et al. Foundation Models for Music: A Survey. arXiv 2024, arXiv:2408.14340. [Google Scholar] [CrossRef]
- Müller, M. Fundamentals of Music Processing; Springer: Cham, Switzerland, 2015; ISBN 978-3-319-21944-8. [Google Scholar]
- Schedl, M.; Gómez, E.; Urbano, J. Music Information Retrieval: Recent Developments and Applications. Found. Trends Inf. Retr. 2014, 8, 127–261. [Google Scholar] [CrossRef]
- Copet, J.; Kreuk, F.; Gat, I.; Remez, T.; Kant, D.; Synnaeve, G.; Adi, Y.; Défossez, A. Simple and Controllable Music Generation (MusicGen). In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Huang, C.-Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer: Generating Music with Long-Term Structure. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. Thesis, Columbia University, New York, NY, USA, 2016. Available online: https://colinraffel.com/publications/thesis.pdf (accessed on 1 March 2026).
- Hawthorne, C.; Stasyuk, A.; Roberts, A.; Simon, I.; Huang, C.-Z.A.; Dieleman, S.; Eck, D. Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=r1lYRjC9F7 (accessed on 1 March 2026).
- Du, Y. Introduction to Chinese National Music (Zhongguo Minzu Yinyue Gailun), 2nd ed.; Shanghai Music Publishing House: Shanghai, China, 2002; ISBN 978-7805530348. [Google Scholar]
- Zhang, Y. Eastern Rhythmic Foot and Western Colors: The Cross-Cultural Practice of Chinese Pentatonic Scales in Impressionist Music. J. Lit. Arts Res. 2025, 2, 1–11. [Google Scholar]
- Calvo-Zaragoza, J.; Hajič, J., Jr.; Pacha, A. Understanding Optical Music Recognition. ACM Comput. Surv. 2020, 53, 1–35. [Google Scholar] [CrossRef]
- Wu, F.-H. Applying Machine Learning in Optical Music Recognition of Numbered Music Notation. Int. J. Multimed. Data Eng. Manag. 2017, 8, 21–41. [Google Scholar] [CrossRef]
- Li, S.; Wu, Y. An Introduction to a Symbolic Music Dataset of Chinese Guqin Pieces and Its Application Example. J. Fudan Univ. (Nat. Sci.) 2020, 59, 276–285. [Google Scholar]
- Cuthbert, M.S.; Ariza, C. Music21: A Toolkit for Computer-Aided Musicology and Symbolic Music Data. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands, 9–13 August 2010; pp. 637–642. [Google Scholar]
- Bitteur, H. Audiveris: An Open-Source OMR Engine, version 5.3. 2023. Available online: https://audiveris.github.io/audiveris/ (accessed on 1 March 2026).
- Kong, Q.; Li, B.; Chen, J.; Wang, Y. GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music. arXiv 2020, arXiv:2010.07061. [Google Scholar] [CrossRef]
- Donahue, C.; Mao, H.H.; Li, Y.E.; Cottrell, G.W.; McAuley, J. LakhNES: Improving Multi-Instrumental Music Generation with Cross-Domain Pre-Training. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 4–8 November 2019; pp. 685–692. [Google Scholar]
- Wang, Z.; Chen, K.; Jiang, J.; Zhang, Y.; Xu, M.; Dai, S.; Bin, G.; Xia, G. POP909: A Pop-song Dataset for Music Arrangement Generation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Montreal, QC, Canada, 11–15 October 2020; pp. 38–45. [Google Scholar]
- Pfleiderer, M.; Frieler, K.; Abeßer, J.; Zaddach, W.-G.; Burkhart, B. (Eds.) Inside the Jazzomat: New Perspectives for Jazz Research; Schott Campus: Mainz, Germany, 2017; Weimar Jazz Database; Available online: https://jazzomat.hfm-weimar.de/dbformat/dboverview.html (accessed on 20 March 2026).
- van Kranenburg, P.; de Bruin, M.; Grijp, L.; Wiering, F. The Meertens Tune Collections: The Annotated Corpus (MTC-ANN) Versions 1.1 and 2.0.1. Meertens Online Reports. 2016. Available online: https://www.liederenbank.nl/mtc/ (accessed on 20 March 2026).
- Simonetta, F.; Carnovalini, F.; Orio, N.; Rodà, A. Symbolic Music Similarity through a Graph-Based Representation. In Proceedings of the Audio Mostly Conference, Nottingham, UK, 18–20 September 2018. [Google Scholar]
- Gotham, M.; Jonas, P.; Bower, B.; Bosma, W.; Bergomi, M.; Couturier, L.; Dang, L. Scores of Scores: An öMNES Opus and its Community-Driven Curation. In Proceedings of the International Conference on Digital Libraries for Musicology (DLfM), Budapest, Hungary, 28 July 2018; pp. 87–95, The Josquin Research Project. Available online: https://josquin.stanford.edu/ (accessed on 20 March 2026).
- Thickstun, J.; Harchaoui, Z.; Kakade, S.M. Learning Features of Music from Scratch. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; Available online: https://openreview.net/forum?id=rkFBJv9gg (accessed on 9 March 2026).
- Zhou, M.; Xu, S.; Liu, Z.; Wang, Z.; Yu, F.; Li, W.; Han, B. CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research. Trans. Int. Soc. Music Inf. Retr. 2025, 8, 22–38. Available online: https://transactions.ismir.net/articles/10.5334/tismir.194 (accessed on 9 March 2026). [CrossRef]
- Baró, A.; Riba, P.; Calvo-Zaragoza, J.; Fornés, A. From Optical Music Recognition to Handwritten Music Recognition: A Baseline. Pattern Recognit. Lett. 2019, 123, 1–8. [Google Scholar] [CrossRef]
- Krishnan, R.; Natarajan, B.; Vadivel, M. Numbered Musical Notation Recognition via Deep Layout Analysis and Template Matching. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), San José, CA, USA, 21–26 August 2023. [Google Scholar]
- Bu, F.; Li, R.; Li, Z.; Li, Y. The Renaissance of Expert Systems: Optical Recognition of Printed Chinese Jianpu Musical Scores with Lyrics. arXiv 2025, arXiv:2512.14758. [Google Scholar] [CrossRef]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; Volume 36. [Google Scholar]
- Bertin-Mahieux, T.; Ellis, D.P.W.; Whitman, B.; Lamere, P. The Million Song Dataset. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Miami, FL, USA, 24–28 October 2011; pp. 591–596. [Google Scholar]
- Vigliensoni, G.; Burlet, G.; Fujinaga, I. Optical Measure Recognition in Common Music Notation. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 4–8 November 2013; pp. 125–130. Available online: http://ismir2013.ismir.net/wp-content/uploads/2013/09/207_Paper.pdf (accessed on 9 March 2026).
- Google. Gemini 2.5 Flash [Software Documentation]. 2025. Available online: https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash (accessed on 9 March 2026).
- Narmour, E. The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model; University of Chicago Press: Chicago, IL, USA, 1990; ISBN 978-0226568425. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; Available online: https://github.com/OpenGVLab/InternVL (accessed on 1 March 2026).








| Dataset | Size | Genre | Format | QC |
|---|---|---|---|---|
| Lakh MIDI [8] | 176,581 files | Western pop/rock | MIDI | No |
| MAESTRO [9] | 1276 perf. | Classical piano | MIDI + audio | Yes |
| GiantMIDI-Piano [17] | 10,854 pieces | Classical piano | MIDI | Partial |
| NES Music DB [18] | 5278 tracks | Game music | Custom event | No |
| MusicNet [24] | 330 recordings | Classical chamber | MIDI + audio | Partial |
| POP909 [19] | 909 songs | Mandarin pop | MIDI | Partial |
| Weimar Jazz DB [20] | ∼456 solos | Jazz improvis. | SV/MIDI | Manual |
| Meertens Tune Coll. [21] | ∼18 k tunes | Dutch folksong | **kern/MusicXML | Manual |
| Wikifonia [22] | ∼6.5 k sheets | Pop lead sheets | MusicXML | No |
| Guqin dataset [14] | 71 pieces | Guqin music | MusicXML | Manual |
| CCMusic [25] | Multi-dataset | Chinese (audio) | Audio + meta | Partial |
| MSMP (ours) | Method | Trad. Chinese | Multi-src | Auto |
| Metric | Definition | Value | n |
|---|---|---|---|
| Key (pitch class) | Enharmonic-equivalent tonal pitch class match | 77.1% | 48 |
| Key (literal spelling) | Exact string match (e.g., vs. distinct) | 66.7% | 48 |
| Time signature | Exact match | 95.8% | 48 |
| BPM within | 100.0% | 44 | |
| BPM within | 100.0% | 44 | |
| Pitch F1 (first 16 notes) | Multiset F1 on integer pitch codes | 0.898 | 10 |
| Pitch-class F1 (first 16) | Multiset F1, octave-invariant | 0.955 | 10 |
| Symbol Error Rate | Mean Levenshtein/ | 0.150 | 10 |
| Page | Field | Value |
|---|---|---|
| Music 1 | Key (GT/Pred) | / (accidental dropped) |
| Meter (GT/Pred) | / | |
| GT first-16 | ||
| Pred first-16 | 2, 2, 3 | |
| Music 2 | Key (GT/Pred) | / (accidental dropped) |
| Meter (GT/Pred) | / | |
| GT first-16 | ||
| Pred first-16 | (pitch-class exact) |
| Quantity | Definition | Value |
|---|---|---|
| Successful calls | Out of total calls | 26/30 (86.7%) |
| Key agree rate | Fraction of run-pairs with identical predicted key | 22/22 (100.0%) |
| Time-signature agree rate | Fraction of run-pairs with identical | 22/22 (100.0%) |
| BPM agree rate (exact) | Fraction of run-pairs with identical predicted BPM | 22/22 (100.0%) |
| First-16 exact match | Fraction of run-pairs with bit-identical pitch sequence | 0.136 |
| First-16 pitch-code | Multiset over the first-16 pitch codes | 0.799 |
| First-16 pitch-class | Octave-invariant multiset | 0.930 |
| Prompt | Design | Key (pc) | Pitch F1 (pc) | SER | |
|---|---|---|---|---|---|
| Original (Appendix A) | Full schema + Jianpu convention + example | 10 | 1.000 | 0.931 | 0.156 |
| Minimal | One-sentence field list; no Jianpu convention | 10 | 1.000 | 0.945 | 0.135 |
| Chain-of-thought | Describe-then-JSON, two-step | 10 | 0.900 | 0.706 | 0.413 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhou, X.; Huang, Y.; Han, S.; Bai, J. A Multi-Source Pipeline for Extracting Traditional-Style Chinese Melody Data from Symbolic Files and Score Images. Computers 2026, 15, 298. https://doi.org/10.3390/computers15050298
Zhou X, Huang Y, Han S, Bai J. A Multi-Source Pipeline for Extracting Traditional-Style Chinese Melody Data from Symbolic Files and Score Images. Computers. 2026; 15(5):298. https://doi.org/10.3390/computers15050298
Chicago/Turabian StyleZhou, Xuanfei, Yinxuan Huang, Sining Han, and Jiangyao Bai. 2026. "A Multi-Source Pipeline for Extracting Traditional-Style Chinese Melody Data from Symbolic Files and Score Images" Computers 15, no. 5: 298. https://doi.org/10.3390/computers15050298
APA StyleZhou, X., Huang, Y., Han, S., & Bai, J. (2026). A Multi-Source Pipeline for Extracting Traditional-Style Chinese Melody Data from Symbolic Files and Score Images. Computers, 15(5), 298. https://doi.org/10.3390/computers15050298

