Extracting Composition Expression Patterns from Materials Science Patent Documents Using SEP-Tags
Abstract
1. Introduction
- We propose the SEP-tag, which marks the boundaries of composition expression patterns, to extract such patterns efficiently from materials science patent documents. The method introduces explicit structural boundary tokens into named entity recognition training.
- We propose a four-axis evaluation framework that assesses both entity recognition and pattern-level extraction quality. On English (10,166 sentences) and Japanese (975 sentences) materials science patent datasets, SEP-tag training raises the pattern extraction F1 score (Axis 4) from 0.0784 to 0.8503 and from 0.0361 to 0.4877, respectively, while changing the entity F1 score by less than 0.003 in either language.
- We confirmed that phrases assigned SEP-tags are lexically distinct from the components of composition expression patterns (97.0% of unique SEP-tag strings in English and 99.0% in Japanese show no overlap). We also confirmed that SEP-tags do not split any composition expression pattern in either the English or the Japanese data.
- We show that SEP-tag prediction quality correlates with the extraction accuracy of composition expression patterns: the more accurately SEP-tags are predicted, the more accurately the patterns are extracted.
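The role of SEP-tags as pattern boundaries can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by the text above, that extraction amounts to cutting a predicted label sequence at every SEP label and looking up each resulting segment in a template table. The function names and the three-entry `TEMPLATES` dictionary are hypothetical; the token sequences and pattern IDs are copied from Appendix A.

```python
from typing import Dict, List, Tuple

# A few templates copied from Appendix A (token sequence -> pattern ID).
TEMPLATES: Dict[Tuple[str, ...], str] = {
    ("atom", "fig_LL", "limitation", "fig_UL"): "RRAE01",
    ("atom", "fig_LL", "limitation", "fig_UL", "unit"): "RRAE04",
    ("atom", "limitation", "fig_UL"): "ROSE05",
}


def split_at_sep(labels: List[str]) -> List[Tuple[str, ...]]:
    """Cut a label sequence into non-empty segments at every SEP label."""
    segments: List[Tuple[str, ...]] = []
    current: List[str] = []
    for label in labels:
        if label == "SEP":
            if current:
                segments.append(tuple(current))
            current = []
        else:
            current.append(label)
    if current:
        segments.append(tuple(current))
    return segments


def extract_patterns(labels: List[str]) -> List[str]:
    """Return the pattern ID of every segment that exactly matches a template."""
    return [TEMPLATES[seg] for seg in split_at_sep(labels) if seg in TEMPLATES]


# Hypothetical entity-level labels for one sentence (one label per entity).
predicted = ["SEP", "atom", "fig_LL", "limitation", "fig_UL", "unit",
             "SEP", "atom", "limitation", "fig_UL", "SEP"]
print(extract_patterns(predicted))  # ['RRAE04', 'ROSE05']
```

Without the SEP labels, the same sequence would be one undivided run of labels, and any matcher would have to guess where one pattern ends and the next begins.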
2. Related Work
2.1. Named Entity Recognition Approaches
LLM-Based Approaches
2.2. Named Entity Recognition in Materials Science
2.2.1. Early and Pre-Trained Models
2.2.2. Recent Advances
2.3. Structural and Relational Annotation in Named Entity Recognition
3. Proposed Method
3.1. Task Definition and Composition Expression Pattern Structure
3.2. SEP-Tag Named Entity Recognition Model
3.3. Extraction of Composition Expression Patterns from Predicted Labels
3.4. Four-Axis Evaluation Framework
3.4.1. Axis 1: Entity-Level F1 Score
3.4.2. Axis 2: Correct Span Exact Match Rate
3.4.3. Axis 3: Predicted Span Pattern Extraction F1 Score
3.4.4. Axis 4: Pattern Extraction F1 Score
4. Experiments
4.1. Datasets
4.1.1. English Dataset
4.1.2. Japanese Dataset
4.1.3. Definition of Composition Expression Patterns
4.2. Training Settings
4.3. Main Results
4.4. Statistical Significance Tests
5. Analysis
5.1. Why Do SEP-Tags Not Lower Entity F1 Score?
5.2. Results for Composition Expression Patterns in Axis 4
5.3. Why Does the Precision of the Model Without SEP-Tags Decline in Axis 4?
Error Analysis
5.4. Analysis of False Negatives in the Japanese Dataset
The Correlation Between SEP-Tags and the Accuracy of Composition Expression Pattern Extraction
5.5. Validation of the Non-Crossing Hypothesis
5.5.1. Method 1: Semantic Categorization of Strings with SEP-Tags
5.5.2. Method 2: Lexical Distinctness of SEP-Tag Strings from Content Entities
5.5.3. Method 3: SEP-Tag Non-Crossing Verification
5.6. Cross-Lingual Consistency
5.6.1. Entity F1 Score (Axis 1)
5.6.2. Reduction of False Positives (Axis 4)
5.6.3. Differences Between Languages
5.7. Comparison of Computational Costs
6. Conclusions
- Adding rule-based post-processing to the baseline without SEP-tags. We believe that this baseline provides a fair comparison with the SEP-tag model because the two share the same architecture and training data and differ only in the presence or absence of SEP-tags. Nevertheless, adding rule-based post-processing would likely make the baseline more competitive.
- Improving recall by expanding the Japanese dataset.
- Direct improvement of SEP-tag prediction quality. As the SEP-tag F1 score increases, the extraction accuracy of composition expression patterns improves consistently; for samples with a SEP-tag F1 score of 0.8 or higher, the average Axis 4 F1 score already reaches 0.899.
- Evaluation using independent test data consisting solely of gold labels. We have confirmed that the pseudo-labels generated by the trained RoBERTa-large model contain few errors and are of sufficient quality. However, because the test fold also contains pseudo-labeled data, we cannot rule out performance overestimation. In future work, we plan to make the evaluation more reliable by using independent test data composed solely of gold labels.
- Extending SEP-tag-based extraction of composition expression patterns to other domains. The approach may apply wherever structural patterns consist of multiple entities and the vocabulary that marks the pattern boundaries is well defined. In academic papers and legal documents, for example, similar structural pattern extraction can be expected if domain experts define the boundary vocabulary as SEP-tags.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. English Pattern Templates (Complete List)
| Pattern ID | Relation | Token Sequence | Count |
|---|---|---|---|
| ROSE05 | Compo_OS | atom → limitation → fig_UL | 149 |
| ROSE01 | Compo_OS | atom → fig_LL → unit → limitation | 133 |
| RRAE01 | Compo_RA | atom → fig_LL → limitation → fig_UL | 125 |
| ROSE04 | Compo_OS | atom → limitation → fig_LL → unit | 109 |
| RRAE04 | Compo_RA | atom → fig_LL → limitation → fig_UL → unit | 95 |
| RRAE18 | Compo_RA | fig_LL → unit → limitation → fig_UL → unit → atom | 65 |
| ROSE13 | Compo_OS | limitation → fig_UL → atom | 52 |
| RRAE02 | Compo_RA | atom → fig_LL → unit → limitation → fig_UL → unit | 47 |
| ROSE09 | Compo_OS | fig_UL → unit → limitation → atom | 44 |
| ROSE10 | Compo_OS | fig_LL → unit → limitation → atom | 44 |
| RRAE21 | Compo_RA | limitation → fig_LL → limitation → fig_UL → unit → atom | 43 |
| RRAE17 | Compo_RA | fig_LL → limitation → fig_UL → atom | 30 |
| RRAE05 | Compo_RA | atom → limitation → fig_LL → unit → limitation → fig_UL → unit | 28 |
| RRAE20 | Compo_RA | limitation → fig_LL → unit → limitation → fig_UL → unit → atom | 23 |
| RRAE03 | Compo_RA | atom → fig_LL → unit → limitation → fig_UL → unit → limitation | 16 |
| RRAE25 | Compo_RA | fig_LL → unit → limitation → fig_UL → unit → udef → atom | 10 |
| RRAE22 | Compo_RA | fig_LL → unit → limitation → fig_UL → unit → limitation → atom | 9 |
| RRAE14 | Compo_RA | fig_LL → unit → limitation → atom → limitation → fig_UL → unit | 8 |
| RRAE07 | Compo_RA | atom → fig_LL → unit → limitation → limitation → fig_UL → unit | 6 |
| RRAE09 | Compo_RA | atom → fig_UL → unit → limitation → limitation → fig_LL → unit | 6 |
| RRAE10 | Compo_RA | atom → limitation → fig_LL → unit → limitation → fig_UL → unit → udef | 6 |
| ROSE08 | Compo_OS | atom → limitation → fig_UL → unit → udef | 4 |
| RRAE26 | Compo_RA | limitation → fig_LL → unit → limitation → fig_UL → unit → udef → atom | 4 |
| ROSE14 | Compo_OS | fig_UL → limitation → atom | 3 |
| ROSE07 | Compo_OS | atom → fig_UL → unit → limitation → udef | 2 |
| RRAE08 | Compo_RA | atom → limitation → fig_LL → unit → fig_UL → unit → limitation | 2 |
| RRAE12 | Compo_RA | atom → fig_LL → unit → limitation → fig_UL → unit → udef | 2 |
| RRAE13 | Compo_RA | atom → fig_LL → limitation → fig_UL → unit → udef | 2 |
| RRAE23 | Compo_RA | fig_LL → unit → limitation → limitation → fig_UL → unit → atom | 2 |
| RRAE24 | Compo_RA | fig_UL → unit → limitation → limitation → fig_LL → unit → atom | 2 |
| RRAE11 | Compo_RA | atom → fig_LL → unit → limitation → limitation → fig_UL → unit → udef | 1 |
| RRAE15 | Compo_RA | fig_LL → limitation → atom → limitation → fig_UL → unit | 1 |
| RRAE16 | Compo_RA | fig_LL → limitation → atom → limitation → fig_UL | 1 |
Appendix B. Japanese Pattern Templates (Complete List)
| Pattern ID | Relation | Token Sequence | Count |
|---|---|---|---|
| RRA008 | Compo_RA | atom → fig_LL → limitation → fig_UL → unit | 1710 |
| ROS004 | Compo_OS | atom → fig_UL → unit → limitation | 968 |
| RRA006 | Compo_RA | atom → fig_LL → unit → limitation → fig_UL → unit → limitation | 254 |
| RRA007 | Compo_RA | fig_LL → limitation → fig_UL → unit → atom | 82 |
| RRA002 | Compo_RA | atom → fig_LL → unit → limitation → fig_UL → unit | 68 |
| ROS005 | Compo_OS | atom → limitation → fig_UL → unit | 52 |
| ROS002 | Compo_OS | atom → fig_LL → unit → limitation | 50 |
| RRA001 | Compo_RA | atom → fig_LL → limitation → fig_UL | 38 |
| RRA011 | Compo_RA | atom → fig_LL → limitation → fig_UL → unit → limitation | 36 |
| ROS007 | Compo_OS | fig_UL → unit → limitation → atom | 35 |
| ROS009 | Compo_OS | limitation → fig_UL → unit → atom | 24 |
| ROS008 | Compo_OS | fig_LL → unit → limitation → atom | 10 |
| ROS003 | Compo_OS | atom → fig_UL → limitation | 10 |
| RRA004 | Compo_RA | fig_LL → unit → limitation → fig_UL → unit → atom | 9 |
| RRA009 | Compo_RA | fig_LL → limitation → fig_UL → unit | 4 |
| ROS006 | Compo_OS | atom → limitation → fig_LL → unit | 4 |
| ROS001 | Compo_OS | atom → fig_LL → limitation | 3 |
| RRA003 | Compo_RA | fig_LL → limitation → fig_UL → atom | 2 |
Appendix C. Entity Label Frequency Statistics
| Label | Count | Ratio (%) |
|---|---|---|
| atom | 19,933 | 15.0 |
| unit | 12,557 | 9.4 |
| fig_UL | 11,548 | 8.7 |
| fig_LL | 9450 | 7.1 |
| use | 9015 | 6.8 |
| substance | 6776 | 5.1 |
| unit_def | 1573 | 1.2 |
| variable | 1346 | 1.0 |
| formula | 1205 | 0.9 |
| balance | 827 | 0.6 |
| fig | 628 | 0.5 |
| f_no | 562 | 0.4 |
| sum | 230 | 0.2 |
| limitation | 14,563 | 11.0 |
| selection | 2478 | 1.9 |
| SEP | 40,232 | 30.3 |
| Label | Count | Ratio (%) |
|---|---|---|
| atom | 6122 | 18.3 |
| unit | 4691 | 14.0 |
| fig_UL | 4170 | 12.4 |
| fig_LL | 2939 | 8.8 |
| use | 992 | 3.0 |
| variable | 594 | 1.8 |
| unit_def | 453 | 1.4 |
| balance | 382 | 1.1 |
| substance | 382 | 1.1 |
| formula | 365 | 1.1 |
| f_no | 285 | 0.9 |
| sum | 97 | 0.3 |
| fig | 19 | 0.1 |
| limitation | 4923 | 14.7 |
| selection | 955 | 2.8 |
| SEP | 6146 | 18.3 |
References
- Weston, L.; Tshitoyan, V.; Dagdelen, J.; Kononova, O.; Trewartha, A.; Persson, K.A.; Ceder, G.; Jain, A. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. 2019, 59, 3692–3702.
- Song, Y.; Miret, S.; Liu, B. MatSci-NLP: Evaluating Scientific Language Models on Materials Science Language Tasks Using Text-to-Schema Modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 3621–3639.
- Jiang, X.; Wang, W.; Tian, S.; Wang, H.; Lookman, T.; Su, Y. Applications of natural language processing and large language models in materials discovery. npj Comput. Mater. 2025, 11, 79.
- Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; Kononova, O.V.; Persson, K.A.; Ceder, G.; Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019, 571, 95–98.
- Kononova, O.; He, T.; Huo, H.; Trewartha, A.; Olivetti, E.A.; Ceder, G. Opportunities and challenges of text mining in materials research. iScience 2021, 24, 102155.
- Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70.
- Yadav, V.; Bethard, S. A survey on recent advances in named entity recognition from deep learning models. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 2145–2158.
- Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A survey on Named Entity Recognition—Datasets, tools, and methodologies. Nat. Lang. Process. J. 2023, 3, 100017.
- Hu, Z.; Hou, W.; Liu, X. Deep learning for named entity recognition: A survey. Neural Comput. Appl. 2024, 36, 8995–9022.
- Keraghel, I.; Morbieu, S.; Nadif, M. Recent Advances in Named Entity Recognition: A Comprehensive Survey and Comparative Study. arXiv 2024, arXiv:2401.10825.
- Sakai, T.; Chiwata, N.; Mine, T. Named Entity Recognition with Clue-Word Tags From Patent Documents in Materials Science. IEEE Access 2026, 14, 38332–38346.
- Vaucher, A.C.; Zipoli, F.; Geluykens, J.; Nair, V.H.; Schwaller, P.; Laino, T. Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 2020, 11, 3601.
- Jiang, L.; Goetz, S.M. Natural language processing in the patent domain: A survey. Artif. Intell. Rev. 2025, 58, 214.
- Wang, Y.; Yu, B.; Zhu, H.; Liu, T.; Yu, N.; Sun, L. Discontinuous Named Entity Recognition as Maximal Clique Discovery. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 1, pp. 764–774.
- Wang, J.; Shou, L.; Chen, K.; Chen, G. Pyramid: A Layered Model for Nested Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5918–5928.
- Cabral, R.C.; Han, S.C.; Alhassan, A.; Batista-Navarro, R.; Nenadic, G.; Poon, J. TriG-NER: Triplet-Grid Framework for Discontinuous Named Entity Recognition. In WWW’25: Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 2824–2837.
- Ramshaw, L.A.; Marcus, M.P. Text Chunking Using Transformation-Based Learning. In Proceedings of the Third Workshop on Very Large Corpora; Massachusetts Institute of Technology: Cambridge, MA, USA, 1995; pp. 82–94.
- Ratinov, L.; Roth, D. Design Challenges and Misconceptions in Named Entity Recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), Boulder, CO, USA, 4–5 June 2009; pp. 147–155.
- Lee, K.; He, L.; Lewis, M.; Zettlemoyer, L. End-to-end Neural Coreference Resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 7–11 September 2017; pp. 188–197.
- Fu, J.; Huang, X.; Liu, P. SpanNER: Named Entity Re-/Recognition as Span Prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP), Online, 1–6 August 2021; pp. 7183–7195.
- Wang, S.; Sun, X.; Li, X.; Ouyang, R.; Wu, F.; Zhang, T.; Li, J.; Wang, G.; Guo, C. GPT-NER: Named Entity Recognition via Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, NM, USA, 29 April–4 May 2025; pp. 4257–4275.
- Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; Poon, H. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. arXiv 2024, arXiv:2308.03279.
- Kim, E.; Huang, K.; Tomala, A.; Matthews, S.; Strubell, E.; Saunders, A.; McCallum, A.; Olivetti, E. Machine-learned and codified synthesis parameters of oxide materials from scientific literature. Sci. Data 2017, 4, 170127.
- Gupta, T.; Zaki, M.; Krishnan, N.M.A.; Mausam, M. MatSciBERT: A materials domain language model for text mining and information extraction. npj Comput. Mater. 2022, 8, 102.
- Trewartha, A.; Walker, N.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K.A.; Ceder, G.; Jain, A. Quantifying the advantage of domain-specific pre-training on named entity recognition tasks in materials science. Patterns 2022, 3, 100488.
- Mavračić, J.; Court, C.J.; Isazawa, T.; Elliott, S.R.; Cole, J.M. ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science. J. Chem. Inf. Model. 2021, 61, 4280–4289.
- Huang, Z.; He, L.; Yang, Y.; Li, A.; Zhang, Z.; Wu, S.; Wang, Y.; He, Y.; Liu, X. Application of machine reading comprehension techniques for named entity recognition in materials science. J. Cheminform. 2024, 16, 76.
- Foppiano, L.; Lambard, G.; Amagasa, T.; Ishii, M. Mining experimental data from materials science literature with large language models: An evaluation study. Sci. Technol. Adv. Mater. Methods 2024, 4, 2356506.
- Potu, S.T.; Niranjan Murthy, R.; Thomas, A.; Mishra, L.; Prange, N.; Durmaz, A.R. Ontology-conformal recognition of materials entities using language models. Sci. Rep. 2025, 15, 18597.
- He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2021, arXiv:2006.03654.
- Luan, Y.; He, L.; Ostendorf, M.; Hajishirzi, H. Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018; pp. 3219–3232.
- Wadden, D.; Wennberg, U.; Luan, Y.; Hajishirzi, H. Entity, Relation, and Event Extraction with Contextualized Span Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5784–5789.
- Hosseini-Asl, E.; McCann, B.; Wu, C.S.; Yavuz, S.; Socher, R. A Simple Language Model for Task-Oriented Dialogue. In NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 20179–20191.
- Soares, L.B.; FitzGerald, N.; Ling, J.; Kwiatkowski, T. Matching the Blanks: Distributional Similarity for Relation Learning. CoRR 2019, abs/1906.03158.
- Zhou, W.; Chen, M. An Improved Baseline for Sentence-level Relation Extraction. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Online, 20–23 November 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; Volume 2, pp. 161–168.
- Sainz, O.; García-Ferrero, I.; Agerri, R.; Lacalle, O.; Rigau, G.; Agirre, E. GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction. In Proceedings of the International Conference on Learning Representations 2024, Vienna, Austria, 7–11 May 2024; Volume 2024, pp. 47083–47107.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
- Nakayama, H. Seqeval: A Python Framework for Sequence Labeling Evaluation. 2018. Available online: https://github.com/chakki-works/seqeval (accessed on 19 April 2026).
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30, pp. 5998–6008.



| Label | Type | Description and Example |
|---|---|---|
| atom | content | Chemical element or material name (Fe, aluminum, carbon) |
| fig_LL | content | Lower bound numeric value (0.01, 1.0) |
| fig_UL | content | Upper bound numeric value (0.3, 5.0) |
| fig | content | Single numeric value (no bound distinction) (5) |
| unit | content | Unit of measurement (mass%, wt%, %) |
| unit_def | content | Unit definition prefix (in mass %, by weight) |
| use | content | Intended use or product name (steel, alloy) |
| substance | content | Compound or substance name (oxide, carbide) |
| variable | content | Variable or parameter symbol (x, n) |
| formula | content | Chemical formula (Fe3C, Al2O3) or formula ((Mo + W) ≧ 4.3%) |
| balance | content | Remainder/balance expression (balance, remainder) |
| f_no | content | Formula or figure reference number (Formula (1)) |
| sum | content | Sum or total expression (total, in sum) |
| limitation | structural | Constraint or comparison keyword (to, or less, at most, ≤) |
| selection | structural | Selection expression (one or more of, selected from) |
| SEP | structural | Composition expression pattern boundary separator (wherein, comprising) |
| Axis | Name | Comparison Unit | Model | Overview |
|---|---|---|---|---|
| 1 | Entity-level F1 score | Token (entity span) | Both | Standard named entity recognition accuracy. Structural labels excluded. |
| 2 | Correct span exact match rate | Entity sequence | Both | Ground-truth SEP-tag spans provided. Isolates entity labeling. |
| 3 | Predicted span pattern extraction F1 score | Entity sequence | The model with SEP-tags | Self-predicted SEP-tag boundaries. End-to-end span plus entity. |
| 4 | Pattern extraction F1 score | Pattern ID | Both | No boundary hint. Most practical evaluation. |
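The Axis 4 comparison unit is the pattern ID, which suggests a set-style score over extracted IDs. The sketch below is one plausible reading, not the authors' confirmed procedure: it treats the gold and predicted pattern IDs of each sentence as multisets, counts matches per sentence, and micro-averages over sentences. The function name `pattern_prf` and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def pattern_prf(gold: Sequence[List[str]],
                pred: Sequence[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-sentence pattern IDs."""
    tp = fp = fn = 0
    for gold_ids, pred_ids in zip(gold, pred):
        gold_counts, pred_counts = Counter(gold_ids), Counter(pred_ids)
        # Multiset intersection: each gold pattern can be matched only once.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with two sentences (pattern IDs taken from Appendix A).
gold = [["RRAE04", "ROSE05"], ["RRAE01"]]
pred = [["RRAE04", "ROSE01"], ["RRAE01", "RRAE01"]]
precision, recall, f1 = pattern_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Under this reading, a model that emits many spurious patterns keeps a high recall but collapses in precision, which is the behavior Axis 4 is designed to expose.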
| Token Sequence | Relation | En ID | Ja ID | Count (En/Ja) |
|---|---|---|---|---|
| atom → fig_LL → limitation → fig_UL → unit | Compo_RA | RRAE04 | RRA008 | 95/1710 |
| atom → fig_LL → unit → limitation | Compo_OS | ROSE01 | ROS002 | 133/50 |
| atom → fig_LL → limitation → fig_UL | Compo_RA | RRAE01 | RRA001 | 125/38 |
| atom → fig_LL → unit → limitation → fig_UL → unit | Compo_RA | RRAE02 | RRA002 | 47/68 |
| fig_UL → unit → limitation → atom | Compo_OS | ROSE09 | ROS007 | 44/35 |
| Model/Hyperparameter | Value |
|---|---|
| English base model | roberta-base |
| Japanese base model | cl-tohoku/bert-base-japanese-whole-word-masking |
| Maximum sequence length | 512 tokens |
| Batch size | 8 |
| Learning rate | (AdamW [40]) |
| LR scheduler | Linear warmup (10%) + linear decay |
| Maximum epochs | 20 |
| Early stopping patience | 5 epochs |
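The scheduler row (linear warmup over the first 10% of steps, then linear decay) can be written as a multiplier on the peak learning rate. This is a minimal sketch of that shape only. The peak learning rate is not legible in the table, so it is left as a caller-supplied value, and the function name `lr_multiplier` is hypothetical. Libraries such as Hugging Face Transformers provide an equivalent schedule (`get_linear_schedule_with_warmup`), but whether it was used here is an assumption.

```python
def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.10) -> float:
    """Factor applied to the peak learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 to 1 over the first 10% of training.
        return step / warmup_steps
    # Linear decay from 1 at the end of warmup to 0 at the last step.
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, lr_multiplier(step, total))
```

With early stopping (patience of 5 epochs), training may end before the decay reaches zero, so the effective final learning rate depends on when the stopping criterion fires.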
| Model | Precision (Left) | Recall (Left) | F1 (Left) | Precision (Right) | Recall (Right) | F1 (Right) |
|---|---|---|---|---|---|---|
| without SEP-tags | 0.9095 | 0.9310 | 0.9201 | 0.8875 | 0.9111 | 0.8992 |
| with SEP-tags | 0.9110 | 0.9324 | 0.9216 | 0.8853 | 0.9119 | 0.8984 |
| Diff. | +0.0015 | +0.0014 | +0.0015 | −0.0023 | +0.0007 | −0.0008 |
| Axis | Metric | English (10,166): w/o SEP-Tags | English (10,166): w/ SEP-Tags | English (10,166): Diff. | Japanese (975): w/o SEP-Tags | Japanese (975): w/ SEP-Tags | Japanese (975): Diff. |
|---|---|---|---|---|---|---|---|
| 2 | Exact match rate | 0.1595 | 0.7567 | +0.5972 | 0.1754 | 0.8655 | +0.6901 |
| 3 | Precision | — | 0.8680 | — | — | 0.7830 | — |
| 3 | Recall | — | 0.8937 | — | — | 0.6014 | — |
| 3 | F1 score | — | 0.8807 | — | — | 0.6803 | — |
| 4 | Precision | 0.0412 | 0.8381 | +0.7969 | 0.0185 | 0.5613 | +0.5428 |
| 4 | Recall | 0.8354 | 0.8630 | +0.0276 | 0.7826 | 0.4312 | −0.3514 |
| 4 | F1 score | 0.0784 | 0.8503 | +0.7719 | 0.0361 | 0.4877 | +0.4516 |
| 4 | False positives | 54,404 | 466 | | 11,485 | 93 | |
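The F1 scores in this table are consistent with the harmonic mean of the listed precision and recall. As a worked check, using the Japanese Axis 4 values with SEP-tags (the symbols $P$ and $R$ are introduced here for illustration):

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.5613 \times 0.4312}{0.5613 + 0.4312}
    = \frac{0.4841}{0.9925}
    \approx 0.4877
```

This matches the reported value. For some other cells the recomputed value can differ in the last digit, because the listed precision and recall are themselves rounded.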
| Label | English (10,166): Mean Diff | English (10,166): p-Value | Japanese (975): Mean Diff | Japanese (975): p-Value |
|---|---|---|---|---|
| atom | −0.0126 | 0.0955 | — | — |
| balance | — | — | −0.0179 | 0.1797 |
| fig | +0.0595 | 0.1683 | — | — |
| fig_UL | — | — | +0.0089 | 0.0754 |
| substance | — | — | −0.0306 | 0.2619 |
| sum | +0.0564 | 0.1688 | — | — |
| unit | — | — | −0.0073 | 0.2845 |
| variable | +0.0148 | 0.1048 | — | — |
| (all BH-corrected , 0/13 significant after BH correction) | | | | |
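The table note refers to Benjamini-Hochberg (BH) correction across the per-label tests. The following is a minimal sketch of the standard BH adjustment, written in plain Python for illustration. The input p-values are hypothetical and are not taken from the table, since the table does not state whether its p-values are raw or adjusted. The function name is also hypothetical.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order of the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.042]  # hypothetical raw p-values
adjusted = benjamini_hochberg(raw)
significant = sum(p < 0.05 for p in adjusted)
print([round(p, 4) for p in adjusted], significant)
```

A label counts as significant only if its adjusted p-value falls below the chosen threshold, which is how a statement such as "0/13 significant after BH correction" would be obtained.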
| Dataset | n | Without SEP-Tags | With SEP-Tags | Diff. | p-Value |
|---|---|---|---|---|---|
| English | 10,166 | 0.8190 | 0.8212 | +0.0023 | 0.0147 |
| Japanese | 975 | 0.8652 | 0.8625 | −0.0027 | 0.3308 |
| Pattern | Support | Without SEP-Tags: Precision | Without SEP-Tags: Recall | Without SEP-Tags: F1 Score | With SEP-Tags: Precision | With SEP-Tags: Recall | With SEP-Tags: F1 Score | F1 Diff. |
|---|---|---|---|---|---|---|---|---|
| RRAE19 | 116 | 0.014 | 0.155 | 0.025 | 0.895 | 0.879 | 0.887 | +0.862 |
| RTOE01 | 106 | 0.038 | 0.292 | 0.067 | 0.791 | 0.821 | 0.806 | +0.738 |
| RRAE04 | 40 | 0.000 | 0.000 | 0.000 | 0.791 | 0.850 | 0.819 | +0.819 |
| ROSE11 | 29 | 0.002 | 0.034 | 0.004 | 0.844 | 0.931 | 0.885 | +0.881 |
| ROSE02 | 21 | 0.000 | 0.000 | 0.000 | 0.714 | 0.714 | 0.714 | +0.714 |
| RRAE21 | 20 | 0.010 | 0.050 | 0.017 | 1.000 | 0.650 | 0.788 | +0.771 |
| ROSE03 | 20 | 0.200 | 0.050 | 0.080 | 0.737 | 0.700 | 0.718 | +0.638 |
| RRAE18 | 19 | 0.004 | 0.053 | 0.008 | 0.778 | 0.737 | 0.757 | +0.749 |
| ROSE04 | 14 | 0.000 | 0.000 | 0.000 | 0.800 | 0.857 | 0.828 | +0.828 |
| RRAE06 | 14 | 0.000 | 0.000 | 0.000 | 0.833 | 0.714 | 0.769 | +0.769 |
| RRAE02 | 10 | 0.067 | 0.100 | 0.080 | 0.750 | 0.900 | 0.818 | +0.738 |
| OTHER | 2319 | 0.042 | 0.928 | 0.080 | 0.841 | 0.873 | 0.857 | +0.776 |
| SEP-Tags F1 Score Range | Samples | Axis 4 Average F1 Score |
|---|---|---|
| [0.8, 1.0] | 2214 | 0.899 |
| [0.6, 0.8) | 146 | 0.476 |
| [0.4, 0.6) | 39 | 0.252 |
| [0.2, 0.4) | 8 | 0.125 |
| Category | En Example | Ja Example | En Coverage | Ja Coverage |
|---|---|---|---|---|
| Composition Enumeration Starters | comprising, wherein | karanaru, woganyuushi | 30.2% | 17.0% |
| Composition Enumeration Separators | ‘,’ ‘;’ and | oyobi, narabini | 53.3% | 66.3% |
| Clause/Sentence Terminator | ‘.’ | dearu | 15.4% | 12.5% |
| Other | - | - | 1.1% | 4.1% |
| Dataset | Unique SEP Types | Non-Overlapping | Overlapping |
|---|---|---|---|
| English (1000 human) | 165 | 160 (97.0%) | 5 |
| Japanese (975) | 299 | 296 (99.0%) | 3 |
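The overlap counts in this table can be reproduced in spirit with a simple set comparison. The sketch below assumes that "overlapping" means a unique SEP-tag string that also occurs as the surface string of a content entity; that definition, the function name `overlap_report`, and the toy strings are assumptions made for illustration.

```python
from typing import Dict, Iterable


def overlap_report(sep_strings: Iterable[str],
                   content_strings: Iterable[str]) -> Dict[str, float]:
    """Count unique SEP-tag strings that do or do not also occur as content entities."""
    sep_types = set(sep_strings)
    overlapping = sep_types & set(content_strings)
    non_overlapping = sep_types - overlapping
    ratio = 100.0 * len(non_overlapping) / len(sep_types) if sep_types else 0.0
    return {
        "unique_sep_types": len(sep_types),
        "non_overlapping": len(non_overlapping),
        "overlapping": len(overlapping),
        "non_overlapping_pct": round(ratio, 1),
    }


# Toy data: "and" is deliberately placed in both lists to show an overlap.
sep = ["comprising", "wherein", ",", "and", "comprising"]
content = ["Fe", "steel", "and"]
report = overlap_report(sep, content)
print(report)

# The percentages in the table follow from its counts.
print(round(100 * 160 / 165, 1), round(100 * 296 / 299, 1))  # 97.0 99.0
```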
| Method | Training Cost | Inference Cost | Constraints |
|---|---|---|---|
| BIO/BIOES | Low () | Low () | Cannot be patterned |
| SEP-tags | Low () | Low () | Initial annotation only |
| LLM | None | High (API) † | Privacy and reproducibility |
| Span-based | Medium () | Medium () | Span enumeration required |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Sakai, T.; Chiwata, N.; Mine, T. Extracting Composition Expression Patterns from Materials Science Patent Documents Using SEP-Tags. Big Data Cogn. Comput. 2026, 10, 217. https://doi.org/10.3390/bdcc10070217

