STAR: Steelmaking Task-Aware Routing for Multi-Agent LLM Expert Systems
Abstract
1. Introduction
- We present an end-to-end workflow for building a steelmaking-oriented routing framework, covering OCR-based text preprocessing and quality scoring, fine-grained process-domain definitions, LLM-assisted query construction, and domain-labeled vector index construction, culminating in a practical multi-stage router design [9,10].
- We construct a fine-grained steel-domain question set and a domain-labeled vectorized knowledge space. By organizing metallurgy texts alongside process domains, we generate and label typical engineering queries and build a FAISS-based index with domain metadata, providing a data foundation for routing evaluation and domain-scoped retrieval integration [3,6].
- We design a three-stage process-domain router that combines rule-based heuristics, retrieval-based neighbor voting, and LLM-based refinement. We further provide a router-plus-agents integration blueprint in which routing labels map to domain-specific prompting and retrieval scopes, enabling stage-aware query dispatching while remaining extensible to additional domains and components [5,7,8] (see the dispatch sketch after this list).
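As a concrete illustration of this blueprint (not the deployed configuration), the following Python sketch shows how a routing label can select a domain-scoped system prompt and retrieval index; all names, prompts, and paths are hypothetical placeholders:

```python
# Hypothetical mapping from router output to domain-scoped resources.
# Labels follow the eight process domains used in this paper; prompts
# and index paths are illustrative placeholders only.
DOMAIN_AGENTS = {
    "continuous_casting": {
        "system_prompt": "You are a continuous-casting process expert...",
        "index_path": "indexes/continuous_casting.faiss",
    },
    "heat_treatment": {
        "system_prompt": "You are a heat-treatment expert...",
        "index_path": "indexes/heat_treatment.faiss",
    },
    # ... remaining six steel domains ...
}

def dispatch(label: str) -> dict:
    """Return the agent configuration for a routing label; fall back to
    a general-purpose LLM for non-steel queries."""
    return DOMAIN_AGENTS.get(
        label, {"system_prompt": "general assistant", "index_path": None}
    )
```

Because the mapping is a plain lookup table, adding a new process domain only requires registering a new label with its prompt and index, without retraining the router.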
2. Related Work
3. Method
3.1. Overall System Architecture
3.2. Metallurgical Corpus Construction and Question Generation
3.2.1. OCR Text Preprocessing and Quality Assessment
3.2.2. Corpus Statistics, Quality Evidence, and Compliance
3.2.3. Fine-Grained Domain Partitioning and Corpus Annotation
3.3. Vectorized Knowledge Space and Retrieval Module
3.3.1. Semantic Encoding and Vector Space Construction
3.3.2. FAISS Index and Similar Question Retrieval
3.4. Multi-Stage Question Routing Mechanism
| Algorithm 1: Multi-stage question routing mechanism. | |
| Require: Query $q$; steel question bank with $\ell_2$-normalized vectors; domain set $\mathcal{D}$; parameters $\tau_1$ (Stage 1 distance threshold), $\tau_2$ (Stage 2 confidence threshold), and Top-$k$ | |
| Ensure: Routing label in $\mathcal{D} \cup \{\text{general\_llm}\}$ ▹ Stage 1: Ultra-fast filtering | |
| 1: | Normalize and clean $q$ ▹ optional: short-query guard; strip URLs/emojis for noisy long texts |
| 2: | if $q$ matches the chit-chat keyword set then return general_llm |
| 3: | end if |
| 4: | $\mathbf{v}_q \leftarrow \mathrm{encode}(q)$; $\mathbf{v}_q \leftarrow \mathbf{v}_q / \lVert \mathbf{v}_q \rVert_2$ |
| 5: | $d_{\min} \leftarrow$ distance from $\mathbf{v}_q$ to its nearest neighbor in the question bank ▹ Equation (8) |
| 6: | if $d_{\min} > \tau_1$ then return general_llm |
| 7: | end if ▹ Stage 2: Retrieval routing |
| 8: | Retrieve Top-$k$ neighbors by cosine score ▹ Equation (9) |
| 9: | Compute vote scores $s(c)$ for $c \in \mathcal{D}$ by Equations (10) and (11) |
| 10: | $\hat{c} \leftarrow \arg\max_{c \in \mathcal{D}} s(c)$; $\gamma \leftarrow$ normalized confidence of $\hat{c}$ ▹ Equation (12) |
| 11: | if $\gamma \ge \tau_2$ then return $\hat{c}$ |
| 12: | end if ▹ Stage 3: LLM-based fine-grained routing |
| 13: | Build a routing prompt from $q$, $\mathcal{D}$, and the Top-$k$ neighbors; query the routing LLM |
| 14: | if JSON parsing succeeds and the Top-1 domain $\in \mathcal{D}$ then |
| 15: | return the Top-1 predicted domain |
| 16: | else |
| 17: | return general_llm |
| 18: | end if |
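As a minimal illustration of Algorithm 1, the sketch below implements the three stages in Python on top of a FAISS inner-product index. The encoder `embed`, the wrapper `ask_routing_llm`, the chit-chat keyword set, and the parameter values are illustrative assumptions, not the deployed components or tuned thresholds:

```python
# Minimal sketch of Algorithm 1 (three-stage routing). Assumptions, not
# the deployed components: `embed` is any sentence encoder returning an
# L2-normalized float32 numpy vector; `ask_routing_llm` wraps the routing
# LLM and returns its raw text output.
import json
import faiss  # pip install faiss-cpu

CHITCHAT = {"hello", "hi", "thanks"}  # illustrative chit-chat keyword set
TAU1, TAU2, K = 0.45, 0.60, 5         # illustrative tau_1, tau_2, Top-k

def route(q, index: faiss.IndexFlatIP, labels, domains, embed, ask_routing_llm):
    # Stage 1: ultra-fast filtering (rules + 1-NN distance test).
    if q.strip().lower() in CHITCHAT:
        return "general_llm"
    v = embed(q).reshape(1, -1)                # shape (1, d), L2-normalized
    sims, ids = index.search(v, K)             # inner product == cosine here
    if 1.0 - float(sims[0, 0]) > TAU1:         # d_min exceeds tau_1
        return "general_llm"
    # Stage 2: retrieval routing via similarity-weighted neighbor voting.
    votes = {}
    for sim, i in zip(sims[0], ids[0]):
        votes[labels[i]] = votes.get(labels[i], 0.0) + float(sim)
    best = max(votes, key=votes.get)
    conf = votes[best] / sum(votes.values())   # normalized vote confidence
    if conf >= TAU2:
        return best
    # Stage 3: LLM-based fine-grained routing with a JSON-guarded fallback.
    prompt = (f"Question: {q}\nCandidate domains: {sorted(domains)}\n"
              'Reply as JSON: {"domain": "<one candidate>"}')
    try:
        pred = json.loads(ask_routing_llm(prompt))["domain"]
        return pred if pred in domains else "general_llm"
    except (json.JSONDecodeError, KeyError, TypeError):
        return "general_llm"
```

Because the question-bank vectors are $\ell_2$-normalized, a single inner-product search serves both the Stage 1 distance test and the Stage 2 vote, which keeps the first two stages at millisecond scale.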
3.4.1. Security Considerations and Prompt-Injection Resilience
3.4.2. Ultra-Fast Filtering Stage: Steel vs. General Classification
3.4.3. Retrieval Routing Stage: Domain Determination via Nearest-Neighbor Voting
- Retrieval Score and Top-k Selection
3.4.4. LLM-Based Fine-Grained Routing Stage: Handling Complex/Boundary Cases
3.4.5. Threshold Tuning and Low-Confidence Handling
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets and Splits
4.1.2. Query Sources and Labeling Protocol
1. Steel-domain questions. We use automatically generated and annotated questions from the high-quality metallurgy corpus (Section 3.2), covering eight domains: raw materials and ironmaking, steelmaking and secondary refining, continuous casting, rolling, heat treatment, steel grade design, defects and quality, and production and green/low-carbon metallurgy. This subset contains 3136 instances with an approximately balanced domain distribution, ensuring that each domain has sufficient samples for evaluation. We evaluate fine-grained domain routing (Top-1 accuracy and macro-F1) on this 8-way steel-domain subset only; a sketch of the label-validation checks follows this list.
2. Non-steel questions. Non-steel questions (e.g., chit-chat, general writing, and everyday consulting) are sampled from real dialogue logs or open-source Chinese QA corpora. All such instances are labeled general_llm and are used to evaluate the steel-versus-general classification capability of the ultra-fast filtering stage. This subset contains about 2000 instances. We evaluate the filtering stage as a binary classification problem on the mixed set; in deployment, predicted general queries are routed to general_llm.
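As referenced above, the JSON/schema validation and rule checks applied to LLM-generated labels can be illustrated by a minimal sketch; the field names and the minimum-length rule are illustrative assumptions, while the domain identifiers follow the eight process domains reported in the results tables:

```python
# Minimal sketch of the JSON/schema validation and rule checks used to
# accept LLM-labeled questions. Field names and thresholds are
# illustrative assumptions, not the exact production schema.
import json

DOMAINS = {
    "raw_ironmaking", "steelmaking_refining", "continuous_casting",
    "rolling_control", "heat_treatment", "grade_design",
    "defect_qc", "prod_green",
}

def validate_labeled_question(raw: str) -> bool:
    """Accept a record only if it parses as JSON, carries a non-trivial
    question string, and names exactly one of the defined domains."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(rec.get("question"), str) or len(rec["question"]) < 5:
        return False                      # rule check: reject degenerate text
    return rec.get("domain") in DOMAINS   # schema check: valid domain label
```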
4.1.3. Implementation Details
4.1.4. Evaluation Metrics
4.2. Routing Performance Evaluation
4.2.1. Latency, Cost, and Failure Modes
4.2.2. End-to-End Online Latency and Comparison to Conventional RAG
4.2.3. Domain Routing Results
4.2.4. Steel vs. General Routing Results
4.2.5. Confusion Matrix Analysis
4.2.6. Robustness to Short, Ambiguous, and Multi-Intent Queries
4.3. Baseline Comparison
4.4. Ablation Study
4.4.1. Stage 1 Ablations: Steel-Versus-General Filtering
4.4.2. Stage 2/3 Ablations: Confidence Thresholding and LLM Refinement
4.4.3. Hyperparameter Sensitivity
- Stage 1: distance threshold $\tau_1$.
- Stage 2: confidence threshold $\tau_2$.
- Stage 3: number of retrieved neighbors Top-$k$ (a joint sweep sketch follows this list).
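These three hyperparameters can be tuned jointly with a simple grid search on the labeled routing set. The sketch below assumes a hypothetical wrapper `route_fn` around Algorithm 1 that exposes the thresholds, and scores each configuration by macro-F1; the grid values are illustrative, not the tuned ones:

```python
# Illustrative joint grid search over (tau_1, tau_2, k). `route_fn` is a
# hypothetical wrapper around Algorithm 1; `dev` is a list of
# (question, gold_label) pairs from the labeled routing set.
from itertools import product
from sklearn.metrics import f1_score

def sweep(route_fn, dev):
    gold = [y for _, y in dev]
    best_cfg, best_f1 = None, -1.0
    for tau1, tau2, k in product([0.35, 0.45, 0.55],   # Stage 1 distance
                                 [0.50, 0.60, 0.70],   # Stage 2 confidence
                                 [3, 5, 7]):           # Top-k neighbors
        preds = [route_fn(q, tau1, tau2, k) for q, _ in dev]
        macro_f1 = f1_score(gold, preds, average="macro")
        if macro_f1 > best_f1:
            best_cfg, best_f1 = (tau1, tau2, k), macro_f1
    return best_cfg, best_f1  # best (tau_1, tau_2, k) and its macro-F1
```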

5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ghosh, A.; Chatterjee, A. Ironmaking and Steelmaking: Theory and Practice; PHI Learning: New Delhi, India, 2008.
- Merten, D. Decision Support Systems for Steel Production Planning—State of the Art and Open Questions. In Steel 4.0: Digitalization in Steel Industry; Uygun, Y., Özgür, A., Hütt, M.T., Eds.; Springer: Cham, Switzerland, 2024; pp. 73–83.
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2021, arXiv:2005.11401.
- Wu, T.; Li, J.; Bao, J.; Liu, Q. Language model-driven multi-agent systems for improving production efficiency and reducing carbon emissions in manufacturing. Comput. Ind. Eng. 2025, 207, 111299.
- Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixtures of Local Experts. Neural Comput. 1991, 3, 79–87.
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
- Ding, Y.; Luo, S.; Dai, Y.; Jiang, Y.; Li, Z.; Martin, G.; Peng, Y. A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends. arXiv 2025, arXiv:2507.09861.
- Zhang, Z.; Zhang, Y.; Liang, Y.; Xiang, L.; Zhao, Y.; Zhou, Y.; Zong, C. From Chaotic OCR Words to Coherent Document: A Fine-to-Coarse Zoom-Out Network for Complex-Layout Document Image Translation. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 10877–10890.
- Li, Y.; Zhao, H.; Jiang, H.; Pan, Y.; Liu, Z.; Wu, Z.; Shu, P.; Tian, J.; Yang, T.; Xu, S.; et al. Large language models for manufacturing. arXiv 2024, arXiv:2410.21418.
- Zhang, C.; Zhou, G.; Liu, Y.; Zhou, G.; Zeng, K.; Chang, F.; Ding, K. A survey on potentials, pathways and challenges of large language models in new-generation intelligent manufacturing. Robot. Comput.-Integr. Manuf. 2025, 92, 102883.
- Jiang, T.; Zhu, D.; Wu, H.; Mao, X. Large language models empowering the steel industry: Technology and application outlook. Yejin Zidonghua 2025, 49, 1–17. (In Chinese)
- Du, K.; Yang, B.; Xie, K.; Dong, N.; Zhang, Z.; Wang, S.; Mo, F. LLM-MANUF: An integrated framework of fine-tuning large language models for intelligent decision-making in manufacturing. Adv. Eng. Inform. 2025, 65, 103263.
- Chandrasekhar, A.; Chan, J.; Ogoke, F.; Ajenifujah, O.; Barati Farimani, A. AMGPT: A large language model for contextual querying in additive manufacturing. Addit. Manuf. Lett. 2024, 11, 100232.
- Khan, M.T.; Chen, L.; Feng, W.; Moon, S.K. Large language model-powered decision support for a metal additive manufacturing knowledge graph. arXiv 2025, arXiv:2505.20308.
- Fan, H.; Fan, Z.; Liu, C.; Zhu, J.; Gibbs, T.; Fuh, J.Y.H.; Lu, W.F.; Li, B. MetalMind: A knowledge graph-driven human-centric knowledge system for metal additive manufacturing. npj Adv. Manuf. 2025, 2, 25.
- Li, S.; Corney, J. MechRAG: A multimodal large language model for mechanical engineering. Commun. Eng. 2025, 4, 187.
- Fu, T.; Liu, S.; Li, P. Intelligent smelting process management system: Efficient and intelligent management strategy by incorporating large language model. Front. Eng. Manag. 2024, 11, 396–412.
- Zhang, H.; Gu, J.; Sun, Y.; Zheng, Q.; Li, M. StiBench: An understanding benchmark for large language models in the steel metallurgy domain. Yejin Zidonghua 2025, 49, 102–111. (In Chinese)
- Hu, Q.J.; Bieker, J.; Li, X.; Jiang, N.; Keigwin, B.; Ranganath, G.; Keutzer, K.; Upadhyay, S.K. RouterBench: A benchmark for multi-LLM routing system. arXiv 2024, arXiv:2403.12031.
- Ong, I.; Almahairi, A.; Wu, V.; Chiang, W.L.; Wu, T.; Gonzalez, J.E.; Kadous, M.W.; Stoica, I. RouteLLM: Learning to route LLMs with preference data. arXiv 2024, arXiv:2406.18665.
- Jitkrittum, W.; Narasimhan, H.; Rawat, A.S.; Juneja, J.; Wang, C.; Wang, Z.; Go, A.; Lee, C.Y.; Shenoy, P.; Panigrahy, R.; et al. Universal model routing for efficient LLM inference. arXiv 2025, arXiv:2502.08773.
- Fu, T.; Ge, Y.; You, Y.; Liu, E.; Yuan, Z.; Dai, G.; Yan, S.; Yang, H.; Wang, Y. R2R: Efficiently navigating divergent reasoning paths with small-large model token routing. arXiv 2025, arXiv:2505.21600.
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv 2023, arXiv:2308.08155.
- Wu, T.; Li, J.; Bao, J.; Liu, Q. ProcessCarbonAgent: A large language models-empowered autonomous agent for decision-making in manufacturing carbon emission management. J. Manuf. Syst. 2024, 76, 429–442.
- Li, M.; Wang, R.; Zhou, X.; Zhu, Z.; Wen, Y.; Tan, R. ChatTwin: Toward Automated Digital Twin Generation for Data Center via Large Language Models. In Proceedings of the 10th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys ’23), Istanbul, Turkey, 15–16 November 2023.
- Yang, J.; Li, S.; Wang, X.; Lu, J.; Wu, H.; Wang, X. DeFACT in ManuVerse for parallel manufacturing: Foundation models and parallel workers in smart factories. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 2188–2199.
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
- Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss Library. arXiv 2024, arXiv:2401.08281.
- Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. arXiv 2017, arXiv:1702.08734.


| Item | Description | Before Filtering | After Filtering |
|---|---|---|---|
| Documents | textbooks/monographs/papers | 36 | 36 |
| Pages | scanned pages processed | 7800 | 7800 |
| Segments | paragraph-level OCR segments | 305,000 | 228,000 |
| Retention rate | kept/total | – | 0.75 |
| Avg. segment length | characters per segment | 210 | 240 |
| Usable rate (human) | fraction labeled usable | 0.62 | 0.86 |
| Unusable rate (human) | fraction labeled unusable | 0.21 | 0.05 |
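The retention rate follows directly from the segment counts: $228{,}000 / 305{,}000 \approx 0.75$, matching the reported value.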
| Subset | Source | Labeling Method | Size |
|---|---|---|---|
| Steel-domain (main) | OCR metallurgy corpus → LLM question synthesis | LLM labeling under domain definitions; JSON/schema validation + rule checks | 3136 |
| Non-steel | General QA/dialog corpora and dialog logs | Fixed label general_llm | ≈2000 |
| Expert-validated | Stratified sample from the steel-domain set (8 domains; 50 per domain) | 3 experts: 2 independent labels; 1 adjudicates conflicts | 400 |
| Component | Avg. (ms) | Invocation Rate | Notes |
|---|---|---|---|
| Stage 1 (filter) | 14 | 1.000 | rules + embedding; 1-NN distance |
| Stage 2 (retrieval vote) | 0.7 | 0.943 | FAISS Top-k; runs if Stage 1 predicts steel |
| Stage 3 (LLM refine) | 650 | 0.283 | triggered when $\gamma < \tau_2$; main cost |
| Overall (end-to-end) | 199 | – | invocation-rate-weighted average |
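The end-to-end average is consistent with the invocation-rate-weighted sum of the per-stage costs: $14 \times 1.000 + 0.7 \times 0.943 + 650 \times 0.283 \approx 199$ ms, i.e., the LLM refinement stage dominates overall latency despite running on fewer than a third of queries.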
| System | Routing (ms) | Retrieval (ms) | Generation (ms) | Notes |
|---|---|---|---|---|
| Conventional RAG (single index) | 0 | 10 | 1200 | shared index; shared prompt |
| STAR (router + domain RAG) | 199 | 8 | 1210 | domain-scoped index/prompt |
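Summing the columns gives roughly $0 + 10 + 1200 = 1210$ ms for conventional RAG versus $199 + 8 + 1210 = 1417$ ms for STAR, i.e., about a 17% end-to-end overhead, attributable almost entirely to the routing stage.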
| Domain ID | Precision | Recall | F1 |
|---|---|---|---|
| raw_ironmaking | 0.76 | 0.94 | 0.84 |
| steelmaking_refining | 0.85 | 0.93 | 0.89 |
| continuous_casting | 0.92 | 0.92 | 0.92 |
| rolling_control | 0.91 | 0.95 | 0.93 |
| heat_treatment | 0.93 | 0.80 | 0.86 |
| grade_design | 0.93 | 0.95 | 0.94 |
| defect_qc | 0.98 | 0.84 | 0.90 |
| prod_green | 0.97 | 0.86 | 0.91 |
| Level | Task/Class | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|
| Fine-grained domain routing | 8 steel domains | – | – | 0.899 (macro) | 0.921 (Top-1) |
| Fast filter (steel vs. general) | Steel (1) | 0.647 | 0.999 | 0.785 | – |
| Fast filter (steel vs. general) | General (0) | 0.990 | 0.144 | 0.251 | – |
| Fast filter (steel vs. general) | Overall (binary) | – | – | – | 0.666 |
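The overall binary accuracy is consistent with the class recalls weighted by the subset sizes: $(3136 \times 0.999 + 2000 \times 0.144)/(3136 + 2000) \approx 0.666$.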
| Subset | Size | Top-1 Acc. | Macro-F1 | Stage 3 Inv. Rate | Notes |
|---|---|---|---|---|---|
| Short queries (≤12 chars) | 620 | 0.890 | 0.870 | 0.420 | underspecified intents |
| Ambiguous (Stage 2 $\gamma < \tau_2$) | 940 | 0.880 | 0.860 | 1.000 | Stage 3 frequently triggered |
| Multi-intent (heuristic/manual) | 280 | 0.840 | 0.810 | 0.680 | may require multi-domain routing |
| Method | Top-1 Acc. | Macro-F1 |
|---|---|---|
| Embedding-only: NN (k = 1) | 0.848 | 0.690 |
| Retrieval-only: voting (k = 5) | 0.871 | 0.725 |
| Supervised: LR on embeddings | 0.907 | 0.853 |
| LLM-only (no retrieval examples) | 0.832 | 0.713 |
| Proposed multi-stage router | 0.921 | 0.899 |
| Variant | Steel FNR | General FPR |
|---|---|---|
| Stage 1 (full) | 0.001 | 0.856 |
| w/o chit-chat keyword filter | 0.003 | 0.930 |
| w/o distance threshold ($\tau_1$) | 0.002 | 0.995 |
| Variant | Top-1 Acc. | Macro-F1 | Stage 2 Cov. | Stage 3 Inv. Rate |
|---|---|---|---|---|
| Proposed router (full) | 0.921 | 0.899 | 0.700 | 0.300 |
| Stage 2 only (no Stage 3) | 0.871 | 0.725 | 1.000 | 0.000 |
| Stage 3 w/o retrieved neighbors | 0.895 | 0.830 | 0.700 | 0.300 |