Review

Large Language Models for High-Entropy Alloys: Literature Mining, Design Orchestration, and Evaluation Standards

1 Shanghai Key Lab of Advanced High-Temperature Materials and Precision Forming, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2 International School of Information and Software, Dalian University of Technology, Dalian 116024, China
3 Inner Mongolia Research Institute, Shanghai Jiao Tong University, Hohhot 010010, China
* Author to whom correspondence should be addressed.
Metals 2026, 16(2), 162; https://doi.org/10.3390/met16020162
Submission received: 18 December 2025 / Revised: 22 January 2026 / Accepted: 22 January 2026 / Published: 29 January 2026

Abstract

High-entropy alloys (HEAs) present a fundamental design paradox: their exceptional properties arise from complex, high-dimensional composition–process–microstructure–property (CPMP) relationships, yet the knowledge needed to navigate this space is fragmented across a vast and unstructured literature. Large language models (LLMs) offer a transformative interface to this complexity. By extracting structured facts from text, they can convert dispersed and heterogeneous evidence (i.e., findings scattered across many studies and reported with inconsistent test protocols or characterization standards) into queryable knowledge graphs. Through code generation and tool composition, they can automate simulation pipelines, surrogate model construction, and inverse design workflows. This review analyzes how LLMs can augment key stages of HEA research—from intelligent literature mining and multimodal data integration (using LLMs to automatically extract and structure data from texts and to combine information across text, images, and other data sources) to model-driven design and closed-loop experimentation—illustrated by emerging case studies. We propose concrete evaluation protocols that measure direct scientific utility, including knowledge-graph completeness, workflow setup efficiency, and experimental validation hit rates. We also confront practical limitations: data sparsity and noise, model hallucination, domain bias (where models may exhibit superior predictive performance for specific, well-represented alloy systems over others due to imbalances in training data), and the imperative for reproducible infrastructure. We argue that domain-specialized LLMs, embedded within grounded, verifiable research systems, can not only accelerate HEA discovery but also standardize the representation, sharing, and reuse of community knowledge.

1. Introduction

High-entropy alloys (HEAs) expand alloy design beyond conventional single-base paradigms by combining multiple principal elements in near-equiatomic proportions. Since their emergence in 2004, HEAs have stimulated exploration across prototypical families (e.g., CoCrFeNi and AlCoCrFeNi-based systems) and have motivated several frequently cited concepts such as the high-entropy effect, lattice distortion, sluggish diffusion, and cocktail effects [1,2,3,4,5,6,7,8] as shown in Figure 1. In particular, “sluggish diffusion” is not a universally observed characteristic of HEAs: reported diffusion behavior varies with the diffusion definition/measurement protocol, phase state, temperature window, and microstructural contributions (e.g., lattice vs. grain-boundary transport), underscoring the need for condition-annotated, evidence-traceable synthesis. At the same time, the field has repeatedly shown that many widely repeated claims are context-dependent: reported trends can change with alloy family, processing history, test definition, and environment. This reality makes HEA research a demanding setting for evidence-based synthesis and highlights the need for datasets and benchmarks that explicitly encode measurement definitions, uncertainty, and experimental context to avoid propagating contested or non-transferable conclusions [9,10].
A central bottleneck is the difficulty of converting dispersed and heterogeneous evidence into transferable, auditable design knowledge. Brute-force experimentation over even a small fraction of compositions and processing routes is infeasible, while first-principles calculations, molecular dynamics, and CALPHAD remain computationally expensive and are constrained by incomplete or uncertain thermodynamic and mobility databases for multi-component systems [12,13]. Despite two decades of work, key measurements and processing details are still scattered across a large and rapidly growing literature, and there is no widely adopted, federated resource that consistently links composition, processing history, microstructure, and properties in a standardized manner [14,15]. Consequently, extracting usable rules and testing mechanistic hypotheses at scale often depends on manual curation and expert interpretation, limiting reproducibility and slowing iteration.
Large language models (LLMs) offer a complementary route to mitigate this fragmentation—not by replacing physics-based modeling or domain expertise, but by serving as an interface layer that connects researchers to unstructured literature and increasingly complex digital toolchains. When integrated with retrieval, information-extraction pipelines, simulation codes, machine-learning models, and data-management systems, LLM-centered workflows can help structure evidence, reduce time spent on information retrieval, automate parts of workflow assembly, and support hypothesis generation under clearly defined constraints [16,17,18]. Early studies in adjacent materials and chemistry settings suggest that LLM-assisted extraction of formulation/synthesis information and organization of composition–performance evidence can reduce the overhead of turning text into machine-actionable resources. However, HEA research also exposes the limitations of current LLM deployments: literature data are sparse, inconsistent, and strongly condition-specific, and reliable use requires grounding, provenance, and validation rather than fluent text alone.
In parallel, machine learning for HEAs has matured such that supervised and generative models can capture aspects of high-dimensional composition–process–microstructure–property relationships and enable large-scale screening [19,20].
LLMs can add a semantic and procedural layer on top of this foundation, enabling workflows that are more reproducible, auditable, and task-driven—provided that outputs are grounded in traceable sources and evaluated with domain-appropriate protocols [21,22,23,24] as shown in Figure 2.
This review therefore focuses on a critical question: how do the defining challenges of HEA research—namely, condition-dependent and fragmented evidence, multimodal data integration, and resource-intensive iteration—shape the requirements for reliable LLM integration and credible evaluation [25]?
To answer this question, we move from general potential to a framework of operational requirements and testable practices, with three contributions:
First, we restate HEA discovery bottlenecks from an information and workflow perspective (Section 2), emphasizing metadata-aware extraction, schema normalization, hybrid multimodal pipelines, and auditable orchestration.
Second, we connect these requirements to a critical assessment of LLM capabilities and near-term application scenarios (Section 3 and Section 4), outlining how evidence-grounded retrieval, tool use, and constraint-following reasoning can support literature mining, design assistance, and simulation/workflow execution, while highlighting failure modes and necessary guardrails.
Third, we synthesize barriers to trustworthy deployment and propose HEA-specific evaluation and reporting practices (Section 5, Section 6 and Section 7), including protocol-aware benchmarks, provenance and artifact reporting, uncertainty/failure-case documentation, and reproducibility checks for tool-using pipelines.
Together, these elements aim to provide a grounded basis for integrating LLMs into the HEA research lifecycle and for evaluating such systems under realistic, condition-sensitive constraints [26].

2. Current Status and Core Challenges in High-Entropy Alloy Research

The performance potential of high-entropy alloys (HEAs) is intrinsically coupled to their design complexity. Breakthroughs in this field are constrained not only by metallurgical theory but also by systemic bottlenecks in information integration and research workflows. From the perspective of integrating large language models (LLMs) into HEA research paradigms, these bottlenecks manifest as: (i) an evidence base that is highly condition-dependent and fragmented [27]; (ii) a persistent gap in integrating multimodal and multiscale evidence chains; and (iii) resource-intensive trial-and-error cycles within a high-dimensional design space, together with an increasing demand for auditability. The following discussion reframes traditional challenges as concrete requirements that LLM-augmented workflows must address, thereby motivating the capability design and evaluation priorities in subsequent sections.

2.1. Condition Dependence and Fragmentation of HEA Evidence

HEA performance can be viewed as a joint function of composition, processing, microstructure, and test environment; perturbations in any of these variables may substantially shift—or even reverse—reported trends. Figure 3 exemplifies the practical consequence of such condition dependence: even within AlCoCrFeNi-derived systems, variations in electrolyte chemistry and concentration, temperature, scan parameters, and surface state can lead to markedly different corrosion responses and key metrics. Moreover, cross-study comparisons are often confounded by non-harmonized test protocols, making naïve aggregation of literature values systematically misleading. Consequently, the central challenge for LLM deployment in HEA research is not merely to extract “composition–property” pairs but to identify, extract, and link constraint-defining metadata from unstructured text—such as solution-treatment parameters, strain rate, electrolyte identity and concentration, scan rate, reference electrode, and surface preparation. This further requires robust entity resolution and schema normalization to map heterogeneous expressions (e.g., “annealed at 1100 °C for 2 h” versus “1100 °C/2 h solution treatment”) into a consistent canonical representation, which is a prerequisite for building queryable, machine-actionable knowledge graphs and comparable benchmarks [11,28,29,30].
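As a minimal illustration of such normalization (the field names and patterns below are our own, not a published schema), the following sketch maps two common heat-treatment phrasings onto a single canonical record; phrasings that no rule matches would be routed to LLM-assisted or manual review:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeatTreatmentRecord:
    """Canonical representation of a solution-treatment step (illustrative schema)."""
    temperature_C: float
    duration_h: float
    quench: Optional[str] = None

# Two rules covering common phrasings; a production pipeline would use many more
# rules plus an LLM/encoder fallback for unseen formulations.
_PATTERNS = [
    # "annealed at 1100 °C for 2 h"
    re.compile(r"at\s*(?P<T>\d+(?:\.\d+)?)\s*°?\s*C\s*for\s*(?P<t>\d+(?:\.\d+)?)\s*h", re.I),
    # "1100 °C/2 h solution treatment"
    re.compile(r"(?P<T>\d+(?:\.\d+)?)\s*°?\s*C\s*/\s*(?P<t>\d+(?:\.\d+)?)\s*h", re.I),
]

def normalize_heat_treatment(text: str) -> Optional[HeatTreatmentRecord]:
    """Map a free-text processing description onto the canonical record, or return None."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            quench = "water" if re.search(r"water[- ]quench", text, re.I) else None
            return HeatTreatmentRecord(float(match["T"]), float(match["t"]), quench)
    return None  # unmatched phrasings are flagged for review rather than silently dropped

if __name__ == "__main__":
    for phrase in ["annealed at 1100 °C for 2 h",
                   "1100 °C/2 h solution treatment, water-quenched"]:
        print(phrase, "->", normalize_heat_treatment(phrase))
```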

2.2. Performance Optimization and Mechanistic Understanding

Mechanistic claims often require linking processing, SEM/TEM/XRD evidence, and property curves, yet these links are typically constructed manually as shown in Figure 4. LLMs can assist as cross-study, cross-document connectors for retrieval and alignment [31,32], but reliable quantification from scientific images/spectra generally requires a tool-first hybrid pipeline: specialized algorithms extract reproducible descriptors (e.g., grain size, phase fraction, precipitate statistics), while the LLM aligns descriptors with textual claims and tracks provenance.

2.3. Advanced Fabrication and Processing Techniques

The HEA design space is vast, and experimentation/high-fidelity computation is costly. Although CPMP surrogate models exist [33], their effectiveness depends on structured, condition-annotated data (e.g., solidification conditions summarized in Figure 5); manual literature curation remains a major bottleneck. LLMs can accelerate dataset assembly via automated mining and structured extraction, but the resulting pipelines must be traceable and auditable—recording data provenance, tool versions/configurations, and decision traces—to support reproducibility and failure analysis [34].

2.4. Summary of Key Challenges

In short, HEA research stresses LLM systems along three axes: metadata-aware extraction under strong condition dependence, hybrid multimodal integration, and reproducible, auditable workflow orchestration, as exemplified in Figure 6. These requirements motivate the capability design, evaluation criteria, and application scenarios developed in the following sections [35,36].

2.5. Positioning Relative to Adjacent Paradigms

LLM integration is best viewed as a meta-layer that complements established paradigms: databases/manual curation provide consolidation but limited semantic interaction; domain-specific encoders (e.g., MatSciBERT) support tagging and retrieval but not generative orchestration [37]; descriptor-based ML and Bayesian optimization enable prediction and search within defined spaces, while LLMs can help define constraints and evidence-backed search spaces from literature and orchestrate tool-based workflows for more auditable discovery.

3. Core Capabilities of Large Language Models and Alignment with HEA Research

The challenges summarized in Section 2—high-dimensional composition–process spaces and fragmented, multimodal evidence—define a workflow where progress is often limited by information synthesis and tool orchestration, in addition to model capability itself. Large language models (LLMs) are well suited to this setting because they can (i) extract and synthesize knowledge from large, unstructured corpora [38], (ii) generate and coordinate code and workflows that connect to simulation/ML tools, and (iii) support multimodal and constraint-aware reasoning when coupled with domain validators. In the following subsections, we summarize these capabilities and discuss how they map to concrete HEA research tasks [39].
Terminology and scope. Throughout this review, “LLM” primarily refers to autoregressive, generative models such as GPT-4 [40], Claude [41], and open-source alternatives (e.g., Llama 3 [42]). We also discuss domain-adapted models (e.g., AtomGPT) and tool-augmented components such as chart-understanding modules (e.g., ChartAdapter [43]) that enable grounded interpretation of scientific figures. This usage differs from bidirectional encoder-only models (e.g., BERT, SciBERT, MatSciBERT), which excel at embeddings for classification and retrieval but are not designed for generative task orchestration. In practice, an HEA AI system often benefits from both: an encoder can support entity tagging and retrieval, while a generative LLM orchestrates tools and communicates results to users. Our focus is on the integrative and generative role of the latter [44].

3.1. Large-Scale Text Comprehension, Knowledge Extraction and Natural-Language Reasoning

LLMs can process large volumes of unstructured text and, when coupled with information-extraction (IE) pipelines, convert literature content into structured records by extracting key entities and relations (composition, processing, phases/microstructure, properties, and test conditions). Such structuring supports searchable databases and knowledge graphs for downstream querying and integration [45] as illustrated in Figure 7.
When combined with retrieval-augmented generation (RAG) and traceable citations, LLM-based systems can produce evidence-grounded summaries, cross-study comparisons, and hypothesis cues while maintaining verifiability [46,47].
Implications for HEA research. This capability directly addresses the condition dependence and fragmentation highlighted in Section 2.1. HEA-focused extraction cannot stop at “composition–property” pairs; it must treat constraint-defining metadata (electrolyte identity/concentration, temperature, scan parameters, heat treatment, reference electrode, surface state) as first-class fields and align them across studies. Reliability therefore hinges on entity resolution and schema normalization: without canonicalization of terminology and protocols, structured outputs will incorrectly promote condition-specific observations into seemingly transferable trends.
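A minimal sketch of such a condition-aware record is shown below; the field names are illustrative assumptions rather than an established community schema, and all numerical values are placeholders:

```python
import json

# Illustrative condition-aware extraction record. Constraint-defining metadata
# (electrolyte, scan rate, reference electrode, surface state) are first-class
# fields, and every value keeps a pointer back to its source passage.
# All values below are placeholders, not measured data.
record = {
    "alloy": {"system": "AlCoCrFeNi",
              "composition_at_pct": {"Al": 20, "Co": 20, "Cr": 20, "Fe": 20, "Ni": 20}},
    "processing": {"route": "arc melting",
                   "solution_treatment": {"temperature_C": 1100, "duration_h": 2, "quench": "water"}},
    "property": {"name": "pitting potential", "value": 0.25, "unit": "V_vs_SCE"},
    "test_conditions": {"electrolyte": "3.5 wt.% NaCl", "temperature_C": 25,
                        "scan_rate_mV_per_s": 1, "reference_electrode": "SCE",
                        "surface_preparation": "ground to 1200 grit"},
    "provenance": {"doi": "10.xxxx/example", "passage_id": "p12_s3",
                   "extraction_confidence": 0.86},
}

print(json.dumps(record, indent=2))
```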

3.2. Code Generation and Workflow Automation

Modern LLMs can translate natural-language intent into executable scripts and configurations and can further act as orchestrators that coordinate external tools (database queries, CALPHAD, DFT/MD, surrogate models) by planning steps, passing parameters, and iterating based on tool outputs [48,49,50,51], as exemplified by the database enhancement pipeline shown in Figure 8. In this tool-using paradigm, results are produced by verifiable tools, while the LLM contributes planning, interface translation, and interpretation—improving reproducibility relative to free-form numeric generation.
Implications for HEA research. This capability targets the resource-intensive iteration described in Section 2.3. LLMs can accelerate preliminary screening and parameter studies by turning high-level goals into executable pipelines while producing audit trails (versioned scripts/configs, tool versions, input files, and decision points). The core risk is silent failure: generated code and implicit assumptions can be wrong or biased. Robust deployment therefore requires validators and fixed protocols—i.e., tool-level checks and test suites—rather than relying on linguistic plausibility.
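The following sketch illustrates this division of labor under simplifying assumptions (the plan format and tool registry are our own, not a specific framework's API): numerical results come only from registered tools, and every call is appended to a persistent audit trail.

```python
import json, math, time
from typing import Callable, Dict, List

def mixing_entropy_J_molK(composition: Dict[str, float]) -> float:
    """Registered stand-in tool: ideal configurational mixing entropy, -R * sum(c ln c)."""
    total = sum(composition.values())
    fracs = [x / total for x in composition.values() if x > 0]
    return -8.314 * sum(c * math.log(c) for c in fracs)

TOOLS: Dict[str, Callable] = {"mixing_entropy": mixing_entropy_J_molK}

def run_plan(plan: List[dict], log_path: str = "audit_trail.jsonl") -> List[dict]:
    """Execute an LLM-proposed plan step by step; numbers come only from registered tools."""
    results = []
    with open(log_path, "a") as log:
        for step in plan:
            output = TOOLS[step["tool"]](**step["args"])
            entry = {"time": time.time(), "tool": step["tool"],
                     "args": step["args"], "output": output}
            log.write(json.dumps(entry) + "\n")   # one audit-trail line per tool call
            results.append(entry)
    return results

if __name__ == "__main__":
    plan = [{"tool": "mixing_entropy",
             "args": {"composition": {"Al": 0.2, "Co": 0.2, "Cr": 0.2, "Fe": 0.2, "Ni": 0.2}}}]
    print(run_plan(plan))   # ~13.4 J/(mol K) for an equiatomic five-component alloy
```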

3.3. Cross-Modal and Logical Reasoning

Emerging multimodal LLMs can, in principle, integrate text with tables, charts, and selected scientific figures to support cross-document alignment and hypothesis formulation [52]. In HEA settings, credibility often depends on making physical and empirical constraints explicit. Thermodynamic consistency, compositional constraints, and heuristic rules (e.g., size-misfit descriptors) can be enforced either at the prompt level or—preferably—as post-generation validator gates that compute relevant metrics and flag or reject candidates that violate user-defined thresholds [53], as demonstrated in Figure 9. This shifts the system from language-only coherence toward physically plausible decision-making.
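A minimal validator-gate sketch is given below; the atomic radii, VEC values, and thresholds are approximate and illustrative and would need to be replaced by a vetted, versioned parameter set in practice:

```python
import math
from typing import Dict, Tuple

# Post-generation validator gate (sketch). Parameter values are approximate and
# the thresholds are user-defined heuristics, not hard physical limits.
ATOMIC_RADIUS_A = {"Al": 1.43, "Co": 1.25, "Cr": 1.28, "Fe": 1.26, "Ni": 1.25}
VEC = {"Al": 3, "Co": 9, "Cr": 6, "Fe": 8, "Ni": 10}

def size_misfit_delta(c: Dict[str, float]) -> float:
    """Atomic size-misfit parameter delta (%) from atomic fractions."""
    total = sum(c.values())
    fracs = {el: x / total for el, x in c.items()}
    r_bar = sum(f * ATOMIC_RADIUS_A[el] for el, f in fracs.items())
    return 100.0 * math.sqrt(sum(f * (1.0 - ATOMIC_RADIUS_A[el] / r_bar) ** 2
                                 for el, f in fracs.items()))

def mean_vec(c: Dict[str, float]) -> float:
    total = sum(c.values())
    return sum(x * VEC[el] for el, x in c.items()) / total

def validate(c: Dict[str, float],
             delta_max: float = 6.6, vec_fcc_min: float = 8.0) -> Tuple[bool, dict]:
    """Flag candidates that violate the user-defined heuristic thresholds."""
    metrics = {"delta_pct": round(size_misfit_delta(c), 2), "vec": round(mean_vec(c), 2)}
    ok = metrics["delta_pct"] <= delta_max and metrics["vec"] >= vec_fcc_min
    return ok, metrics

if __name__ == "__main__":
    for candidate in ({"Co": 25, "Cr": 20, "Fe": 25, "Ni": 30},        # at.%
                      {"Al": 10, "Co": 22, "Cr": 23, "Fe": 23, "Ni": 22}):
        print(candidate, "->", validate(candidate))
```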
Implications for HEA research. This capability is central to the multimodal integration challenge in Section 2.2. LLMs can serve as “connectors” that align textual claims with multimodal evidence across papers; however, their direct, reproducible quantification from scientific images/spectra remains limited. Practical solutions therefore favor tool-first hybrid pipelines: specialized algorithms extract quantitative descriptors (grain size, phase fraction, precipitate statistics), and the LLM performs evidence-chain alignment, interpretation, and hypothesis generation with provenance tracking.

3.4. LLMs vs. Traditional Machine Learning: Comparative Advantages and a Synergistic Paradigm

Conventional machine learning (ML) excels at regression, classification, and optimization on structured numerical data but is constrained by feature definitions and training distributions. LLMs complement ML by enabling semantic retrieval, evidence integration, and constraint articulation from fragmented literature knowledge [54]. A pragmatic paradigm positions the LLM as a cognitive orchestrator (problem decomposition, constraint specification, explanation), while ML models act as efficient quantitative predictors/optimizers within the structured design space defined by the LLM [55]. Table 1 summarizes this division of labor and the interfaces that enable synergy in HEA workflows.
Implications for HEA research. This hybrid setting mitigates the compound challenges in Section 2: LLMs define and document constraints and priors with evidence links, while ML provides quantitative ranking and validation signals. Explicit uncertainty and provenance reporting strengthen auditability.

3.5. Emerging Landscape: Recent LLM Applications in HEA Research (2024–2025)

The integration of transformers and LLMs into HEA research is transitioning from conceptual promise to concrete experimentation. To ground our discussion in the most recent advancements, Table 2 surveys a curated set of pioneering studies published in 2024–2025 that explicitly deploy these architectures for HEA challenges. This emerging landscape reveals three distinct, yet complementary, paradigms: (i) supervised property prediction via transfer learning, (ii) LLM as an assistant for data curation and workflow setup, and (iii) LLM as an agent within autonomous, closed-loop discovery systems. Collectively, these works validate the feasibility of transformer/LLM approaches for HEA research, while also exposing gaps in evaluation rigor, reproducibility, and HEA-specific adaptation—gaps that the following sections on application scenarios and challenges aim to address [56,57,58,59,60].

4. Key Application Scenarios of LLMs in HEA Research

Building on the challenges defined in Section 2 and the capability assessment in Section 3, the value of LLMs in HEA research lies in their integration into tool-using, verifiable pipelines, rather than acting as standalone generators. This section outlines four near-term scenarios with explicit inputs/outputs, validator gates, and known limitations, emphasizing how reliability is enforced through provenance, constraints, and reproducible artifacts.

4.1. Intelligent Literature Mining and Knowledge-Graph Construction

Aim and scope. This scenario directly addresses the condition-dependent and fragmented evidence base in Section 2.1. The goal is to convert heterogeneous HEA publications into a queryable, condition-annotated knowledge base with traceable provenance.
LLM role and pipeline. Here, the LLM serves primarily as an information-extraction and canonicalization engine (Section 3.1), typically coupled with encoder-based retrieval and entity tagging. A representative pipeline includes: (i) retrieval and pre-tagging of candidate entities (e.g., composition, processing, microstructure, properties) as exemplified in Figure 10; (ii) fine-grained relation extraction and schema filling (e.g., mapping “solution-treated at 1100 °C for 2 h and water-quenched” into a canonical processing record); and (iii) population of a predefined knowledge-graph schema (e.g., an RDF-style schema as in Figure 7) linking entities to the originating passages [61].
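As a minimal sketch of step (iii), the snippet below populates a small RDF graph with rdflib; the namespace and predicate names are illustrative assumptions rather than the schema of Figure 7, and the numerical values are placeholders:

```python
from rdflib import Graph, Literal, Namespace

# Knowledge-graph population sketch. The namespace, predicates, and values are
# illustrative; the essential pattern is that every statement carries a link
# back to the passage it was extracted from.
HEA = Namespace("http://example.org/hea#")
g = Graph()
g.bind("hea", HEA)

alloy = HEA["AlCoCrFeNi_sample_042"]
passage = HEA["doi_10.xxxx_example_p12_s3"]      # provenance node (placeholder DOI)

g.add((alloy, HEA.hasNominalComposition, Literal("Al20Co20Cr20Fe20Ni20 (at.%)")))
g.add((alloy, HEA.solutionTreatedAt_C, Literal(1100)))
g.add((alloy, HEA.solutionTreatedFor_h, Literal(2)))
g.add((alloy, HEA.yieldStrength_MPa, Literal(420)))
g.add((alloy, HEA.extractedFrom, passage))       # traceable provenance link

print(g.serialize(format="turtle"))
```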
Validators and artifacts. Outputs should include provenance-linked records and confidence metadata (e.g., extraction confidence, normalization status). The user-facing endpoint should support interactive queries that return condition-aware comparisons across studies (composition, processing state, test protocol, metric), with conflicting evidence explicitly flagged rather than silently averaged. In addition to the knowledge graph itself, the pipeline should produce reusable artifacts (schema, extraction rules/prompts, corpus snapshot identifiers) to enable auditability.
Limitations. Extraction accuracy degrades for highly non-standard phrasing and for values embedded in figures/tables. For such cases, human-in-the-loop review remains necessary [62].

4.2. Data-Driven Composition and Process Design Assistant

4.2.1. Typical Workflow for ML-Assisted HEA Design

Conventional ML pipelines for HEA design typically involve data collection, feature engineering, model training, and validation as shown in Figure 11, where feature engineering and documentation are frequent bottlenecks [63,64,65,66,67,68]. LLMs can assist by (i) proposing literature-grounded, physically motivated descriptors (e.g., valence electron concentration, size-misfit metrics, mixing enthalpy) for a stated target; (ii) generating reproducible feature-computation code (scripts/configs); and (iii) recording feature definitions, assumptions, and preprocessing choices to improve auditability.
Beyond LLM-assisted descriptor proposal, algorithmic methods (e.g., genetic algorithms) can systematically search large descriptor spaces to generate task-specific representations that improve model accuracy and generalization [69,70] as demonstrated in Figure 12. In practice, a robust workflow combines descriptor generation/search → leakage-safe evaluation → uncertainty calibration → interpretation aligned with metallurgical constraints.
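The following sketch illustrates a leakage-safe, group-wise split in which records from the same alloy family never appear in both training and test folds; the features and targets are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

# Leakage-safe evaluation sketch: records sharing an alloy-family label are kept
# together, so near-duplicate compositions cannot leak between train and test.
# The feature matrix and target values below are synthetic placeholders.
rng = np.random.default_rng(0)
X = rng.random((60, 4))                                     # e.g., VEC, delta, mixing enthalpy, ...
y = 500 + 200 * X[:, 0] + 30 * rng.standard_normal(60)      # synthetic "yield strength"
groups = np.repeat(np.arange(10), 6)                        # 10 alloy families, 6 records each

model = RandomForestRegressor(n_estimators=100, random_state=0)
fold_mae = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_mae.append(np.mean(np.abs(pred - y[test_idx])))    # MAE per held-out group fold

print(f"group-wise MAE: {np.mean(fold_mae):.1f} ± {np.std(fold_mae):.1f} (synthetic units)")
```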

4.2.2. LLM-Augmented Composition and Process Design

LLMs can function as design assistants when their outputs are constrained by domain rules and grounded in evidence. Given a goal (e.g., “lightweight, oxidation-resistant alloy above 800 °C”), an LLM can propose candidate systems and processing routes with a rationale, but such suggestions should be treated as hypotheses and validated by retrieval (supporting sources), thermodynamic/phase-stability checks, and safety constraints [71]. Emerging frameworks such as AlloyGPT recast composition–structure–property prediction as a sequence-completion task and report strong predictive performance [72] as illustrated in Figure 13; however, for HEA deployment, these approaches require rigorous benchmarking across diverse properties, transparent split strategies, and open artifacts (data/code/prompts) to establish reliability and reproducibility.

4.3. Multiscale Simulation Interface and Integrator

Aim and scope. This scenario lowers the barrier to multiscale modeling while satisfying the auditability and reproducibility requirements in Section 2.3 and Section 2.4. The LLM acts as a natural-language interface and orchestrator for simulation chains (Section 3.2), rather than as a simulator itself.
LLM role and pipeline. Given a high-level request (e.g., studying how cooling rate affects microstructure in an AlCoCrFeNi-derived system), the LLM decomposes the task into an executable workflow that invokes external tools (e.g., thermodynamic calculations, microstructure evolution solvers, and mechanics models) and propagates intermediate outputs to subsequent steps, as outlined in Figure 14. The emphasis is on runnable scripts and versioned configurations, not narrative descriptions [73,74,75,76].
Validators and artifacts. The value proposition is reproducibility: the workflow must log tool versions, input files, parameter settings, and intermediate outputs and should be rerunnable independently of the interactive LLM environment. Validation should include template/schema checks for generated inputs and comparison against benchmark cases when available. Figure 14 can summarize this architecture as a hub-and-spoke system linking knowledge sources to action modules with full trace logging.
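A minimal run-manifest sketch is shown below (file names and keys are illustrative); the point is that tool versions, input-file hashes, and parameters are recorded before execution so that the step can be rerun and audited independently of the interactive session:

```python
import hashlib, json, platform, sys, time
from pathlib import Path
from typing import List

def sha256(path: Path) -> str:
    """Content hash of an input file, so later reruns can detect silent changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def write_manifest(step_name: str, inputs: List[Path], params: dict,
                   tool_versions: dict, out: Path = Path("run_manifest.json")) -> dict:
    """Record everything needed to rerun and audit one workflow step (illustrative keys)."""
    manifest = {
        "step": step_name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tool_versions": tool_versions,        # e.g., solver and thermodynamic-database versions
        "inputs": [{"file": str(p), "sha256": sha256(p)} for p in inputs],
        "parameters": params,
    }
    out.write_text(json.dumps(manifest, indent=2))
    return manifest

if __name__ == "__main__":
    cfg = Path("solidification_input.yaml")
    cfg.write_text("cooling_rate_K_per_s: 10\n")            # placeholder input file
    print(write_manifest("scheil_solidification", [cfg],
                         params={"cooling_rate_K_per_s": 10},
                         tool_versions={"thermo_tool": "x.y.z"}))
```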
Limitations. LLMs may generate invalid inputs or omit tool-specific constraints due to incomplete internalization of software semantics. For complex simulators, human review and template-based validation remain necessary, and the system should explicitly surface “unknown/untested” regimes rather than extrapolating [77,78].

4.4. Experimental Data Analysis and Mechanism Hypothesis Generation

Aim and scope. This scenario targets the multimodal evidence-integration bottleneck in Section 2.2. By positioning the LLM as an integrator and interpreter—rather than a primary quantitative analyzer—the goal is to accelerate the transition from raw experimental outputs (XRD patterns, SEM/TEM images, stress–strain curves) to structured summaries and testable mechanistic hypotheses.
Tool-first hybrid pipeline. LLMs should not be used as the main source of quantitative measurements. A reliable workflow follows a tool-first principle: dedicated algorithms (e.g., XRD peak fitting, microstructure segmentation, mechanical curve parsing) extract structured, provenance-linked descriptors (phase fractions, grain-size statistics, yield strength, hardening metrics). The LLM (leveraging the capabilities in Section 3.3) then synthesizes these descriptors with experimental conditions and literature-grounded context to produce an evidence-linked narrative and propose plausible mechanisms [79,80,81,82].
Validation and artifacts. Every quantitative claim must be traceable to upstream tool outputs and raw-data identifiers, with preprocessing parameters and code versions recorded. Mechanistic interpretations should be explicitly labeled as hypotheses (e.g., serrated flow suggesting possible dynamic strain aging) and accompanied by discriminating validation experiments (e.g., strain-rate jump tests) [83,84,85,86].
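The following tool-first sketch illustrates this principle on a synthetic diffraction-like trace: a deterministic peak-finding routine produces structured, provenance-linked descriptors, and only these values (not the raw pattern) are handed to the LLM for interpretation.

```python
import json
import numpy as np
from scipy.signal import find_peaks

# Tool-first sketch: descriptors are extracted by a deterministic routine from a
# synthetic trace; the file name and analysis parameters are illustrative.
two_theta = np.linspace(20, 100, 4000)
pattern = (1000 * np.exp(-((two_theta - 43.6) / 0.25) ** 2)     # synthetic peaks
           + 450 * np.exp(-((two_theta - 50.8) / 0.25) ** 2)
           + 300 * np.exp(-((two_theta - 74.7) / 0.25) ** 2)
           + 20 * np.random.default_rng(0).random(4000))        # background noise

idx, _ = find_peaks(pattern, height=100, prominence=80)
descriptors = {
    "source_file": "sample_042_xrd.xy",                         # provenance identifier
    "peaks": [{"two_theta_deg": round(float(two_theta[i]), 2),
               "intensity": round(float(pattern[i]), 1)} for i in idx],
    "analysis": {"tool": "scipy.signal.find_peaks", "height": 100, "prominence": 80},
}
print(json.dumps(descriptors, indent=2))   # grounded input for LLM-based synthesis
```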
Limitations. Over-interpretation risk increases under low data quality or incomplete metadata, and expert verification remains essential for any conclusion that informs experimental decisions. Recent chart-understanding modules (e.g., ChartAdapter-style components) can improve LLM interaction with scientific plots but still require integration with robust quantitative toolchains [43,87] as illustrated in Figure 15.

4.5. Toward Executable, Trustworthy LLM Agent Architectures

To move from isolated tools (Section 4.1, Section 4.2, Section 4.3 and Section 4.4) to a persistent research assistant, HEA workflows benefit from an explicit human-in-the-loop agent architecture that coordinates auditable, reproducible multi-step execution (Section 2.4), rather than pursuing full autonomy. A practical design couples an LLM planner with (i) retrieval grounding, (ii) tool interfaces for simulation/data/analysis, (iii) validator gates enforcing physical and feasibility constraints, and (iv) state/memory tracking with complete decision logs. This enables a “plan–execute–verify” loop that outputs ranked, falsifiable hypotheses together with full audit artifacts (citations, constraints applied, tool outputs, and decision traces). The main limitation is scientific trust: uncertainty must be surfaced, and low-confidence steps must trigger abstention or expert escalation, with humans retaining final decision authority [88].

5. Challenges in Applying LLMs to HEA Research

While the potential of LLMs in HEA research is substantial, their robust and reliable deployment faces a series of interconnected hurdles. These hurdles can be categorized into technical challenges, intrinsic to the data and models themselves, and systemic, pragmatic challenges concerning integration, resources, and ethics. Addressing both categories is critical for transitioning from proof-of-concept demonstrations to trusted, everyday research tools.

5.1. Case Study: LLM-Augmented Discovery of a Corrosion-Resistant HEA

To illustrate how LLMs can combine the core capabilities in Section 3—evidence-grounded literature mining, workflow orchestration, and (when available) multimodal analysis—we outline an illustrative and technically grounded workflow for screening marine-grade corrosion-resistant HEA candidates. This example is not intended to claim autonomous discovery; rather, it highlights the guardrails and validation steps required for reliable use.
  • Phase 1: Evidence-grounded literature mining and hypothesis formulation.
The LLM is used to retrieve and summarize corrosion-related evidence from the HEA literature under explicit scope constraints (e.g., chloride environments; specimen form; processing state; corrosion metric). Instead of producing free-form claims, the system outputs a structured evidence table linking composition–processing–test conditions–metrics to cited passages. Within this evidence, the LLM identifies commonly reported trends (e.g., the roles of Cr and Mo in chloride media) and flags inconsistencies across studies, including reports that associate higher Mo additions with increased intergranular corrosion susceptibility under certain processing states. A resulting hypothesis is formulated as a conditional, testable claim with explicit boundaries; for example, “For AlCoCrFeNi-based alloys evaluated in chloride solutions under a specified processing state, modest Mo addition (≤3 at.%) together with elevated Cr content is hypothesized to improve pitting resistance relative to the AlCoCrFeNi baseline, with intergranular corrosion risk treated as a processing-dependent failure mode to be screened and validated [89,90,91].”
  • Phase 2: Tool-using screening with explicit constraints and validators.
Given a constrained screening request (e.g., candidates near AlCoCrFeNi with Mo = 1–3 at.% and Cr = 20–30 at.%), the LLM acts as an orchestrator that calls external tools rather than generating unverifiable results. A minimal pipeline includes: (i) composition constraints and safety/cost filters; (ii) phase-stability pre-screening using CALPHAD where database coverage is adequate, with database/version logging and explicit “coverage uncertain” flags; (iii) optional higher-fidelity calculations (e.g., targeted DFT on a small subset) and/or surrogate descriptor models for ranking within a defined scope [92]. Crucially, outputs are reported with provenance (tool versions, input files, assumptions) and ranked only for candidates evaluated under comparable conditions. The shortlist is accompanied by reasons for exclusion (e.g., predicted multiphase stability, database limitations, segregation risk) to prevent over-interpretation.
From this orchestrated screening, the workflow proceeds from producing a ranked shortlist to formulating concrete, testable hypotheses. Accordingly, a small set of example compositions within the stated bounds may be reported, e.g., Al19Co19Cr25Fe17Ni18Mo2, Al18Co18Cr28Fe16Ni18Mo2, and Al20Co18Cr22Fe18Ni19Mo3, which are treated as falsifiable candidates rather than definitive recommendations. Property expectations are stated comparatively against the AlCoCrFeNi baseline under the pre-registered protocol in Phase 3, with uncertainty reported when tool disagreement is non-negligible. Candidate retention is conditional on explicit validator gates, including: (i) CALPHAD applicability with database/version logging (otherwise flagged as “coverage uncertain”); (ii) exclusion when brittle intermetallic phases (e.g., σ/Laves/B2) are predicted above a user-defined tolerance in the target state; and (iii) downgrading or exclusion when Scheil-type segregation indicators suggest elevated processing-dependent intergranular-corrosion risk. These structured outputs (composition + validation flags + uncertainty markers) directly inform Phase 3 testing. Finally, common failure modes—such as spurious single-phase predictions arising from limited database support and processing-driven segregation/precipitation that can reverse Mo–Cr trends—are explicitly recorded so that null/negative outcomes are incorporated into the knowledge update rather than overgeneralized.
  • Phase 3: Experimental validation and structured knowledge update.
Top-ranked candidates are synthesized and tested under a pre-specified protocol (e.g., 3.5 wt.% NaCl, defined scan rate, reference electrode, surface preparation), with an explicit baseline comparator such as 304/316 stainless steel measured under the same protocol when feasible. The LLM can assist in organizing results and extracting structured descriptors from characterization outputs (XRD/SEM-EDS) but should not “confirm” phases without traceable analysis outputs. Validated results are stored as structured records (composition–processing–microstructure–property with test conditions and uncertainty) and linked to the supporting raw/processed data, enabling future retrieval and meta-analysis.
  • What this case study demonstrates.
The value of this workflow is auditable acceleration: it reduces friction in the literature synthesis, toolchain execution, and reporting while keeping each step verifiable through grounding, validators, and logged artifacts. The same structure also makes limitations explicit (database coverage, cross-study condition mismatch, and model uncertainty), which is essential for trustworthy use in HEA design.

5.2. Technical Challenges

5.2.1. Data Quality and Scarcity

The sparsity, heterogeneity, and noisiness of available HEA data hinder robust learning of composition–process–structure–property relationships. Reports are often inconsistent (qualitative claims without numerical values; incomplete processing metadata), biased toward “successful” results, and missing crucial context (heat-treatment schedules, impurity levels, specimen form, and test protocol). Emerging consolidated resources (e.g., CODHEM) are valuable but remain incomplete, and naive pooling can be misleading because properties such as yield strength are dominated by microstructural mechanisms that are not captured by simple mixture rules [93]. Figure 16 illustrates this limitation: rule-of-mixtures baselines can deviate substantially from measured values, especially for multiphase alloys, indicating that models trained on weakly labeled or poorly contextualized data may learn spurious correlations [94]. Consequently, credible LLM-enabled pipelines require curation, schema normalization, provenance, and conflict annotation, not merely larger volumes of text.

5.2.2. Model Accuracy, Hallucination and Reliability

LLMs can generate fluent but incorrect content (“hallucinations”), which is particularly risky in HEA research because erroneous phase claims, misreported processing parameters, or fabricated citations can misdirect expensive experiments and propagate unreproducible conclusions. Hallucination is a measurable failure mode: surveys define it as output that is unfaithful to underlying sources or inconsistent with verifiable facts and summarize detection and mitigation approaches across tasks [95]. For scientific writing and literature-enabled decision support, evaluations report non-trivial rates of incorrect or fabricated references when models are asked to generate citations, motivating strict provenance requirements and routine post-verification [96]. For example, a large-scale test reported fabricated citations at substantial rates (e.g., 55% for GPT-3.5 and 18% for GPT-4) [97], and systematic-review-style evaluations similarly found notable rates of hallucinated references (e.g., 28.6% for GPT-4 in that setting) [96]. Related “ghost” bibliographic references have also been discussed in the scientometrics literature [98]. Consistent with these findings, technical reports for frontier models acknowledge hallucinations as a core limitation, which is why retrieval-augmented generation (RAG), uncertainty quantification, and domain-constraint checking are widely recommended for high-stakes scientific applications.
In HEA workflows, hallucinations most commonly manifest as: (i) fabricated or misattributed references, (ii) incorrect phase-constitution statements (e.g., single-phase vs. multiphase), and (iii) incorrect or incomplete processing routes (solutionizing/aging temperatures, dwell times, quench paths) that are essential for reproducibility. Accordingly, reliability should be implemented as a three-layer guardrail: (1) grounding via retrieval with cited passages; (2) validation via physical constraints and external tools/databases (e.g., CALPHAD checks, composition constraints, safety rules); and (3) uncertainty-based triage that flags low-confidence outputs for human review [99,100].
Uncertainty-based hallucination detection. Unreliable outputs can be flagged using internal-state signals (e.g., token-level probabilities and entropy) or behavior-based signals (e.g., self-consistency under resampling, paraphrasing, or multi-agent critique). For example, SelfCheckGPT samples multiple responses and flags answers as uncertain when key facts disagree across samples [101,102,103,104,105,106,107]. In a materials context, repeatedly asking “What is the dominant strengthening mechanism in aged AlCoCrFeNi?” and receiving divergent mechanisms (e.g., precipitation vs. ordering vs. grain-boundary-dominated strengthening) should trigger retrieval-based grounding and/or expert validation before the claim is used to justify an HEA design decision [101,102,103,104,105,106,107].
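A minimal self-consistency sketch in the spirit of SelfCheckGPT is shown below; the ask_llm callable is a placeholder for whatever model API is actually in use, and the canned answers merely simulate divergent responses:

```python
import itertools
from collections import Counter
from typing import Callable, List

# Self-consistency sketch: ask the same question several times and flag the query
# when the sampled answers disagree on the key claim. `ask_llm` is a placeholder
# for the model API actually deployed; no real model is called here.
def flag_if_inconsistent(question: str, ask_llm: Callable[[str], str],
                         n_samples: int = 5, min_agreement: float = 0.8) -> dict:
    answers: List[str] = [ask_llm(question).strip().lower() for _ in range(n_samples)]
    _, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples
    return {"question": question, "answers": answers, "agreement": agreement,
            "needs_grounding": agreement < min_agreement}   # route to RAG / expert review

if __name__ == "__main__":
    canned = itertools.cycle(["precipitation strengthening",
                              "ordering (B2) strengthening",
                              "precipitation strengthening"])
    print(flag_if_inconsistent(
        "What is the dominant strengthening mechanism in aged AlCoCrFeNi?",
        ask_llm=lambda q: next(canned)))
```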

5.2.3. Integration of Multimodal Data and Context

HEA research is inherently multimodal: key evidence appears not only in text, but also in plots, microscopy images, and diffraction/spectroscopy outputs. However, current LLMs remain strongest in text understanding [108,109], and crucial experimental context may be missing if figures and raw signals are not explicitly incorporated. For example, the phrase “nanoscale precipitates” is not actionable without quantitative descriptors that are typically obtained from TEM (size distribution, morphology, number density, and spatial correlations). Therefore, integrated comprehension requires (i) aligned datasets that link text to experimental data and annotations, and (ii) multimodal architectures capable of reasoning jointly over language and scientific visual/signal modalities.
Until such capabilities mature, practical deployments should adopt hybrid pipelines: domain-specific vision/signal-processing tools first extract structured descriptors (e.g., grain size 15 ± 5 μm, phase fraction, peak positions/intensities, precipitate statistics), which are then provided to the LLM as grounded inputs for synthesis, explanation, and decision support [110]. This design keeps the LLM’s role primarily at the level of integration and reasoning, while measurement-level interpretation remains anchored to validated analysis routines.

5.3. Systemic and Pragmatic Challenges

5.3.1. Computational Resources and Domain Adaptation

Efficient, domain-adapted language models are likely necessary for broad academic adoption in materials research (e.g., a potential “MatSci-GPT”). Evidence from chemistry is instructive: Chemformer, pre-trained on SMILES strings, achieves strong performance across multiple chemistry tasks with far fewer parameters than general-purpose LLMs, illustrating the value of domain-specific pretraining and task-focused architectures [111] as schematized in Figure 17.
In contrast, frontier LLMs incur substantial computational costs during both training and inference. Training cost scales roughly with model size and training tokens and typically requires large GPU/TPU clusters, long wall-clock time, and significant energy use; GPT-3 (175B parameters) is a widely cited reference point, and follow-on analyses quantify associated energy use and carbon footprint [112]. Inference remains expensive due to sequential autoregressive decoding and the memory overhead of key–value caching, which grows with model size and context length, increasing serving cost and limiting throughput [113]. These constraints motivate efficiency-oriented strategies—distillation, quantization, parameter-efficient fine-tuning, and RAG—to combine smaller models with curated external knowledge. For HEA applications, full fine-tuning of frontier models may be infeasible for many labs, strengthening the case for standardized, user-friendly toolchains and validated workflows that lower both compute and expertise barriers.

5.3.2. Open-Source and Collaborative Infrastructure

Open-source, domain-adapted models and shared resources are critical for democratizing access and enabling reproducible research. Community toolkits that provide pretrained models, benchmark datasets, and standardized evaluation scripts can lower the entry barrier for materials researchers and reduce duplicated effort. Because training frontier-scale transformers typically requires large GPU clusters and substantial energy cost, and inference at scale incurs non-trivial compute and memory overhead, shared infrastructure and lightweight, well-documented models are particularly important for routine HEA workflows. For HEA-specific deployment, releasing prompts, corpus snapshots, extraction schemas, and validation scripts is often as important as releasing model weights, because these artifacts determine whether results can be reproduced and audited across labs.

5.3.3. Strategic Model Selection: Open vs. Closed Source

Choosing between closed-source and open-source LLMs poses practical trade-offs for academic labs. Closed-source models (e.g., GPT-4, Claude) often provide strong general reasoning and coding performance out of the box, enabling rapid prototyping of multi-step workflows. However, they raise concerns regarding cost, data governance, long-term reproducibility, and limited customization. Open-source models (e.g., Llama 3, Mistral) require more deployment expertise but provide control over data handling, enable parameter-efficient fine-tuning on HEA-specific corpora, and support reproducible evaluation under fixed model versions—an essential principle for scientific rigor. A pragmatic hybrid strategy is to use a powerful closed-source model for high-level planning and explanation as conceptualized in Figure 18, while delegating well-defined, repetitive tasks (e.g., entity extraction, schema filling, or citation verification) to smaller, fine-tuned open-source models. Continued development of community-driven, domain-adapted open models (e.g., an “HEA-focused” LLM) is therefore a key enabler for equitable and sustainable adoption [114].

5.3.4. Ethical Considerations, Bias, and the Challenge of Trustworthy AI

Deploying LLMs in HEA research introduces ethical and trustworthiness risks that must be managed explicitly. Because training data reflect historical publication practices, models may amplify selection bias (overrepresenting “successful” results) and preferentially recommend well-studied element sets while underexploring less conventional chemistries [115]. The opacity of model reasoning complicates accountability: when a model proposes a composition or process, researchers must be able to trace the evidence, constraints, and assumptions that produced the suggestion. Safety and sustainability concerns also arise if models recommend toxic, hazardous, or environmentally burdensome elements and processes without explicit guardrails. Finally, issues of intellectual property and academic integrity require attention: authorship attribution, ownership of model-generated hypotheses, and risks of unintentional plagiarism in the literature summaries must be handled through transparent disclosure and citation practices. Addressing these concerns requires domain-specific guardrails (toxicity/cost/sustainability filters), provenance-first interfaces, human-in-the-loop review for high-stakes decisions, and evaluation benchmarks that include robustness, bias, and interpretability metrics.

5.3.5. Towards a Human-in-the-Loop Collaborative Model

Successful integration of LLMs into HEA research depends on designing effective human–AI collaboration rather than attempting to replace domain expertise. In a practical augmented-intelligence workflow, the LLM performs data-intensive tasks (retrieval, structured extraction, drafting protocols, and pre-analysis), while the materials scientist sets objectives, specifies constraints, designs decisive experiments, and validates interpretations as visualized in Figure 19. To support this division of labor, LLM interfaces should communicate provenance and uncertainty, expose intermediate artifacts (retrieved passages, extracted tables, tool outputs), and make it easy for experts to audit and correct errors. Establishing transparent human-in-the-loop processes is therefore a prerequisite for trustworthy, routine use of LLM-enabled workflows in HEA research [116].

6. Future Directions

LLM-enabled HEA research is likely to evolve from assistive tools toward more integrated, tool-using systems, and—under strict guardrails—toward partial autonomy in well-scoped tasks [117]. Rather than a deterministic timeline, progress will depend on data infrastructure, evaluation standards, and the maturity of validation pipelines. The directions below highlight concrete research priorities that can move the field from promising demonstrations to dependable workflows.

6.1. Domain-Specialized Foundation Models

A key near-term direction is the development of materials-focused language models and multimodal models that are pre-trained and fine-tuned on curated materials corpora, schemas, and ontologies (e.g., an HEA-focused model or MatSciBERT-style encoders for extraction) [118]. Compared with general-purpose LLMs, such models can better reflect domain terminology and reporting conventions, support structured extraction (composition–processing–microstructure–property with test conditions), and reduce failure modes arising from ambiguous or incomplete context. Importantly, “domain specialization” should be defined operationally: improvements should be demonstrated through lower hallucination rates, stronger provenance behavior, and better performance on HEA-relevant tasks such as constraint-aware composition suggestion, phase reasoning grounded in thermodynamic checks, and schema-consistent data extraction.

6.2. Closed-Loop, Autonomous Research Pipelines

LLMs are increasingly discussed as components of “self-driving laboratories,” where an AI system integrates language models with experimental automation and simulation/analysis engines to propose candidates, execute measurements or calculations, analyze outcomes, and iteratively refine the search strategy [119,120,121]. In HEA contexts, a realistic near-term pathway is semi-autonomous closed-loop operation: high-throughput synthesis (e.g., spark plasma sintering or combinatorial methods) and automated characterization (e.g., XRD, hardness/nanoindentation) generate standardized outputs that are ingested by an LLM-driven agent, which then (i) updates surrogate models or constraints, (ii) proposes the next experiments under explicit safety/feasibility rules, and (iii) logs provenance and uncertainty for human review. The practical impact will be determined by automation throughput and, critically, by validation requirements; therefore, reporting should emphasize verified cycle time, decision auditability, and failure handling rather than aspirational “full autonomy.”

6.3. Enhanced Scientific Discovery and Reasoning

LLMs may contribute beyond information access by supporting structured reasoning—provided that outputs are grounded, validated, and framed as hypotheses rather than conclusions, and paired with decisive tests.

6.3.1. LLM-Assisted Hypothesis Generation and Pattern Discovery

LLMs can help synthesize dispersed evidence and propose structured “what-if” hypotheses together with test plans and the evidence that motivated them. For example, an LLM might connect observations across studies and propose a mechanism-transfer hypothesis (e.g., “a reported creep improvement in Hf-containing alloys may be consistent with oxide-dispersion-related strengthening under specific processing histories; test transferability by specifying alloy family, oxidation/processing steps, and a defined creep metric”), while explicitly listing the assumptions, the key citations, and the discriminating experiments needed for validation [122,123,124,125,126]. The emphasis should be on traceable rationale and decision-relevant experiments rather than narrative plausibility.

6.3.2. LLM-Centered Meta-Optimization and “Process Engineering”

LLMs can also be placed in meta-optimization loops where they propose candidate scoring rules, feature representations, or small program fragments that are evaluated by deterministic tools (e.g., CALPHAD/DFT/surrogate models/experiments). In FunSearch-style workflows, the model generates candidate heuristics, an external evaluator scores them, and the best candidates are iteratively retained and mutated, enabling systematic exploration of rule space under hard constraints [127] as schematized in Figure 20. For HEA research, this paradigm is most compelling when (i) objective functions are explicitly defined, (ii) evaluators are deterministic and logged, and (iii) constraints (phase stability, toxicity/cost, manufacturability) are enforced to prevent “reward hacking” and physically implausible solutions.

6.4. Open and Collaborative Ecosystems

The long-term impact of LLMs will depend on the surrounding ecosystem: open, community-maintained HEA datasets; transparent benchmarks; and reproducible, domain-adapted reference implementations [128]. Efforts such as the Materials Data Facility and HEA-focused database initiatives are important steps, but tighter integration with LLM pipelines is needed so that curated datasets, thermodynamic calculations, and microstructural descriptors can be accessed programmatically with standardized schemas and provenance tracking [129]. Achieving this requires community-level practice changes: richer metadata standards, more systematic reporting of negative/inconclusive results, and routine sharing of data/code/workflows in machine-readable formats. Conversely, AI developers must tailor models, prompting, and evaluation to HEA-specific heterogeneity and condition dependence rather than treating the problem as generic text mining [130]. In such an ecosystem, LLMs can also act as coordination layers—translating between metallurgical concepts and ML abstractions and converting human-readable protocols into reproducible, trackable workflows—while preserving attribution and provenance [131].

6.5. Benchmarks and Evaluation for HEA-Focused LLMs

Systematic evaluation is essential for trustworthy adoption. HEA-focused benchmarks should cover tasks such as schema-consistent data extraction, composition–phase reasoning under constraints, mechanistic explanation with evidence, and constrained design with explicit feasibility rules. Evaluation should combine automatic metrics (extraction accuracy, constraint-violation rate, citation existence/relevance) with expert review for physical plausibility and decision utility. Existing materials benchmarks (e.g., question answering and reasoning suites) provide a starting point but require HEA-specific extensions and leakage-resistant split strategies. Recent benchmark results also indicate that even strong models can perform well on retrieval-style tasks while struggling with deeper interpretation, underscoring the need for targeted development and evaluation protocols [132,133] as exemplified in Figure 21.

7. Benchmarks, Protocols, and Considerations for Reproducibility

Transitioning from proof-of-concept demonstrations to reliable, everyday research tools requires standardized evaluation tailored to the unique nature of LLMs. Their capacity to generate code, text, and multimodal interpretations necessitates bespoke benchmarks that go beyond traditional machine learning metrics to address performance, reliability, and reproducibility. This section outlines a framework for evaluating LLMs within HEA research, detailing the required foundational resources, application-specific metrics, and critical calibration protocols for scientific outputs.

7.1. Foundational Resources for Development and Benchmarking

A rigorous, reproducible evaluation ecosystem is built on accessible, high-quality datasets, tools, and reference implementations. Curated experimental databases, such as the Consolidated Open Database for High-Entropy Materials (CODHEM), provide reference property data for model validation and for verifying retrieved claims in RAG pipelines. Computational repositories like the Materials Project, accessible via open APIs, offer critical data for validating the thermodynamic stability of design suggestions [134]. For knowledge representation and reasoning assessment, prototypes of large-scale materials knowledge graphs provide essential templates. Simultaneously, open-source software toolkits deliver pre-trained models and benchmarks that significantly lower the barrier to developing domain-adapted NLP models [110]. Furthermore, foundation models pre-trained in adjacent domains, such as chemistry, offer invaluable architectural and methodological references for developing specialized counterparts in HEA research [135]. The effective utilization of this integrated resource landscape is fundamental for fostering a reproducible and collaborative research environment [60].

7.2. Evaluation Metrics

A robust evaluation framework must define clear tasks, metrics, and baselines for each major category of LLM application in HEA research.
1.
Literature Mining and Knowledge Extraction
The primary task involves processing a corpus of HEA literature to identify, normalize, and relate structured entities such as alloy compositions, processing routes, phases, and property values [136]. Performance is measured using precision, recall, and F1-score for named entity recognition and relation extraction on manually annotated test sets. A critical additional metric is normalization accuracy, which assesses the model’s ability to map varied textual representations to canonical forms. Relevant baselines include rule-based extractors, general-domain NER models, and domain-specific pre-trained encoders fine-tuned on materials science text [137].
2. Code and Workflow Generation
Evaluation here centers on the model’s ability to translate high-level natural language descriptions into syntactically correct and functionally appropriate code or configuration files [138]. Key metrics include the first-pass execution success rate, i.e., the percentage of generated scripts that run without errors, and the human-in-the-loop correction time (minutes per script under a fixed protocol), which measures the efficiency gain relative to manual script writing [139]; a minimal harness for the success-rate metric is sketched after this list. These should be benchmarked against template-based code generation or few-shot prompting of general-purpose LLMs without domain-specific fine-tuning [140].
3. Multimodal Scientific Understanding
This task requires models to interpret scientific images—such as SEM micrographs, XRD patterns, or stress–strain curves—often together with contextual information, in order to provide descriptions, identify features, or answer questions. Evaluation relies on classification accuracy (e.g., feature or phase-label identification under an expert-defined label set) measured against expert-annotated ground truth. Emerging multimodal science models adapted to materials data, as well as hybrid pipelines combining specialized computer vision models with LLMs, serve as relevant baselines [141,142].
4. Constrained Design and Hypothesis Generation
This advanced task challenges models to propose novel, plausible alloy compositions or processing routes that satisfy given design objectives and domain constraints, justified by underlying principles. Two primary metrics are essential: the physical plausibility rate, i.e., the fraction of suggestions satisfying basic domain constraints, and the experimental validation hit rate, i.e., the percentage of candidates that meet target properties upon actual synthesis and testing [143]; a minimal plausibility-rate check is sketched after this list. These capabilities should be compared with traditional approaches such as constrained random sampling, descriptor-based optimization algorithms, or active learning loops without LLM-based reasoning [144].
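For task 1, the extraction metrics reduce to simple set comparisons once predictions and gold annotations share a schema. The sketch below is a minimal illustration only; the (surface form, entity type) tuple schema and the example annotations are assumptions made for demonstration.

```python
# Illustrative schema: extracted items are (surface form, entity type) tuples;
# the gold annotations and normalization maps below are hypothetical.
def prf1(predicted: set, gold: set):
    """Exact-match precision, recall, and F1 over extracted items."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def normalization_accuracy(pred_map: dict, gold_map: dict) -> float:
    """Fraction of gold surface forms mapped to the correct canonical form."""
    correct = sum(pred_map.get(surface) == canon for surface, canon in gold_map.items())
    return correct / len(gold_map) if gold_map else 0.0

gold = {("AlCoCrFeNi", "ALLOY"), ("arc melting", "PROCESS"), ("FCC", "PHASE")}
pred = {("AlCoCrFeNi", "ALLOY"), ("FCC", "PHASE"), ("1200 C", "PROPERTY")}
print("NER/RE P, R, F1:", tuple(round(x, 2) for x in prf1(pred, gold)))  # (0.67, 0.67, 0.67)

gold_norm = {"Al-Co-Cr-Fe-Ni": "AlCoCrFeNi", "face-centred cubic": "FCC"}
pred_norm = {"Al-Co-Cr-Fe-Ni": "AlCoCrFeNi", "face-centred cubic": "BCC"}
print("Normalization accuracy:", normalization_accuracy(pred_norm, gold_norm))  # 0.5
```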
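For task 2, the first-pass execution success rate can be measured with a small harness that runs each generated script in a subprocess. The directory name, timeout value, and absence of sandboxing are simplifying assumptions; production use would require isolated execution environments (containers, resource limits).

```python
# Minimal harness (illustrative): count LLM-generated Python scripts that run
# without errors on the first attempt; timeouts are counted as failures.
import subprocess
import sys
from pathlib import Path

def first_pass_success_rate(script_dir: str, timeout_s: int = 120) -> float:
    scripts = sorted(Path(script_dir).glob("*.py"))
    if not scripts:
        return 0.0
    successes = 0
    for script in scripts:
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
            successes += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # a hung script counts as a failure
    return successes / len(scripts)

# Hypothetical usage on a folder of scripts generated from task descriptions.
print(f"First-pass success: {first_pass_success_rate('generated_scripts'):.1%}")
```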
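For task 4, the physical plausibility rate depends entirely on the feasibility rules chosen by the benchmark designers. The sketch below applies deliberately simple, illustrative rules (fraction normalization, an element-count window, an allowed-element palette, and a configurational-entropy floor); the specific thresholds and palette are assumptions for demonstration, not recommended screening criteria.

```python
import math

# Illustrative feasibility rules only: the allowed-element palette, the 4-8
# element window, and the entropy floor are assumptions, not recommendations.
ALLOWED = {"Al", "Co", "Cr", "Cu", "Fe", "Mn", "Mo", "Nb", "Ni", "Ti", "V", "Zr"}
R = 8.314  # gas constant, J/(mol K)

def is_plausible(comp: dict) -> bool:
    """comp maps element symbols to atomic fractions."""
    fractions = list(comp.values())
    if not math.isclose(sum(fractions), 1.0, abs_tol=1e-3):
        return False                          # fractions must sum to unity
    if any(x <= 0.0 for x in fractions):
        return False                          # no zero or negative fractions
    if not (4 <= len(comp) <= 8):
        return False                          # multi-principal-element range
    if not set(comp) <= ALLOWED:
        return False                          # restrict to the design palette
    s_mix = -R * sum(x * math.log(x) for x in fractions)
    return s_mix >= 1.3 * R                   # illustrative configurational-entropy floor

def plausibility_rate(candidates: list) -> float:
    return sum(map(is_plausible, candidates)) / len(candidates) if candidates else 0.0

suggestions = [
    {"Co": 0.25, "Cr": 0.25, "Fe": 0.25, "Ni": 0.25},        # passes (S_mix ≈ 1.39R)
    {"Al": 0.50, "Ni": 0.50},                                 # fails: binary
    {"Co": 0.2, "Cr": 0.2, "Fe": 0.2, "Ni": 0.2, "Xx": 0.2},  # fails: unknown element
]
print(f"Physical plausibility rate: {plausibility_rate(suggestions):.2f}")  # 0.33
```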

7.3. Special Considerations for Calibrating Scientific Numeric and Symbolic Outputs

HEA research demands precise handling of both numeric values and symbolic notations, requiring tailored calibration protocols. For numeric outputs, such as predicted yield strength, models should provide both a point estimate and a confidence interval derived from internal uncertainty or ensemble methods; this uncertainty must be calibrated against experimental measurement error ranges to be meaningful [145]. For symbolic or formulaic outputs, a rule-based post-processor is often needed to verify syntactic correctness and to cross-reference symbols (e.g., phase labels or candidate compositions) against authoritative materials databases to confirm that they exist. Finally, maintaining unit consistency is critical: a dedicated module should check and convert all physical quantities to a standard unit system before any downstream use or comparison, ensuring data integrity and interoperability. A minimal sketch combining unit normalization with an interval-coverage check follows.
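The sketch below combines the two checks discussed above: quantities are first normalized to a single unit (MPa, via an illustrative conversion table), and the empirical coverage of nominal 90% prediction intervals is then compared against experimental values. All numbers are hypothetical; a calibrated model would achieve coverage close to the stated 90%.

```python
# Illustrative unit table and data: normalize yield-strength values to MPa,
# then compute the empirical coverage of nominal 90% prediction intervals.
TO_MPA = {"MPa": 1.0, "GPa": 1000.0, "kPa": 1e-3, "Pa": 1e-6}

def to_mpa(value: float, unit: str) -> float:
    return value * TO_MPA[unit]  # raises KeyError for unrecognized units (fail loudly)

def empirical_coverage(predictions, measurements) -> float:
    """predictions: (low, high, unit) tuples; measurements: (value, unit) tuples."""
    covered = 0
    for (lo, hi, p_unit), (meas, m_unit) in zip(predictions, measurements):
        lo_mpa, hi_mpa = to_mpa(lo, p_unit), to_mpa(hi, p_unit)
        covered += int(lo_mpa <= to_mpa(meas, m_unit) <= hi_mpa)
    return covered / len(measurements)

# Hypothetical predictions (nominal 90% intervals) versus experimental values.
preds = [(0.95, 1.25, "GPa"), (620.0, 780.0, "MPa"), (1.10, 1.40, "GPa")]
exper = [(1100.0, "MPa"), (815.0, "MPa"), (1.23, "GPa")]
print(f"Nominal 90% intervals, empirical coverage: {empirical_coverage(preds, exper):.0%}")
# 67% -> intervals are under-covering and thus miscalibrated
```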

8. Summary and Key Findings

The systematic integration of LLMs into HEA research necessitates addressing interconnected challenges while leveraging new capabilities. To synthesize the discussion across the preceding sections, the core dimensions, current bottlenecks, LLM-enabled opportunities, and priority actions are consolidated in Table 3.

9. Conclusions

The pursuit of high-performance high-entropy alloys (HEAs) is hampered by the high-dimensional, non-linear mapping between composition, processing, microstructure, and properties, together with fragmented, sparse, and sometimes conflicting experimental evidence. Large language models (LLMs) can provide a practical interface layer to navigate this complexity—not as a replacement for domain expertise, but as an augmenting tool. Consistent with recent pioneering studies, LLMs can contribute through three concrete pathways: (i) literature mining and structuring of dispersed HEA knowledge; (ii) learning transferable representations for property prediction via domain-tailored pretraining and fine-tuning; and (iii) orchestrating computational and, in early demonstrations, semi-autonomous experimental workflows with external verifiers.
Our analysis also highlights a critical reliability gap that must be addressed before LLMs can be considered dependable HEA research partners. Key priorities include: rigorous out-of-distribution (OOD) evaluation across composition and processing regimes; uncertainty-calibrated outputs for both numeric predictions and symbolic suggestions (e.g., phase labels and candidate compositions); enforcement of physical and thermodynamic constraints to filter implausible designs; and scientific integrity practices such as traceable data provenance, leakage-safe evaluation protocols, and transparent sharing of prompts, splits, and code where feasible. Importantly, these requirements are measurable—through citation validity, OOD performance, calibration, and constraint-satisfaction rates—and should be reported alongside conventional accuracy metrics.
To close this gap, we propose a community-focused, HEA-tailored roadmap. In the near term (1–2 years), high-impact efforts should center on retrieval-augmented systems that convert the HEA corpus into structured, auditable knowledge, together with open benchmarks for extraction quality and dataset reliability. In the mid-term (2–4 years), domain-adapted models should be evaluated on multi-task HEA benchmarks that test prediction, explanation, and constrained inverse design, with an emphasis on generalization and uncertainty awareness. Longer-term (4–5+ years) closed-loop discovery platforms will require LLM agents that remain subordinate to continuous verification by physics-based models (e.g., CALPHAD, DFT, and validated surrogates) within auditable, human-in-the-loop frameworks.
Ultimately, accelerated HEA discovery will hinge less on model scaling alone and more on co-evolution: domain-adapted models grounded in high-quality data, interoperable and trustworthy research infrastructure, and evaluation standards that faithfully reflect the complexity of the alloy design space [146,147].

Author Contributions

Y.G.: methodology; software; validation; formal analysis; investigation; resources; data curation; writing—original draft preparation; C.Y.: conceptualization; writing—review and editing; visualization; supervision; project administration; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the financial support of Shanghai Natural Science Foundation (25ZR1401430) and Science and Technology Co-operation Program of Shanghai Jiao Tong University in Inner Mongolia Autonomous Region-Action Plan of Shanghai Jiao Tong University for “Revitalizing Inner Mongolia through Science and Technology” (2023XYJG0001-01-01).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yeh, J.-W.; Chen, S.K.; Lin, S.-J.; Gan, J.-Y.; Chin, T.-S.; Shun, T.-T.; Tsau, C.-H.; Chang, S.-Y. Nanostructured high-entropy alloys with multiple principal elements: Novel alloy design concepts and outcomes. Adv. Eng. Mater. 2004, 6, 299–303. [Google Scholar] [CrossRef]
  2. Fisher, D. High-Entropy Alloys—Microstructures and Properties; Trans Tech Publications Ltd.: Zürich, Switzerland, 2015. [Google Scholar]
  3. Nartita, R.; Ionita, D.; Demetrescu, I.A. A modern approach to heas: From structure to properties and potential applications. Crystals 2024, 14, 451. [Google Scholar] [CrossRef]
  4. Xiong, W.; Guo, A.X.Y.; Zhan, S.; Liu, C.-T.; Cao, S.C. Refractory high-entropy alloys: A focused review of preparation methods and properties. J. Mater. Sci. Technol. 2023, 142, 196–215. [Google Scholar] [CrossRef]
  5. Atli, K.C.; Karaman, I. A short review on the ultra-high temperature mechanical properties of refractory high entropy alloys. Front. Met. Alloys 2023, 2, 1135826. [Google Scholar] [CrossRef]
  6. Cui, J.-M.; Nong, Z.-S.; Cui, X.; Xu, Q.-G.; Zhang, H.-L.; Leng, Y.; Xu, R.-Z.; Arzikulov, E. Effect of carbon addition on microstructures and mechanical properties of laser cladding alcocrfeni2.1 alloy coatings. Mater. Today Commun. 2025, 42, 111534. [Google Scholar] [CrossRef]
  7. Hamdi, H.; Abedi, H.R.; Zhang, Y. A review study on thermal stability of high entropy alloys: Normal/abnormal resistance of grain growth. J. Alloys Compd. 2023, 960, 170826. [Google Scholar] [CrossRef]
  8. Li, D.; Liaw, P.K.; Xie, L.; Zhang, Y.; Wang, W. Advanced high-entropy alloys breaking the property limits of current materials. J. Mater. Sci. Technol. 2024, 186, 219–230. [Google Scholar] [CrossRef]
  9. Tsai, K.Y.; Tsai, M.H.; Yeh, J.W. Sluggish diffusion in co–cr–fe–mn–ni high-entropy alloys. Acta Mater. 2013, 61, 4887–4897. [Google Scholar] [CrossRef]
  10. Dąbrowa, J.; Zajusz, M.; Kucza, W.; Cieślak, G.; Berent, K.; Czeppe, T.; Danielewski, M. Demystifying the sluggish diffusion effect in high entropy alloys. J. Alloys Compd. 2019, 783, 193–207. [Google Scholar] [CrossRef]
  11. Arun, S.; Radhika, N.; Saleh, B. Effect of additional alloying elements on microstructure and properties of alcocrfeni high-entropy alloy system: A comprehensive review. Met. Mater. Int. 2025, 31, 285–324. [Google Scholar] [CrossRef]
  12. Himanen, L.; Geurts, A.; Foster, A.S.; Rinke, P. Data-driven materials science: Status, challenges, and perspectives. Adv. Sci. 2019, 6, 1900808. [Google Scholar] [CrossRef]
  13. Liu, Z.-K. First-principles calculations and calphad modeling of thermodynamics. In Zentropy; Jenny Stanford Publishing: Singapore, 2024; pp. 3–50. [Google Scholar]
  14. Hambarde, K.A.; Proenca, H. Information retrieval: Recent advances and beyond. IEEE Access 2023, 11, 76581–76604. [Google Scholar] [CrossRef]
  15. Han, S.; Wang, M.; Zhang, J.; Li, D.; Duan, J. A review of large language models: Fundamental architectures, key technological evolutions, interdisciplinary technologies integration, optimization and compression techniques, applications, and challenges. Electronics 2024, 13, 5040. [Google Scholar] [CrossRef]
  16. Li, Z.; Pradeep, K.G.; Deng, Y.; Raabe, D.; Tasan, C.C. Metastable high-entropy dual-phase alloys overcome the strength–ductility trade-off. Nature 2016, 534, 227–230. [Google Scholar] [CrossRef]
  17. Lei, Z.; Liu, X.; Yuan, W.; Wang, H.; Jiang, S.; Wang, S.; Hui, X.; Wu, Y.; Gault, B.; Kontis, P.; et al. Enhanced strength and ductility in a high-entropy alloy via ordered oxygen complexes. Nature 2018, 563, 546–550. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, Y.; Chen, T.; Tan, L.; Poplawsky, J.D.; An, K.; Wang, Y.; Samolyuk, G.D.; Littrell, K.; Lupini, A.R.; Borisevich, A.; et al. Bifunctional nanoprecipitates strengthen and ductilize a medium-entropy alloy. Nature 2021, 595, 245–249. [Google Scholar] [CrossRef]
  19. Yang, T.; Zhao, Y.L.; Tong, Y.; Jiao, Z.B.; Wei, J.; Cai, J.X.; Han, X.D.; Chen, D.; Hu, A.; Kai, J.J.; et al. Multicomponent intermetallic nanoparticles and superb mechanical behaviors of complex alloys. Science 2018, 362, 933–937. [Google Scholar] [CrossRef] [PubMed]
  20. Huo, W.; Fang, F.; Zhou, H.; Xie, Z.; Shang, J.; Jiang, J. Remarkable strength of cocrfeni high-entropy alloy wires at cryogenic and elevated temperatures. Scr. Mater. 2017, 141, 125–128. [Google Scholar] [CrossRef]
  21. Liu, X.; Zhang, J.; Pei, Z. Machine learning for high-entropy alloys: Progress, challenges and opportunities. Prog. Mater. Sci. 2023, 131, 101018. [Google Scholar] [CrossRef]
  22. Elkatatny, S.; Abd-Elaziem, W.; Sebaey, T.A.; Darwish, M.A.; Hamada, A. Machine-learning synergy in high-entropy alloys: A review. J. Mater. Res. Technol. 2024, 33, 3976–3997. [Google Scholar] [CrossRef]
  23. Golbabaei, M.H.; Zohrevand, M.; Zhang, N. Applications of machine learning in high-entropy alloys: A review of recent advances in design, discovery, and characterization. Nanoscale 2025, 17, 20548–20605. [Google Scholar] [CrossRef]
  24. Sun, Y.; Ni, J. Machine learning advances in high-entropy alloys: A mini-review. Entropy 2024, 26, 1119. [Google Scholar] [CrossRef]
  25. Bagdasaryan, A.; Pshyk, A.; Coy, L.; Kempiński, M.; Pogrebnjak, A.; Beresnev, V.; Jurga, S. Structural and mechanical characterization of (TiZrNbHfTa)N/WN multilayered nitride coatings. Mater. Lett. 2018, 229, 364–367. [Google Scholar] [CrossRef]
  26. Wang, W.; Wang, J.; Yi, H.; Qi, W.; Peng, Q. Effect of molybdenum additives on corrosion behavior of (CoCrFeNi)100−xMox high-entropy alloys. Entropy 2018, 20, 908. [Google Scholar] [CrossRef]
  27. Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; Kononova, O.; Persson, K.A.; Ceder, G.; Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019, 571, 95–98. [Google Scholar] [CrossRef]
  28. Yurchenko, N.; Stepanov, N.; Salishchev, G. Laves-phase formation criterion for high-entropy alloys. Mater. Sci. Technol. 2017, 33, 17–22. [Google Scholar] [CrossRef]
  29. Zhang, J.; Chen, X.; Ye, X.; Yang, Y.; Ai, B. Large language model in materials science: Roles, challenges, and strategic outlook. Adv. Intell. Discov. 2025, 202500085. [Google Scholar] [CrossRef]
  30. Miret, S.; Krishnan, N.A. Enabling large language models for real-world materials discovery. Nat. Mach. Intell. 2025, 7, 991–998. [Google Scholar] [CrossRef]
  31. Lv, T.; Zou, W.; He, J.; Ju, X.; Zheng, C. Study on the microstructure and properties of feconicral high-entropy alloy coating prepared by laser cladding–remelting. Coatings 2023, 14, 49. [Google Scholar] [CrossRef]
  32. Cui, M.; Zhang, Y.; Xu, B.; Xu, F.; Chen, J.; Zhang, S.; Chen, C.; Luo, Z. High-entropy alloy nanomaterials for electrocatalysis. Chem. Commun. 2024, 60, 12615–12632. [Google Scholar] [CrossRef]
  33. Yue, W.; Zhang, Y.; Zheng, Z.; Lai, Y. Hybrid laser additive manufacturing of metals: A review. Coatings 2024, 14, 315. [Google Scholar] [CrossRef]
  34. Zheng, H.; Fu, J.; Wang, Y.; Dong, Y. Controlling microstructural gradients in laser-clad alcocrfeni2.1 eheas. Surf. Coat. Technol. 2025, 518, 132885. [Google Scholar] [CrossRef]
  35. Hashemi, S.M.; Parvizi, S.; Baghbanijavid, H.; Tan, A.T.L.; Nematollahi, M.; Ramazani, A.; Fang, N.X.; Elahinia, M. Computational modelling of process–structure–property–performance relationships in metal additive manufacturing: A review. Int. Mater. Rev. 2022, 67, 1–46. [Google Scholar] [CrossRef]
  36. Zhang, J.; Cai, C.; Kim, G.; Wang, Y.; Chen, W. Composition design of high-entropy alloys with deep sets learning. npj Comput. Mater. 2022, 8, 89. [Google Scholar] [CrossRef]
  37. Gorsse, S.; Couzinié, J.-P.; Miracle, D.B. Database on the mechanical properties of high entropy alloys and complex concentrated alloys. Data Brief 2018, 21, 2664–2678. [Google Scholar] [CrossRef]
  38. Swain, M.C.; Cole, J.M. Chemdataextractor: A toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 2016, 56, 1894–1904. [Google Scholar] [CrossRef]
  39. Otis, R.; Liu, Z.-K. Pycalphad: Calphad-based computational thermodynamics in python. J. Open Res. Softw. 2017, 5, 13. [Google Scholar] [CrossRef]
  40. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; McGrew, B. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  41. Hu, S.; Ouyang, M.; Gao, D.; Shou, M.Z. The dawn of gui agent: A preliminary case study with claude 3.5 computer use. arXiv 2024, arXiv:2411.10323. [Google Scholar] [CrossRef]
  42. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  43. Xu, P.; Ding, Y.; Fan, W. ChartAdapter: Large Vision-Language Model for Chart Summarization. arXiv 2024, arXiv:2412.20715. [Google Scholar]
  44. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. arXiv 2022, arXiv:2210.03629v3. [Google Scholar]
  45. Venugopal, V.; Olivetti, E. Matkg: An autonomously generated knowledge graph in materials science. Sci. Data 2024, 11, 217. [Google Scholar] [CrossRef]
  46. Lála, J.; O’Donoghue, O.; Shtedritski, A.; Cox, S.; Rodriques, S.G.; White, A.D. Paperqa: Retrieval-augmented generative agent for scientific research. arXiv 2023, arXiv:2312.07559. [Google Scholar]
  47. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  48. Huang, S.; Cole, J.M. A database of battery materials auto-generated using chemdataextractor. Sci. Data 2020, 7, 260. [Google Scholar] [CrossRef] [PubMed]
  49. Huang, S.; Cole, J.M. Batterybert: A pretrained language model for battery database enhancement. J. Chem. Inf. Model. 2022, 62, 6365–6377. [Google Scholar] [CrossRef] [PubMed]
  50. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv 2023, arXiv:2302.04761. [Google Scholar]
  51. Jain, A.; Ong, S.P.; Chen, W.; Medasani, B.; Qu, X.; Kocher, M.; Brafman, M.; Petretto, G.; Rignanese, G.-M.; Hautier, G.; et al. Fireworks: A dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 2015, 27, 5037–5059. [Google Scholar] [CrossRef]
  52. Weng, Y.; Gao, L.; Zhu, L.; Huang, J. Matqna: A benchmark dataset for multimodal large language models in materials characterization and analysis. arXiv 2025, arXiv:2509.11335. [Google Scholar]
  53. Zipoli, F.; Viterbo, V.; Schilter, O.; Kahle, L.; Laino, T. Prediction of phase diagrams and associated phase structural properties. Ind. Eng. Chem. Res. 2022, 61, 8378–8389. [Google Scholar] [CrossRef]
  54. Hao, Y.; Duo, L.; He, J. Autonomous materials synthesis laboratories: Integrating artificial intelligence with advanced robotics for accelerated discovery. ChemRxiv 2025. [Google Scholar] [CrossRef]
  55. Maqsood, A.; Chen, C.; Jacobsson, T.J. The future of material scientists in an age of artificial intelligence. Adv. Sci. 2024, 11, 2401401. [Google Scholar] [CrossRef]
  56. Kamnis, S. Introducing pre-trained transformers for high entropy alloy informatics. Mater. Lett. 2024, 358, 135871. [Google Scholar] [CrossRef]
  57. Kamnis, S.; Delibasis, K. High entropy alloy property predictions using a transformer-based language model. Sci. Rep. 2025, 15, 11861. [Google Scholar] [CrossRef] [PubMed]
  58. Zhen, S.; Zhang, L. AI-Driven Design of High-Entropy Alloys for Efficient Hydrogen Electrocatalysis. ChemRxiv 2025. [Google Scholar] [CrossRef]
  59. Luo, M.; Xie, Z.; Li, H.; Zhang, B.; Cao, J.; Huang, Y.; Qu, H.; Zhu, Q.; Chen, L.; Jiang, J.; et al. Physics-informed, dual-objective optimization of high-entropy-alloy nanozymes by a robotic AI chemist. Matter 2025, 8. [Google Scholar] [CrossRef]
  60. Choudhary, K. Atomgpt: Atomistic generative pretrained transformer for forward and inverse materials design. J. Phys. Chem. Lett. 2024, 15, 6909–6917. [Google Scholar] [CrossRef]
  61. Kumar, S.; Sourav, A.; Yebaji, S.; Chauhan, L.; Babu, A.; Chelvane, A. Effect of heat treatment on the oxidation behavior of an alcocrfeni2 near-eutectic high entropy alloy. Corros. Sci. 2023, 221, 111298. [Google Scholar] [CrossRef]
  62. Kanyane, L.R.; Malatji, N.; Shongwe, M.B. Hot corrosion, phase stability and compressive strength of alcrfenicu-nb high entropy alloy fabricated via additive manufacturing. Solid State Phenom. 2025, 378, 45–51. [Google Scholar] [CrossRef]
  63. Zhao, Y.M.; Zhang, J.Y.; Liaw, P.K.; Yang, T. Machine learning-based computational design methods for high-entropy alloys. High Entropy Alloys Mater. 2025, 3, 41–100. [Google Scholar] [CrossRef]
  64. Jiang, D.; Xie, L.; Wang, L. Current application status of multiscale simulation and machine learning in research on high entropy alloys. J. Mater. Res. Technol. 2023, 26, 1341–1374. [Google Scholar] [CrossRef]
  65. Zhao, S.; Jiang, B.; Song, K.; Liu, X.; Wang, W.; Si, D.; Zhang, J.; Chen, X.; Zhou, C.; Liu, P.; et al. Machine learning assisted design of high-entropy alloys with ultra-high microhardness and unexpected low density. Mater. Des. 2024, 238, 112634. [Google Scholar] [CrossRef]
  66. He, J.; Li, Z.; Zhao, P.; Zhang, H.; Zhang, F.; Wang, L.; Cheng, X. Machine learning-assisted design of high-entropy alloys with superior mechanical properties. J. Mater. Res. Technol. 2024, 33, 260–286. [Google Scholar] [CrossRef]
  67. Raman, L.; Debnath, A.; Furton, E.; Lin, S.; Krajewski, A.; Ghosh, S.; Liu, N.; Ahn, M.; Poudel, B.; Shang, S.; et al. Data-driven inverse design of monbtivwzr refractory multicomponent alloys: Microstructure and mechanical properties. Mater. Sci. Eng. A 2024, 918, 147475. [Google Scholar] [CrossRef]
  68. Yang, C.; Ren, C.; Jia, Y.; Wang, G.; Li, M.; Lu, W. A machine learning-based alloy design system to facilitate the rational design of high entropy alloys with enhanced hardness. Acta Mater. 2022, 222, 117431. [Google Scholar] [CrossRef]
  69. Xie, E.; Yang, C. Ai design for high entropy alloys: Progress, challenges and future prospects. Metals 2025, 15, 1012. [Google Scholar] [CrossRef]
  70. Zhang, Y.; Wen, C.; Dang, P.; Jiang, X.; Xue, D.; Su, Y. Elemental numerical descriptions to enhance classification and regression model performance for high-entropy alloys. npj Comput. Mater. 2025, 11, 75. [Google Scholar] [CrossRef]
  71. Shen, F.; Yu, L.; Fu, T.; Zhang, Y.; Wang, H.; Cui, K.; Wang, J.; Hussain, S.; Akhtar, N. Effect of the al, cr and b elements on the mechanical properties and oxidation resistance of nb-si based alloys: A review. Appl. Phys. A 2021, 127, 852. [Google Scholar] [CrossRef]
  72. Ni, B.; Glaser, B.; Taheri-Mousavi, S.M. End-to-end prediction and design of additively manufacturable alloys using a generative AlloyGPT model. npj Comput. Mater. 2025, 11, 294. [Google Scholar] [CrossRef]
  73. Walker, N.; Trewartha, A.; Huo, H.; Lee, S.; Cruse, K.; Dagdelen, J.; Dunn, A.; Persson, K.; Ceder, G.; Jain, A. The impact of domain-specific pre-training on ner in materials science. SSRN 2021, 3950755. [Google Scholar] [CrossRef]
  74. Gupta, T.; Zaki, M.; Krishnan, N.A.; Mausam. Matscibert: A materials domain language model for text mining and information extraction. npj Comput. Mater. 2022, 8, 102. [Google Scholar] [CrossRef]
  75. Gupta, T.; Zaki, M.; Khatsuriya, D.; Hira, K.; Krishnan, N.M.A.; Mausam. Discomat: Distantly supervised composition extraction from tables. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 13465–13483. [Google Scholar]
  76. Miret, S.; Krishnan, N.M.A. Are llms ready for real-world materials discovery? arXiv 2024, arXiv:2402.05200. [Google Scholar] [CrossRef]
  77. Uhrin, L.; Huber, S.P.; Yu, J.; Marzari, N.; Pizzi, G.; Talirz, L. Workflows in AiiDA: Engineering a high-throughput, event-based engine for robust and modular computational workflows. Comput. Mater. Sci. 2021, 187, 110086. [Google Scholar] [CrossRef]
  78. Katz, D.S.; Babuji, Y.N.; Woodard, A.; Li, Z.; Clifford, B.; Kumar, R.; Lacinski, L.; Chard, R.; Wozniak, J.M.; Foster, I.; et al. Parsl: Pervasive parallel programming in Python. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’19), Phoenix, AZ, USA, 22–29 June 2019; ACM, 2019; pp. 25–36. [Google Scholar] [CrossRef]
  79. Lee, J.W.; Park, W.B.; Lee, J.H.; Singh, S.P.; Sohn, K.S. A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic XRD powder patterns. Nat. Commun. 2020, 11, 86. [Google Scholar] [CrossRef]
  80. Szymanski, N.J.; Fu, S.; Persson, E.; Ceder, G. Integrated analysis of X-ray diffraction patterns and pair distribution functions for machine-learned phase identification. npj Comput. Mater. 2024, 10, 45. [Google Scholar] [CrossRef]
  81. Vosoughi, A.; Shahnazari, A.; Xi, Y.; Zhang, Z.; Hess, G.; Xu, C.; Abdolrahim, N. OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering. arXiv 2025, arXiv:2507.09155. [Google Scholar] [CrossRef]
  82. Park, W.B.; Chung, J.; Jung, J.; Sohn, K.; Singh, S.P.; Pyo, M.; Shin, N.; Sohn, K.S. Classification of crystal structure using a convolutional neural network. IUCrJ 2017, 4, 486–494. [Google Scholar] [CrossRef]
  83. Samantaray, D.; Mondal, S.; Mishra, A. Nanoscale ordering in ptcu3 nanowires: Low-temperature synthesis and structural characterization of the l12 phase. Appl. Phys. A 2025, 131, 785. [Google Scholar] [CrossRef]
  84. Schober, M.; Schnitzer, R.; Leitner, H. Precipitation evolution in a ti-free and ti-containing stainless maraging steel. Ultramicroscopy 2009, 109, 553–562. [Google Scholar] [CrossRef]
  85. Keerthipalli, T.; Aepuru, R.; Biswas, A. Review on precipitation, intermetallic and strengthening of aluminum alloys. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2023, 237, 833–850. [Google Scholar] [CrossRef]
  86. Li, Y.; Li, T.; Tang, L.; Ma, S.; Wu, Q.; Gupta, P.; Bauchy, M. Convfeatnet ensemble: Integrating microstructure and pre-defined features for enhanced prediction of porous material properties. Mater. Sci. Eng. A 2025, 931, 148173. [Google Scholar] [CrossRef]
  87. Salgado, J.E.; Lerman, S.; Du, Z.; Xu, C.; Abdolrahim, N. Automated classification of big X-ray diffraction data using deep learning models. npj Comput. Mater. 2023, 9, 214. [Google Scholar] [CrossRef]
  88. Karpas, E.; Abend, O.; Belinkov, Y.; Lenz, B.; Lieber, O.; Ratner, N.; Shoham, Y.; Bata, H.; Levine, Y.; Leyton-Brown, K.; et al. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv 2022, arXiv:2205.00445. [Google Scholar] [CrossRef]
  89. Foppiano, L.; Lambard, G.; Amagasa, T.; Ishii, M. Mining experimental data from materials science literature with large language models: An evaluation study. Sci. Technol. Adv. Mater. Methods 2024, 4, 2356506. [Google Scholar] [CrossRef]
  90. Walczak, M.; Nowak, W.J.; Okuniewski, W.; Chocyk, D. Effect of adding molybdenum on the microstructure and corrosion resistance of alcocrfenimo0.25 high-entropy alloy. Materials 2025, 18, 4566. [Google Scholar] [CrossRef] [PubMed]
  91. Huang, Z.-Q.; Chu, S.; Guo, Y.; Ouyang, J.; Ge, X.-W.; Zhang, Z.-J.; Wu, Y.-Y.; Li, C. Corrosion resistance prediction in high-entropy alloys and its application via a cpsp framework with mat-nrkg. npj Mater. Degrad. 2025, 9, 81. [Google Scholar] [CrossRef]
  92. Feng, R.; Zhang, C.; Gao, M.C.; Pei, Z.; Zhang, F.; Chen, Y.; Ma, D.; An, K.; Poplawsky, J.D.; Ouyang, L. High-throughput design of high-performance lightweight high-entropy alloys. Nat. Commun. 2021, 12, 4329. [Google Scholar] [CrossRef]
  93. Singh, M.; Barr, E.; Aidhy, D. Consolidated database of high entropy materials (COD’HEM): An open online database of high entropy materials. Comput. Mater. Sci. 2025, 248, 113588. [Google Scholar] [CrossRef]
  94. Pei, Z.; Yin, J.; Zhang, J. Language models for materials discovery and sustainability: Progress, challenges, and opportunities. Prog. Mater. Sci. 2025, 154, 101495. [Google Scholar] [CrossRef]
  95. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  96. Chelli, M.; Descamps, J.; Lavoué, V.; Trojani, C.; Azar, M.; Deckert, M.; Ruetsch-Chelli, C. Hallucination rates and reference accuracy of ChatGPT and bard for systematic reviews: Comparative analysis. J. Med. Internet Res. 2024, 26, e53164. [Google Scholar] [CrossRef]
  97. Walters, W.H.; Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 2023, 13, 14045. [Google Scholar] [CrossRef]
  98. Orduña-Malea, E.; Cabezas-Clavijo, Á. ChatGPT and the potential growing of ghost bibliographic references. Scientometrics 2023, 128, 5351–5355. [Google Scholar] [CrossRef]
  99. Li, X.; Li, Z.; Chen, C.; Ren, Z.; Wang, C.; Liu, X.; Zhang, Q.; Chen, S. Calphad as a powerful technique for design and fabrication of thermoelectric materials. J. Mater. Chem. A 2021, 9, 6634–6649. [Google Scholar] [CrossRef]
  100. Emsley, R. Chatgpt: These are not hallucinations—they’re fabrications and falsifications. Schizophrenia 2023, 9, 52. [Google Scholar] [CrossRef]
  101. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  102. Zheng, D.; Lapata, M.; Pan, J.Z. Large language models as reliable knowledge bases? arXiv 2024, arXiv:2407.13578. [Google Scholar] [CrossRef]
  103. Manakul, P.; Liusie, A.; Gales, M. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 9004–9017. [Google Scholar]
  104. Zhang, T.; Qiu, L.; Guo, Q.; Deng, C.; Zhang, Y.; Zhang, Z.; Zhou, C.; Wang, X.; Fu, L. Enhancing uncertainty-based hallucination detection with stronger focus. arXiv 2023, arXiv:2311.13230. [Google Scholar] [CrossRef]
  105. Chen, C.; Liu, K.; Chen, Z.; Gu, Y.; Wu, Y.; Tao, M.; Fu, Z.; Ye, J. Inside: Llms’ internal states retain the power of hallucination detection. arXiv 2024, arXiv:2402.03744. [Google Scholar] [CrossRef]
  106. Su, W.; Wang, C.; Ai, Q.; Hu, Y.; Wu, Z.; Zhou, Y.; Liu, Y. Unsupervised real-time hallucination detection based on the internal states of large language models. arXiv 2024, arXiv:2403.06448. [Google Scholar] [CrossRef]
  107. Liang, T.; He, Z.; Jiao, W.; Wang, X.; Wang, Y.; Wang, R.; Yang, Y.; Shi, S.; Tu, Z. Encouraging divergent thinking in large language models through multiagent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 17889–17904. [Google Scholar]
  108. Moro, V.; Loh, C.; Dangovski, R.; Ghorashi, A.; Ma, A.; Chen, Z.; Kim, S.; Lu, P.Y.; Christensen, T.; Soljačić, M. Multimodal learning for materials. arXiv 2023, arXiv:2312.00111. [Google Scholar]
  109. Katzer, B.; Steffen, K.; Katrin, S. Towards an automated workflow in materials science for combining multi-modal simulation and experimental information using data mining and large language models. Mater. Today Commun. 2025, 45, 112186. [Google Scholar] [CrossRef]
  110. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 19730–19742. Available online: https://proceedings.mlr.press/v202/li23q.html (accessed on 21 January 2026).
  111. Irwin, R.; Dimitriadis, S.; He, J.; Bjerrum, E.J. Chemformer: A pre-trained transformer for computational chemistry. Mach. Learn. Sci. Technol. 2022, 3, 015022. [Google Scholar] [CrossRef]
  112. Patterson, D.; Gonzalez, J.; Le, Q.; Liang, C.; Munguia, L.M.; Rothchild, D.; Dean, J. Carbon emissions and large neural network training. arXiv 2021, arXiv:2104.10350. [Google Scholar] [CrossRef]
  113. Brandon, W.; Mishra, M.; Nrusimha, A.; Panda, R.; Ragan-Kelley, J. Reducing transformer key-value cache size with cross-layer attention. Adv. Neural Inf. Process. Syst. 2024, 37, 86927–86957. [Google Scholar]
  114. Canty, R.B.; Bennett, J.A.; Brown, K.A.; Buonassisi, T.; Kalinin, S.V.; Kitchin, J.R.; Maruyama, B.; Moore, R.G.; Schrier, J.; Seifrid, M.; et al. Science acceleration and accessibility with self-driving labs. Nat. Commun. 2025, 16, 3856. [Google Scholar] [CrossRef] [PubMed]
  115. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), Virtual Event Canada, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
  116. Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, Á. Human-in-the-loop machine learning: A state of the art. Artif. Intell. Rev. 2023, 56, 3005–3054. [Google Scholar] [CrossRef]
  117. Zaki, M.; Jayadeva; Mausam; Krishnan, N.M.A. Mascqa: Investigating materials science knowledge of large language models. Digit. Discov. 2024, 3, 313–327. [Google Scholar] [CrossRef]
  118. Song, Y.; Miret, S.; Liu, B. Matsci-nlp: Evaluating scientific language models on materials science language tasks using text-to-schema modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
  119. Häse, F.; Roch, L.M.; Aspuru-Guzik, A. Next-generation experimentation with self-driving laboratories. Trends Chem. 2019, 1, 282–291. [Google Scholar] [CrossRef]
  120. Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous chemical research with large language models. Nature 2023, 624, 570–578. [Google Scholar] [CrossRef] [PubMed]
  121. A Bennett, J.; Abolhasani, M. Autonomous chemical science and engineering enabled by self-driving laboratories. Curr. Opin. Chem. Eng. 2022, 36, 100831. [Google Scholar] [CrossRef]
  122. Davis, E. Using a Large Language Model to Generate Program Mutations for a Genetic Algorithm to Search for Solutions to Combinatorial Problems: Review of (Romera-Paredes et al.). 2023. Available online: https://cs.nyu.edu/ (accessed on 21 January 2026).
  123. Ridnik, T.; Kredo, D.; Friedman, I. Code generation with alphacodium: From prompt engineering to flow engineering. arXiv 2024, arXiv:2401.08500. [Google Scholar] [CrossRef]
  124. Coja-Oghlan, A.; Loick, P.; Mezei, B.F.; Sorkin, G.B. The ising antiferromagnet and max cut on random regular graphs. arXiv 2020, arXiv:2009.10483. [Google Scholar] [CrossRef]
  125. Völker, C.; Rug, T.; Jablonka, K.M.; Kruschwitz, S. Llms Can Design Sustainable Concrete—A Systematic Benchmark (Resubmitted Version). 2024. Available online: https://www.researchgate.net/publication/377722231_LLMs_can_Design_Sustainable_Concrete_-a_Systematic_Benchmark_re-submitted_version (accessed on 21 January 2026).
  126. Zhao, S.; Chen, S.; Zhou, J.; Li, C.; Tang, T.; Harris, S.J.; Liu, Y.; Wan, J.; Li, X. Potential to transform words to watts with large language models in battery research. Cell Rep. Phys. Sci. 2024, 5, 101844. [Google Scholar] [CrossRef]
  127. Lei, G.; Docherty, R.; Cooper, S.J. Materials science in the era of large language models: A perspective. Digit. Discov. 2024, 3, 1257–1272. [Google Scholar] [CrossRef]
  128. Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Balog, M.; Kumar, M.P.; Dupont, E.; Ruiz, F.J.R.; Ellenberg, J.S.; Wang, P.; Kohli, P.; et al. Mathematical discoveries from program search with large language models. Nature 2024, 625, 468–475. [Google Scholar] [CrossRef]
  129. Blaiszik, B.; Ward, L.; Schwarting, M.; Gaff, J.; Chard, R.; Pike, D.; Chard, K.; Foster, I. The materials data facility: Data services to advance materials science research. JOM 2016, 68, 2045–2052. [Google Scholar] [CrossRef]
  130. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; Silva Santos, L.B.; Bourne, P.E.; et al. The fair guiding principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
  131. Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking materials property prediction methods: The matbench test set and the automatminer reference algorithm. npj Comput. Mater. 2020, 6, 138. [Google Scholar] [CrossRef]
  132. Alampara, N.; Schilling-Wilhelmi, M.; Ríos-García, M.; Mandal, I.; Khetarpal, P.; Grover, H.S.; Krishnan, N.M.A.; Jablonka, K.M. Probing the limitations of multimodal language models for chemistry and materials research. Nat. Comput. Sci. 2025, 5, 952–961. [Google Scholar] [CrossRef]
  133. Peters, U.; Chin-Yee, B. Generalization bias in large language model summarization of scientific research. R. Soc. Open Sci. 2025, 12, 241776. [Google Scholar] [CrossRef]
  134. Jain, A.; Ong, S.P.; Hautier, G.; Chen, W.; Richards, W.D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002. [Google Scholar] [CrossRef]
  135. Miret, S.; Lee, K.L.K.; Gonzales, C.; Nassar, M.; Spellings, M. The open matsci ml toolkit: A flexible framework for machine learning in materials science. arXiv 2022, arXiv:2210.17484. [Google Scholar] [CrossRef]
  136. Frey, N.C.; Soklaski, R.; Axelrod, S.; Samsi, S.; Gómez-Bombarelli, R.; Coley, C.W.; Gadepally, V. Neural scaling of deep chemical models. Nat. Mach. Intell. 2023, 5, 1297–1305. [Google Scholar] [CrossRef]
  137. Kim, E.; Huang, K.; Saunders, A.; McCallum, A.; Ceder, G.; Olivetti, E. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 2017, 29, 9436–9444. [Google Scholar] [CrossRef]
  138. Beltagy, I.; Lo, K.; Cohan, A. Scibert: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3615–3620. [Google Scholar]
  139. Chen, M. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  140. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.08589. [Google Scholar]
  141. Masry, A.; Long, D.X.; Tan, J.Q.; Joty, S.; Hoque, E. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Proceedings of the Findings of ACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 2263–2279. [Google Scholar]
  142. Li, S.; Tajbakhsh, N. Scigraphqa: A large-scale synthetic multi-turn question answering dataset for scientific graphs. arXiv 2023, arXiv:2308.03349. [Google Scholar]
  143. Abu-Odeh, A.; Galvan, E.; Kirk, T.; Mao, H.; Chen, Q.; Mason, P.; Malak, R.; Arroyave, R. Exploration of the high entropy alloy space as a constraint satisfaction problem. Acta Mater. 2018, 164, 1–11. [Google Scholar]
  144. Lookman, T.; Balachandran, P.V.; Xue, D.; Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 2019, 5, 21. [Google Scholar] [CrossRef]
  145. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning 2017, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 1321–1330, PMLR (2017). [Google Scholar]
  146. Khajondetchairit, P.; Somdee, S.; Saelee, T.; Ektarawong, A.; Alling, B.; Praserthdam, P.; Rittiruam, M.; Praserthdam, S. Machine learning-accelerated density functional theory optimization of PtPd-based high-entropy alloys for hydrogen evolution catalysis. Int. J. Miner. Metall. Mater. 2025, 32, 2777–2785. [Google Scholar] [CrossRef]
  147. Fei, Y.; Rendy, B.; Kumar, R.; Dartsi, O.; Sahasrabuddhe, H.P.; McDermott, M.J.; Wang, Z.; Szymanski, N.J.; Walters, L.N.; Milsted, D.; et al. Alabos: A python-based reconfigurable workflow management framework for autonomous laboratories. Digit. Discov. 2024, 3, 2275–2289. [Google Scholar] [CrossRef]
Figure 1. Schematic illustration of commonly discussed “core effects” in high-entropy alloys (HEAs) and their qualitative links to macroscopic properties. Reprinted with permission from Ref. [11], Copyright 2025, Springer.
Figure 2. Interplay between high-entropy alloys (HEAs) and machine learning (ML). (a) Ultimate tensile strength versus ultimate elongation at room temperature for representative HEAs, with steels shown for reference; the FeNiAlTi medium-entropy alloy and a representative transformation-induced-plasticity (TRIP) HEA are highlighted. Data and plot reprinted with permission from Ref. [18]. Copyright 2021, Springer. (b) Annual publication counts retrieved from ScienceDirect (Materials Science subject area), Nature Portfolio journals, and Physical Review journals using the keywords “machine learning materials” and “machine learning high-entropy alloys” (search date: 26 July 2022). (c) Timeline of representative high-performance HEAs. Reprinted with permission from Ref. [21]. Copyright 2023, Elsevier.
Figure 3. Electrochemical corrosion behavior of AlCoCrFeNi-based high-entropy alloys. (a) Potentiodynamic polarization curves of AlCoCrFeNiZrx HEA coatings measured together with a carbon-steel comparator in a 3.5 wt.% NaCl solution. (b) Polarization curves of AlCoCrFeNi–X HEAs measured in 1 M NaCl solution. (c) Polarization curves of AlCoCrFeNi–X HEAs measured in 0.5 M H2SO4 solution. (d) Polarization curves of Al0.25CoCrFeNi HEA measured in 0.5 M H2SO4 solution. (e) Polarization curves of Al0.25CoCrFeNiPt0.1 HEA measured in 0.5 M H2SO4 solution. The heterogeneous electrolytes and protocols shown here illustrate why reliable cross-study synthesis requires extracting and normalizing test metadata rather than directly pooling reported metrics. Reprinted with permission from Ref. [11]. Copyright 2025, Springer.
Figure 4. Schematic of laser cladding and laser remelting. (a) Laser cladding with coaxial powder feeding. (b) Laser remelting scan strategy; red arrows indicate remelting tracks, and the remelting direction is set perpendicular to the thermal-spraying direction to improve surface uniformity (as reported in the source). Adapted from Ref. [31].
Figure 5. Schematic illustrating solidification mechanisms in AlCoCrFeNi2.1 eutectic HEA as a function of solidification velocity (V) and temperature gradient (G). The diagram summarizes the morphological transition of the FCC phase (planar → cellular → dendritic) and the coupled-growth regime that yields fine lamellar eutectics, which are relevant to mechanical-property tuning. Reprinted with permission from Ref. [34]. Copyright 2025, Elsevier.
Figure 6. Data-driven workflow for mapping elastic properties of high-entropy alloys. Quaternary compositions (selected from 14 elements) are evaluated by high-throughput EMTO–CPA calculations to generate elastic-property data; a Deep Sets model is then trained on the computed dataset, and association-rule mining is applied to the model outputs to extract and visualize composition–property trends as network graphs. Adapted from Ref. [36].
Figure 7. A schematic RDF (Resource Description Framework) schema of a materials knowledge graph (e.g., MatKG), illustrating how entities and relationships central to HEA research (compositions, processes, properties) can be structured for LLM-enabled knowledge extraction. The graph links alloys to their constituents, synthesis methods, characterized properties, and cited publications, enabling complex relational queries. Adapted from Ref. [45].
Figure 8. Database enhancement pipeline: transformer-based approach and traditional approach. Adapted from Ref. [49].
Figure 9. An example of data-driven prediction for (a) phase diagrams and (b) crystal structures, representing the type of multimodal information (thermodynamic, structural) that an HEA-focused multimodal LLM would need to interpret and integrate. The figure shows how machine learning models can predict stable phases across composition space and suggest likely crystal structures, tasks requiring joint reasoning of composition, energy, and symmetry. Adapted from Ref. [53].
Figure 10. SEM-EDS elemental mapping of an as-cast AlCoCrFeNi2 HEA, demonstrating elemental partitioning (Al/Ni vs. Co/Cr/Fe) and the resultant two-phase microstructure—an example of the microstructural data that an LLM-assisted system would extract and link to processing conditions. The figure highlights the type of visual information that must be connected to textual descriptors (e.g., “dendritic,” “interdendritic,” “B2 phase”) by a multimodal system. Reprinted with permission from Ref. [61]. Copyright 2025, Elsevier.
Figure 11. Typical workflow of ML-assisted HEA design, including data collection, preprocessing (feature engineering), model training, and validation/analysis in an iterative loop where new experiments/simulations update the dataset. Adapted from Ref. [63].
Figure 12. Strategies for generating element numerical descriptors for task-specific modeling. (a) Combining compositions with elemental descriptors to construct feature vectors. (b) Conventional descriptors (e.g., atomic radius and valence electron concentration) sample only a small region of the candidate descriptor space. (c) An example of using a genetic-algorithm-based framework to search a larger descriptor space and generate improved descriptors for a target task. Adapted from Ref. [69].
Figure 13. AlloyGPT framework for forward prediction of composition–structure–property relationships. (A) Formulating prediction as a sequence-completion task. (B,C) Example comparisons between predicted values and ground truth on test sets (as reported in the source). (D) Example analysis of prediction performance versus prompt length, illustrating the effect of including richer context (e.g., composition + structure). Adapted from Ref. [72].
Figure 14. Overview of requirements for a materials-focused LLM system, including knowledge acquisition, understanding of structure–property–behavior, synthesis/analysis procedures, and interfaces for human–machine and machine–machine interaction. The schematic highlights an LLM-centered architecture connecting knowledge sources and action modules. Adapted from Ref. [76].
Figure 15. Schematic of a ChartAdapter-style module integrated with a vision–language model for interpreting experimental plots (e.g., stress–strain curves). A chart encoder extracts visual features that are fused with the LLM through a cross-modal interaction block, enabling plot description and structured data extraction from scientific charts. Adapted from Ref. [43].
Figure 16. Illustration of data quality and baseline-model limitations in a consolidated high-entropy alloy database (e.g., CODHEM): (a) Comparison between experimentally measured elastic moduli and values predicted by a rule of mixtures (ROM). Single-phase alloys follow the trend; multiphase alloys scatter widely. (b) Comparison between experimental yield strengths and ROM estimates, showing near-random agreement, highlighting the dominant role of microstructure. Reprinted with permission from Ref. [93]. Copyright 2024, Materials Science and Engineering: A.
Figure 17. Schematic of the Chemformer pretraining-and-fine-tuning paradigm for chemistry: an encoder–decoder transformer is pre-trained on large-scale molecular strings and then fine-tuned for downstream tasks such as reaction prediction, molecular optimization, and property prediction. This illustrates how domain-specific pretraining can enable strong task performance with smaller, specialized models. Adapted from Ref. [111].
Figure 18. Schematic overview of a self-driving laboratory operating through iterative cycles of experiment design, material preparation, characterization and learning. The LLM acts as the “brain,” proposing experiments based on literature and prior results, which are executed by robotic systems, with data flowing back to update the model and guide the next cycle. Adapted from Ref. [54].
Figure 19. Human-in-the-loop machine learning (HITL-ML-relations) mind map. Adapted from Ref. [116].
Figure 20. Schematic of a FunSearch-style meta-optimization loop for constrained design. Given a problem specification and a set of high-performing example programs (e.g., scoring functions), an LLM generates candidate heuristics that are evaluated by an external deterministic scorer (e.g., a surrogate model, CALPHAD/DFT proxy, or database lookup). Top-performing candidates are retained and used to update the prompt for the next iteration, enabling iterative refinement under the chosen evaluator and constraints. Adapted from Ref. [127].
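The loop in Figure 20 has a simple control flow: keep an archive of scored programs, prompt the LLM with the current best exemplars, score the new proposal deterministically, and repeat. The sketch below assumes two hypothetical helpers, llm_generate_program (an LLM call returning candidate heuristic code) and deterministic_score (a surrogate, CALPHAD/DFT proxy, or database lookup); both are stubbed here so the skeleton runs.

```python
import heapq
import random

def llm_generate_program(prompt: str) -> str:
    # Stub for an LLM call that proposes a new scoring heuristic as code.
    return f"# candidate heuristic derived from prompt of length {len(prompt)}"

def deterministic_score(program: str) -> float:
    # Stub for an external, reproducible evaluator of the proposed heuristic.
    return random.random()

def funsearch_loop(spec: str, seed_programs: list[str], n_rounds: int = 20, top_k: int = 5):
    archive = [(deterministic_score(p), p) for p in seed_programs]   # (score, program) pairs
    for _ in range(n_rounds):
        exemplars = heapq.nlargest(top_k, archive)                   # retain top performers
        prompt = spec + "\n\n" + "\n\n".join(p for _, p in exemplars)
        candidate = llm_generate_program(prompt)                     # LLM proposes a new heuristic
        archive.append((deterministic_score(candidate), candidate))  # external evaluation only
    return max(archive)                                              # best (score, program) found

print(funsearch_loop("maximize predicted yield strength at density <= 8 g/cm^3", ["# seed heuristic"]))
```

The key design choice, reflected in the figure, is that the LLM never scores its own output; all selection pressure comes from the external evaluator.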
Figure 21. Benchmark evaluation of frontier vision–language models (VLMs) on materials-science tasks (MaCBench). (a) Performance above random baselines across task categories (e.g., data extraction, experimental understanding, and interpretation), illustrating category-dependent difficulty. (b) Radar-plot comparison across topic areas (e.g., polymers, catalysis, batteries), highlighting variability in model performance and domain-specific weaknesses. Adapted from Ref. [132].
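For readers unfamiliar with "performance above a random baseline" as plotted in panel (a), one common normalization rescales accuracy so that random guessing maps to zero and a perfect score maps to one. The snippet below shows that generic convention; it is not necessarily the exact normalization used by MaCBench.

```python
# Generic above-random normalization for a classification-style benchmark task.
def above_random(accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so 0.0 corresponds to random guessing and 1.0 to a perfect score."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)

# Usage example: 65% accuracy on a 4-way multiple-choice task (random baseline 25%).
print(round(above_random(accuracy=0.65, random_baseline=0.25), 3))  # 0.533
```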
Table 1. Functional and role comparison between traditional machine learning (ML) and large language models (LLMs) in HEA workflows.
| Dimension | ML | LLMs | Synergy |
| --- | --- | --- | --- |
| Data | Structured/numerical | Unstructured/multimodal | Text → structured data |
| Capability | Pattern recognition, regression | Semantic reasoning, generation | Feature optimization, workflow automation |
| Knowledge | Explicit patterns in data | Implicit logic in the literature | Provides prior knowledge for physical plausibility |
| Interaction | Code, parameters | Natural language | Natural-language interfaces to invoke/configure ML models |
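The "Text → structured data" synergy in Table 1 is easiest to see as a schema-constrained extraction step. The sketch below is illustrative only: the record fields are an assumed minimal schema, and call_llm is a hypothetical stand-in (stubbed with a hard-coded record) for whichever model API a real pipeline would use.

```python
import json
from dataclasses import dataclass

@dataclass
class HEARecord:
    composition: str      # e.g., "Al0.3CoCrFeNi"
    process: str          # e.g., "as-cast" or "homogenized 1200 C / 24 h"
    property_name: str    # e.g., "yield strength"
    value: float
    unit: str
    source_doi: str

PROMPT = ("Extract every alloy property measurement from the passage as a JSON list of objects "
          "with keys: composition, process, property_name, value, unit, source_doi.\n\n{passage}")

def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM API call; returns one hard-coded dummy record.
    return ('[{"composition": "Al0.3CoCrFeNi", "process": "as-cast", '
            '"property_name": "yield strength", "value": 275.0, "unit": "MPa", '
            '"source_doi": "10.0000/example"}]')

def extract_records(passage: str) -> list[HEARecord]:
    raw = call_llm(PROMPT.format(passage=passage))
    records = [HEARecord(**item) for item in json.loads(raw)]
    return [r for r in records if r.value > 0 and r.unit]  # minimal validation gate

print(extract_records("...passage text...")[0])
```

Records that pass such a validation gate can then feed the structured/numerical pipelines in the ML column of the table.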
Table 2. Timeline of recent (2024–2025) transformer and LLM applications in high-entropy alloy (HEA) research.
| Year | Work | Task/Scope | Data | Originality/HEA Relevance | Scientific Integrity/Reproducibility | Key Limitations |
| --- | --- | --- | --- | --- | --- | --- |
| 2024 | Kamnis, S.—“Introducing pre-trained transformers for high entropy alloy informatics” [56] | Transfer-learning property prediction (theory → experiment) | Thermodynamic unlabeled pretraining; experimental HEA fine-tuning | HEA-oriented pretrain–fine-tune framing; transfer across data sources | Needs leakage-safe splits, OOD generalization reporting, and artifact release | Supervised only (non-agentic); limited failure-mode/condition-mismatch analysis |
| 2025 | Kamnis, S.—“HEA property predictions using a transformer-based language model” (journal) [57] | Peer-reviewed transfer learning with added interpretability | Synthetic pretraining; HEA fine-tuning; k-fold validation | Journal-validated extension; broader analysis | Requires explicit uncertainty quantification, stricter out-of-distribution testing, and full artifact release for reproducibility | Non-tool-using; sensitive to dataset bias and metadata gaps |
| 2025 | Zhen, S.—“AI-Driven Design of HEAs for H2 electrocatalysis” (ChemRxiv) [58] | LLM-assisted literature curation + screening workflow | Literature-mined database + screening pipeline | End-to-end LLM support for a curation-driven HEA workflow | Requires prompts and a corpus snapshot, auditable inclusion/exclusion criteria, and pipeline artifacts | LLM mainly used for curation/organization; susceptible to literature-selection bias |
| 2024 | Luo, M.—“Robotic AI chemist for HEA nanozymes” (ChemRxiv) [59] | Closed-loop autonomy; LLM-in-the-loop (GPT-4) + Bayesian optimization | Robotic synthesis/testing + optimization loop | Agentic LLM integrated into autonomous HEA discovery | Needs prompt/guardrail transparency, logged decision traces, failure-case reporting, and transfer/holdout tests | Preprint; task/assay-specific; limited evidence of cross-system generalization |
| 2024 | Choudhary, K.—“AtomGPT” (arXiv) [60] | General materials GPT (prediction + generation) | Text/structure/property data; DFT validation | Foundation-model direction; potentially adaptable to HEA tasks | For HEA use: needs HEA benchmarks, clear splits/baselines, and reproducible fine-tuning with in-domain vs. out-of-domain performance reporting | Not HEA-specific; HEA performance uncertain without domain adaptation |
Table 3. Synthesis of core challenges, LLM-enabled opportunities, and priority actions for the integration of large language models into high-entropy alloy research.
| Core Dimension | Key Challenges/Status Quo | LLM-Enabled Opportunities/Core Capabilities | Priority Actions/Considerations |
| --- | --- | --- | --- |
| Knowledge Management | Experimental data and theoretical insights are fragmented across a rapidly growing, unstructured literature. | Automated mining and synthesis of text to extract structured facts and build queryable knowledge graphs, integrating disparate findings. | Develop community-shared, high-quality HEA knowledge bases with standardized ontologies and robust entity-linking tools. |
| Design Paradigm | The composition–process–property relationship is high-dimensional, non-linear, and expensive to explore experimentally. | Data-driven prediction and generative inverse design, leveraging transfer learning and sequence modeling to propose plausible candidates. | Mandate rigorous out-of-distribution evaluation, uncertainty quantification, and hard physical-constraint enforcement for all generated suggestions. |
| Research Workflow | Multiscale simulation and experimental loops are often manual, siloed, and lack reproducibility. | Intelligent orchestration and agency, automating workflows from literature to simulation to experiment within a unified interface. | Create reproducible, human-in-the-loop LLM-agent frameworks with open toolchains and explicit audit trails for full transparency. |
| Evaluation Standard | A lack of trusted, domain-specific benchmarks for assessing LLM utility in materials discovery. | A unified interface for multi-ability evaluation (extraction, reasoning, generation, explanation) under scientifically meaningful metrics. | Establish community-adopted benchmarks that test multimodal understanding, causal reasoning, and real-world experimental validation hit rates. |
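The "experimental validation hit rate" named in the Evaluation Standard row of Table 3 is simple to operationalize: the fraction of model-proposed candidates that meet their design target once synthesized and tested. The sketch below is a minimal illustration; the record fields, the "measured >= target" success criterion, and the numbers in the example campaign are assumptions for demonstration, not reported results.

```python
def hit_rate(results: list[dict]) -> float:
    """Fraction of tested candidates whose measured property meets its design target."""
    tested = [r for r in results if r["measured"] is not None]
    hits = [r for r in tested if r["measured"] >= r["target"]]
    return len(hits) / len(tested) if tested else float("nan")

# Dummy campaign records for illustration only.
campaign = [
    {"candidate": "Al0.2CoCrFeNi", "target": 350.0, "measured": 362.0},
    {"candidate": "Al0.5CoCrFeNi", "target": 350.0, "measured": 318.0},
    {"candidate": "CoCrFeMnNi",    "target": 350.0, "measured": None},  # not yet tested
]
print(hit_rate(campaign))  # 0.5: one of the two tested candidates met its target
```

Reporting such a metric alongside the number of untested proposals keeps benchmark claims tied to real-world validation rather than in silico scores alone.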
