1. Introduction
Sodium-ion batteries (SIBs) are increasingly regarded as a compelling complement to lithium-ion batteries for large-scale energy storage because sodium resources are more geographically abundant and potentially lower-cost, while the chemistry remains broadly compatible with existing battery manufacturing infrastructure [1]. Among candidate anodes, hard carbon has emerged as the leading choice due to its reversible capacity, moderate operating potential, and tolerance to diverse precursor chemistries [2]. In particular, biomass-derived hard carbon offers an attractive pathway toward sustainable and scalable electrode production by leveraging renewable precursors and waste streams while enabling tunable carbon architectures through controllable thermochemical processing [3]. Beyond precursor selection, heteroatom doping has been widely explored to modulate defect chemistry and local electronic structure, thereby influencing sodium storage behavior; sulfur doping is of special interest because it can introduce polar sites and expanded local environments that may facilitate ion adsorption and/or transport in disordered carbons [4,5]. In principle, the sodium-storage behavior of hard carbon is governed by its multiscale structure, including short-range disorder and surface functionalities, interlayer environments, and the distribution of open and closed porosity that collectively regulate adsorption, insertion, and low-voltage plateau storage [6]. Early work by Stevens and Dahn established key mechanistic foundations for sodium and lithium insertion in carbon materials, including the limited sodium uptake in graphite and the roles of disordered interlayer environments and nanopore filling in hard carbons [7,8]. At the same time, this structure-rooted understanding does not imply that any single commonly reported descriptor can fully capture the sodium-hosting environments relevant to performance across different hard-carbon systems.
However, the mechanistic interpretation of sodium storage in hard carbon remains an active topic of discussion, and different materials or testing conditions can emphasize different storage modes. Consequently, commonly reported descriptors, such as d002 from XRD, Raman ID/IG, BET surface area, and pore volume, are valuable for describing carbon structure but do not always provide consistent, transferable correlations with electrochemical performance across literature datasets, partly because they are incomplete proxies for the relevant sodium-hosting motifs and can be sensitive to experimental protocols and reporting heterogeneity [6,9,10]. This limitation becomes more pronounced for biomass-derived and heteroatom-doped hard carbons, where precursor chemistry and processing history can induce coupled changes in microstructure, surface chemistry, and pore architecture that are not fully captured by a small set of standard metrics [11,12,13]. From a materials-formation perspective, precursor identity and processing conditions are not merely metadata; they act as upstream determinants of carbonization chemistry and therefore constrain the space of attainable microstructures and surface chemistries [13,14,15].
Beyond the conceptual limitations of individual structural proxies, the evidence base for biomass-derived and heteroatom-doped hard carbons remains inherently fragmented across the literature. Reported capacities are obtained under diverse testing conditions, while key structural and surface-chemistry characterizations are often inconsistently reported, unavailable, or not directly comparable among studies. As a result, transferable precursor–process–performance relationships are difficult to extract through manual comparison alone. In this context, data-driven modeling, combined with cross-validation and controlled feature-set ablation, provides a complementary route to (i) assess how much performance variation can be captured by accessible upstream information, such as precursor identity and processing conditions, and (ii) identify which descriptors provide the most practically useful predictive signal within a heterogeneous literature-derived dataset [13,16,17].
At the same time, recent studies on biomass-derived hard carbon have increasingly emphasized that precursor chemistry is not a trivial background variable, but a major determinant of carbonization behavior, aromatic condensation, defect evolution, and pore development. In particular, the relative abundance and coupling of lignin, polysaccharides, extractives, and mineral-associated components can strongly influence the structural pathway through which hard carbon forms. These studies suggest that precursor information may contain practically useful upstream signals, even when downstream structural characterization is sparse or non-uniformly reported across the literature.
This perspective motivates the use of domain-knowledge-guided precursor descriptors, i.e., structured and human-interpretable encodings of precursor chemistry and biopolymer architecture, such as biomass class, lignin-related grades, and qualitative indicators of crosslinking or structural rigidity, as complementary inputs to conventional process and structural variables [13]. Importantly, these descriptors are not intended to replace structural characterization. Rather, they provide an explicit representation of formation-pathway information that can partially account for structural variability when measured structural parameters are sparse, noisy, or not directly comparable across literature sources, thereby improving predictive modeling and supporting screening-oriented workflows for biomass-derived and heteroatom-doped hard carbons.
Against this background, the central question of this study is not whether structure matters for sodium storage—which is already well established—but whether precursor- and process-governed formation pathways can be represented in a manner that meaningfully complements conventional structural descriptors in heterogeneous literature-derived settings. This question is particularly relevant for sulfur-containing biomass-derived hard carbons, where reported structural descriptors are often incomplete or non-uniform across studies, whereas precursor identity and processing history are more consistently accessible but remain underused in predictive modeling. To address this gap, we curated a focused literature-derived dataset and established a controlled feature-set comparison framework that systematically evaluates the predictive value of precursor descriptors, process variables, structural descriptors, and their combinations. We further introduce domain-knowledge-guided precursor descriptors as interpretable encodings of biomass chemistry and architecture, with the aim of testing whether accessible upstream information can support low-characterization, screening-oriented prediction while remaining consistent with the structure-rooted nature of sodium storage.
2. Materials and Methods
2.1. Data Acquisition and Curation
The overall workflow of this study is summarized in Figure 1, including literature collection, data curation, feature-block construction, machine-learning training, and model interpretation. A broad literature search was conducted in Web of Science to identify biomass-related hard-carbon studies relevant to sodium-ion battery anodes. The initial search used biomass- and hard-carbon-related keywords and yielded 131 records. This search was intentionally broader than the final dataset scope and was not restricted to sulfur-related keywords at the initial stage, because sulfur-containing systems are often embedded within the wider biomass-derived hard-carbon literature. The retrieved records were then manually screened, and studies were retained only if they reported sulfur-containing biomass-derived hard carbons together with sufficiently extractable precursor information, process information, and electrochemical performance data. After screening, 16 journal articles were included in the final curated dataset [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]. From these articles, we extracted 101 data records, where each record corresponds to a unique combination of precursor identity, processing conditions, and reported electrochemical performance. The complete reference list and the full tabulated dataset are provided in Table S1 in the Supplementary Information (SI).
For each record, we collected three categories of input variables: (i) precursor-related descriptors (biomass chemistry and qualitative architectural indicators), (ii) process descriptors (pretreatment and pyrolysis parameters), and (iii) structural descriptors of the resulting hard carbon when available (e.g., specific surface area (SSA), d002, and ID/IG). In addition, the current density (A g−1) used in electrochemical testing was recorded from each article and treated as an input feature to account for testing-condition dependence in the reported capacity values.
The output variable considered in this study is the specific capacity (mAh g−1) reported in the corresponding article under the stated current density. We did not model initial Coulombic efficiency (ICE) in this study because ICE values were not consistently available and comparable across the selected sources.
Most records in this dataset correspond to sulfur-doped hard carbons, while a small number involve sulfur-containing co-doped systems reported within the same literature space. Accordingly, sulfur incorporation is treated here as a defining characteristic of the dataset rather than as an independently parameterized variable. Because sulfur source, sulfur content, and sulfurization route were not reported in a sufficiently consistent and comparable manner across the selected articles, sulfur-related parameters were not included as model inputs. This design allows the analysis to focus on how precursor- and process-governed formation information, together with available structural descriptors, explains capacity variability within the literature-derived sulfur-containing hard carbon dataset. Accordingly, the resulting models should be interpreted as learning statistical regularities within a sulfur-containing literature space rather than isolating sulfur-independent causal effects. Hidden variation in sulfur-related details may therefore contribute residual heterogeneity rather than being explicitly resolved by the present descriptor set.
2.2. Domain Knowledge–Driven Precursor (And Precursor-Conditioning) Descriptor Construction
A central methodological component of this study is the construction of domain-knowledge-guided descriptors that encode formation-pathway information linking precursor chemistry, biopolymer architecture, and upstream conditioning choices to the development of hard-carbon structure. In biomass-derived carbons, lignin abundance and lignin–carbohydrate network characteristics are widely recognized as influential factors governing carbonization behavior, aromatic condensation, and the evolution of porosity and disorder [13,34,35,36].
To construct a transferable descriptor set across heterogeneous literature sources, we performed a paper-by-paper inspection of the Methods and Experimental sections of the selected articles to identify the reported precursor identity and upstream treatments. When direct compositional measurements were unavailable, coarse-grained but interpretable labels were assigned based on feedstock taxonomy, reported precursor type, and experimental context. These descriptors were designed as consistent proxy encodings rather than precise compositional measurements. A detailed operational rubric and representative assignment examples for these coarse-grained precursor descriptors are provided in the Supplementary Information.
The precursor-related descriptor block comprised four variables. Polysaccharides presence was encoded as a binary variable (0/1) to distinguish carbohydrate-dominant or synthetic carbohydrate precursors from feedstocks in which polysaccharides were not considered a defining component. Lignin level was represented as an ordinal variable (1–5) reflecting the relative lignin richness of the precursor class when direct compositional measurements were not available, with higher values indicating lignin-richer precursor types. Volatile grade was used as an ordinal variable to describe the qualitative abundance of extractive or volatile fractions that may influence carbonization behavior and surface chemistry, with higher values indicating stronger expected contributions from such fractions. Synergy level was introduced as an ordinal variable to capture qualitative differences in precursor-component coupling complexity, rather than as a direct measurement of chemical cross-linking, particularly the co-presence and interaction of lignin with other precursor constituents, which may influence aromatic condensation and the development of disorder and porosity.
In addition to the intrinsic precursor descriptors above, one upstream conditioning descriptor was included as part of the formation pathway. Postleaching category was encoded as a categorical variable to represent whether the precursor or carbon product underwent no post-treatment, water washing, or chemical post-treatment. This variable was included because such conditioning operations may alter ash content, surface chemistry, and accessible porosity before electrochemical testing [37,38].
For model implementation, binary descriptors were encoded as 0/1, ordinal descriptors were represented as ordered integers, and the postleaching category was encoded numerically according to its reported treatment type. For linear-model-based analyses, continuous variables were standardized prior to fitting. Overall, this descriptor system was designed as a parsimonious and interpretable proxy representation of precursor-governed formation information, rather than as a direct substitute for detailed compositional or structural characterization.
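The encoding scheme described above can be illustrated with a minimal sketch. The field names, the integer codes assigned to the postleaching category, and the example values are assumptions made for illustration only; the source states that the category was encoded numerically but does not give the mapping, and only the lignin level has a stated range (1–5).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical integer codes: the paper does not report the actual mapping.
POSTLEACHING_CODES = {"none": 0, "water": 1, "chemical": 2}


@dataclass
class PrecursorRecord:
    polysaccharides: bool  # binary descriptor (0/1)
    lignin_level: int      # ordinal, 1-5 as stated in the text
    volatile_grade: int    # ordinal; range not stated, assumed integer
    synergy_level: int     # ordinal; range not stated, assumed integer
    postleaching: str      # "none", "water", or "chemical"


def encode(record: PrecursorRecord) -> List[int]:
    """Convert one record into a numeric precursor-descriptor vector."""
    if not 1 <= record.lignin_level <= 5:
        raise ValueError("lignin_level must lie in 1..5")
    return [
        int(record.polysaccharides),
        record.lignin_level,
        record.volatile_grade,
        record.synergy_level,
        POSTLEACHING_CODES[record.postleaching],
    ]


# Made-up example record, not taken from the curated dataset.
example = PrecursorRecord(True, 2, 1, 3, "water")
features = encode(example)
```

Treating the postleaching category as a single integer imposes an ordering on the three treatment types; a one-hot encoding would be an alternative where that ordering is not intended.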
2.3. Machine Learning Experiment Design
To disentangle the relative contributions of upstream formation information and measured structural descriptors to capacity prediction, we designed a controlled ablation study with five feature-set configurations: S, structural descriptors only; P, process descriptors only; P + S, combined process and structural descriptors; F + P + S, domain-knowledge-guided precursor descriptors together with process and structural descriptors; and F + P, precursor descriptors together with process descriptors, representing a low-characterization setting. The electrochemical current density was included as an input variable in all feature configurations in order to account for testing-condition dependence in the reported capacity values. A schematic overview of the feature groups and their combinations is provided in Figure 1.
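One way to organize the five configurations is as a mapping from configuration name to column list, as in the sketch below. The column names are hypothetical, and placing the postleaching category in the F block is an assumption based on the descriptor construction section rather than a confirmed detail of the authors' implementation.

```python
# Hypothetical column names for each descriptor block.
BLOCKS = {
    "F": ["polysaccharides", "lignin_level", "volatile_grade",
          "synergy_level", "postleaching"],
    "P": ["step_carbonization", "pyrolysis_temperature",
          "residence_time", "activation"],
    "S": ["ssa", "d002", "id_ig"],
}

# Current density enters every configuration, as stated in the text.
ALWAYS_INCLUDED = ["current_density"]

CONFIGURATIONS = ["S", "P", "P+S", "F+P+S", "F+P"]


def columns_for(configuration):
    """Expand a name such as 'F+P' into its list of input columns."""
    columns = []
    for block in configuration.split("+"):
        columns.extend(BLOCKS[block])
    return columns + ALWAYS_INCLUDED


FEATURE_SETS = {name: columns_for(name) for name in CONFIGURATIONS}
```

Keeping the blocks in one place makes it straightforward to train every model on every configuration with identical cross-validation splits.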
For each feature configuration, a common set of regression models was trained under the same evaluation framework to enable fair comparison across feature sets. The tested model family included linear models (multiple linear regression, ridge, lasso, and elastic net), kernel-based regression (support vector regression), instance-based learning (k-nearest neighbors), and tree-based ensemble methods (decision tree, random forest, AdaBoost, gradient boosting decision trees, XGBoost, and LightGBM). All models were trained to predict capacity only. Because several structural descriptors were not uniformly reported across all literature records, missing numerical values were imputed using mean values as a simple and transparent baseline strategy prior to model training. This choice was made to preserve a consistent record basis across all feature-set comparisons, rather than to imply that mean imputation is universally optimal for small heterogeneous datasets.
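The mean-imputation baseline mentioned above can be sketched as follows. The function name and the SSA values are illustrative assumptions. The source does not state whether the means were computed on the full dataset or within each training fold, so this sketch only shows the column-level operation.

```python
import math


def mean_impute(column):
    """Replace missing entries (None or NaN) with the mean of observed ones."""
    def is_missing(value):
        return value is None or math.isnan(value)

    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in column]


# Made-up SSA values (m2 g-1) with two unreported entries.
ssa = [350.0, None, 120.0, float("nan"), 40.0]
imputed = mean_impute(ssa)
```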
2.4. Model Evaluation
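For all feature configurations, model performance was evaluated using the coefficient of determination (R2) and the mean absolute error (MAE). The R2 score and MAE are defined as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ and $\hat{y}_i$ are the measured and predicted capacities for sample $i$, $\bar{y}$ is the mean measured capacity, and $n$ is the number of samples. Higher R2 and lower MAE indicate better predictive performance [39].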
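For concreteness, the two metrics can be computed directly from their standard definitions, as in the short sketch below. The capacity values are made up for illustration and do not come from the curated dataset.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# Made-up capacities in mAh g-1.
measured = [300.0, 250.0, 400.0, 350.0]
predicted = [310.0, 240.0, 380.0, 360.0]

r2 = r2_score(measured, predicted)              # 1 - 700 / 12500 = 0.944
mae = mean_absolute_error(measured, predicted)  # 50 / 4 = 12.5
```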
To estimate cross-validated predictive performance within the curated dataset, 5-fold cross-validation (CV) was adopted throughout this study. The dataset was randomly shuffled before splitting, and the same CV setting was used for all models and feature configurations to ensure a consistent basis for comparison. For algorithms sensitive to feature scaling, including linear models, support vector regression, and k-nearest neighbors, numerical features were standardized using training-fold statistics only in order to avoid information leakage. Tree-based models were trained on the original encoded features without scaling.
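The leakage-free scaling step can be sketched in plain Python as shown below: indices are shuffled once, split into five folds, and the standardization statistics are computed from the training portion only before being applied to the held-out fold. The helper names, the random seed, and the temperature values are illustrative assumptions, not the authors' code.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Shuffle sample indices once, then yield (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def fit_scaler(values):
    """Return (mean, std) of the training values; std falls back to 1.0."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0
    return mean, std


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


# Made-up feature column with 101 entries (matching the record count).
temperature = [600.0 + 10.0 * i for i in range(101)]

for train_idx, test_idx in kfold_indices(len(temperature)):
    mean, std = fit_scaler([temperature[i] for i in train_idx])
    train_scaled = transform([temperature[i] for i in train_idx], mean, std)
    # The test fold reuses the training-fold statistics to avoid leakage.
    test_scaled = transform([temperature[i] for i in test_idx], mean, std)
```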
Hyperparameter optimization was performed for the nonlinear models using the search strategies described in Section 3, and the selected parameter settings were subsequently used for model comparison under each feature configuration. Final model performance was reported as the mean test-fold R2 and MAE across the five CV splits. A supplementary summary table lists the descriptor blocks, feature-set configurations, evaluation framework, and model-specific hyperparameter search spaces used in this study (Table S3).
2.5. Model Interpretation by SHAP Analysis
To interpret the contribution of individual descriptors to model predictions, SHAP analysis was applied to the final selected tree-based models under the relevant feature configurations [40]. SHAP values were computed using the TreeSHAP algorithm as implemented in the SHAP library, which provides efficient feature-attribution analysis for decision-tree-based models [41].
For each selected model, SHAP analysis was conducted after refitting the model on the corresponding dataset using the optimized hyperparameters obtained from the model-selection procedure. SHAP summary plots were used to rank features according to their mean absolute SHAP values, thereby reflecting their overall importance to the prediction task, and to visualize the distribution and direction of feature effects on the predicted capacity.
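The ranking step behind a SHAP summary plot reduces to averaging absolute SHAP values per feature. The sketch below assumes the SHAP matrix (one row per sample, one column per feature) has already been computed; the feature names and numbers are invented for illustration and are not results from this study.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n_samples = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature names and SHAP values (mAh g-1), three samples.
names = ["current_density", "pyrolysis_temperature", "lignin_level"]
shap_matrix = [
    [-30.0, 10.0, 5.0],
    [20.0, -12.0, -4.0],
    [-25.0, 8.0, 6.0],
]

ranking = rank_by_mean_abs_shap(shap_matrix, names)
```

Because the absolute value is taken before averaging, this ranking reflects the magnitude of a feature's contribution and not its direction; the direction is read from the sign distribution in the summary plot.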
Because the purpose of this analysis was to compare model-learned feature contributions across different feature sets, SHAP results were interpreted as statistical associations learned from the curated literature-derived dataset rather than as causal effects. In this way, the SHAP analysis provides an interpretable basis for examining how precursor descriptors, process variables, and structural descriptors contribute to sodium-storage capacity prediction in sulfur-containing biomass-derived hard carbons.
3. Results
3.1. Dataset Overview and Problem Setting
We first summarize the curated literature-derived dataset and the empirical distributions of the key input descriptors in order to contextualize the subsequent modeling results. The dataset comprises 101 records extracted from 16 journal articles, with each record corresponding to a distinct combination of precursor identity, processing conditions, and reported electrochemical capacity for sulfur-containing biomass-derived hard-carbon anodes. In addition to precursor-related descriptors (F), process descriptors (P), and structural descriptors (S, when available), the electrochemical current density was included as an input variable to account for testing-condition dependence across literature sources.
The descriptor distributions are summarized in Figure 2. The precursor-related and upstream-conditioning variables exhibit substantial heterogeneity across the curated literature space, indicating that the dataset spans a broad range of biomass classes, precursor chemistries, and treatment histories. Likewise, the process variables cover diverse preparation routes, including differences in carbonization strategy, treatment temperature, and residence time. Such variability is expected for literature-derived hard-carbon datasets and motivates the need for controlled feature-set comparisons rather than direct manual comparison alone.
The structural descriptor block also shows broad distributions, with measurable variation in specific surface area (SSA), interlayer spacing (d002), and Raman ID/IG. At the same time, these descriptors are not uniformly available across all records, reflecting the practical reality that structural characterization is often incomplete or inconsistently reported in the literature. This uneven coverage is particularly relevant to the present study, because it motivates evaluation of low-characterization settings in which precursor and process information is used to complement or partially substitute for conventional structural inputs.
The distribution of current density further highlights the heterogeneity of the literature-derived dataset. Since electrochemical capacity depends not only on material properties but also on testing conditions, current density was incorporated into all feature configurations to improve comparability across sources. Taken together, the dataset represents a heterogeneous but practically relevant benchmark for evaluating whether accessible upstream information from precursors and processing can support predictive modeling of sodium-storage capacity in sulfur-containing biomass-derived hard carbons.
To further examine first-order relationships among the input descriptors and the reported electrochemical performance, Pearson correlation coefficients were computed and visualized as a clustered correlation heatmap (Figure 3). Overall, the capacity shows only weak to moderate linear correlations with individual descriptors across the pooled literature-derived dataset, indicating that sodium-storage behavior in sulfur-containing biomass-derived hard carbons cannot be adequately captured by any single reported variable alone. This observation is consistent with the multiscale and interacting nature of sodium storage in hard carbon, as well as with the practical heterogeneity of literature-reported descriptors.
Among the pairwise associations with capacity, the strongest negative correlations are observed for polysaccharides and current density (see Figure 3), while the correlations with most other individual descriptors remain comparatively weak. The negative correlation with current density is physically reasonable, as reported capacity generally decreases with increasing testing rate, and it further justifies including current density as an input variable in all feature configurations. The negative association with polysaccharides suggests that precursor class may influence the resulting carbon structure and sodium-storage behavior, although this effect is clearly not independent of processing history and other precursor-related variables.
The heatmap also reveals several stronger correlations among the input descriptors themselves (see Figure 3). In particular, lignin content is strongly positively correlated with synergy, while volatiles shows positive correlations with postleaching and synergy. In contrast, step carbonization is negatively correlated with lignin content and synergy. These relationships indicate that precursor chemistry, upstream conditioning, and process choices are not independent in the curated literature space, but instead tend to co-vary along broader formation pathways.
At the same time, the measured structural descriptors do not exhibit overwhelmingly strong direct correlations with capacity. For example, SSA shows only a very weak correlation with capacity, while d002 and ID/IG display weak negative correlations (see Figure 3). This suggests that, within a pooled cross-study dataset, commonly reported structural metrics may not serve as universally transferable linear predictors of performance, even though they remain physically meaningful descriptors of carbon structure. This observation provides an important rationale for the subsequent use of nonlinear machine-learning models and controlled feature-set ablation.
It should be emphasized that this correlation analysis is descriptive rather than mechanistic. Pearson coefficients capture only pairwise linear associations and may be influenced by confounding factors, heterogeneous reporting practices, and uneven descriptor coverage across studies. Nevertheless, the observed correlation structure provides a useful exploratory baseline and supports the central premise of this work: precursor and process descriptors contain nontrivial upstream information, while conventional structural descriptors, although informative, are not by themselves sufficient to explain capacity variation across the literature-derived dataset.
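The limitation noted above, that Pearson coefficients capture only linear association, can be made concrete with a small sketch. The function follows the standard definition of the Pearson coefficient; all numerical values are made up for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient from its standard definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


# Made-up rate data: capacity falls as current density rises, so r < 0.
current_density = [0.05, 0.1, 0.5, 1.0, 2.0]   # A g-1
capacity = [320.0, 300.0, 240.0, 200.0, 150.0]  # mAh g-1
r_rate = pearson_r(current_density, capacity)

# A perfectly deterministic but symmetric nonlinear relation gives r = 0,
# which is why a weak Pearson coefficient does not rule out dependence.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]
r_nonlinear = pearson_r(x, y)
```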
3.2. Hyperparameter Optimization and Feature-Set Comparison
To compare predictive performance across feature configurations, hyperparameter optimization (HPO) was performed for the selected model candidates using both Bayesian optimization and random search, and the resulting optimization trajectories together with the final best cross-validated R2 values are summarized in Figure 4. In the upper panel, the curves represent the best-so-far 5-fold cross-validated test R2 as a function of iteration number, whereas the lower panel compares the final best R2 values obtained for the representative model–feature-set combinations. Although the two search strategies followed somewhat different optimization paths, they converged to the same overall ranking pattern across feature configurations, indicating that the comparative conclusions are not strongly dependent on the search strategy itself.
Across all feature configurations, the optimization curves exhibit a common pattern of rapid early improvement followed by gradual saturation. This suggests that a substantial fraction of model performance can be recovered during the early search stage, whereas later iterations mainly provide incremental gains. Bayesian optimization generally reaches competitive performance more quickly in the early stage, consistent with its ability to guide the search toward promising regions of the hyperparameter space. By contrast, random search more often displays delayed upward jumps, reflecting broader exploration and occasional discovery of favorable configurations at later iterations. Importantly, however, these differences in search trajectory do not change the broader conclusion regarding the relative predictive value of the descriptor blocks.
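A best-so-far trajectory of the kind described above can be produced by a few lines of random search. In the sketch below, the objective is a toy stand-in for the mean 5-fold cross-validated R2, and the hyperparameter names and ranges are hypothetical; the actual search spaces are those reported by the authors in Table S3.

```python
import random


def random_search(objective, sample_params, n_iter=30, seed=0):
    """Run random search and record the best-so-far score per iteration."""
    rng = random.Random(seed)
    best_score, best_params, trajectory = float("-inf"), None, []
    for _ in range(n_iter):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
        trajectory.append(best_score)
    return best_params, trajectory


def toy_objective(params):
    # Placeholder for a cross-validated R2; peaks at 0.6 by construction.
    depth_penalty = 0.01 * (params["max_depth"] - 6) ** 2
    rate_penalty = 2.0 * (params["learning_rate"] - 0.1) ** 2
    return 0.6 - depth_penalty - rate_penalty


def sample_params(rng):
    # Hypothetical search space for a gradient-boosted tree model.
    return {
        "max_depth": rng.randint(2, 12),
        "learning_rate": rng.uniform(0.01, 0.3),
    }


best, curve = random_search(toy_objective, sample_params)
```

The trajectory is non-decreasing by construction, which is why such curves show plateaus interrupted by occasional upward jumps when a better configuration is sampled.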
More importantly, the final performance comparison reveals a clear dependence on feature-set composition. The F + P + S configuration delivers the best overall predictive performance, reaching a maximum cross-validated R2 of 0.61, with nearly identical performance under Bayesian optimization and random search (R2 = 0.610 and 0.606, respectively). When structural descriptors are removed, the F + P configuration still maintains strong predictive ability, achieving R2 = 0.585 under the best setting. The performance gap between F + P + S and F + P is therefore small (ΔR2 = 0.025), indicating that precursor and process descriptors together retain a substantial fraction of the predictive information contained in the full feature set. In the context of a heterogeneous literature-derived dataset, this is a meaningful result, because precursor identity and processing history may preserve upstream formation-pathway information even when downstream structural characterization is incomplete or not uniformly comparable across studies.
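As a quick arithmetic check of the figures quoted above, the gap and the retained fraction can be written out explicitly. The ratio in the second expression is an illustrative way of expressing "a substantial fraction" and is not a quantity reported by the authors; the subscripts are shorthand for the two feature configurations.

```latex
\Delta R^2 = R^2_{\mathrm{F+P+S}} - R^2_{\mathrm{F+P}} = 0.610 - 0.585 = 0.025,
\qquad
\frac{R^2_{\mathrm{F+P}}}{R^2_{\mathrm{F+P+S}}} = \frac{0.585}{0.610} \approx 0.96
```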
By comparison, the conventional P + S setting reaches intermediate performance (R2 = 0.569 and 0.559), whereas the P and S configurations alone remain lower, with best R2 values close to 0.50. These results suggest that neither process descriptors nor measured structural descriptors alone are sufficient to explain the capacity variation observed across the curated literature space. More importantly, the fact that F + P outperforms P + S indicates that domain-knowledge-guided precursor descriptors are not merely auxiliary metadata, but can provide nontrivial predictive signal beyond a limited set of commonly reported structural proxies. This does not imply that structure is unimportant. Rather, it suggests that, in pooled cross-study datasets of sulfur-containing biomass-derived hard carbons, the structural descriptors most commonly available from the literature may not fully capture the sodium-hosting environments governing performance. Additional inspection of the squared-error behavior using Mean Squared Error (MSE) showed the same overall ranking pattern across feature-set configurations as R2 and MAE, indicating that the main comparative conclusions were not sensitive to the specific error metric used.
Taken together, these results support the central premise of this study: accessible upstream information from precursor identity and processing history captures meaningful formation-pathway information relevant to sodium storage. While the inclusion of structural descriptors still provides the highest predictive performance, the relatively small loss from F + P + S to F + P demonstrates that useful screening-level prediction can still be achieved when conventional structural characterization is unavailable, incomplete, or difficult to compare across studies. From a practical perspective, this reinforces the value of domain-knowledge-guided precursor encoding as a complementary route for low-characterization, literature-derived prediction and experimental prioritization.
3.3. SHAP-Based Interpretation Across Feature Configurations
To further understand how different information sources contribute to capacity prediction, SHAP analysis was performed for representative tree-based models under four feature configurations: P, P + S, F + P + S, and F + P. The corresponding SHAP summary plots are presented in Figure 5. Because the purpose of this analysis is comparative rather than causal, the SHAP results are interpreted here as model-learned statistical associations that help explain why certain feature combinations yield stronger predictive performance than others.
Across all four configurations, current density remains one of the most influential variables. In every model, its SHAP distribution spans a wide range, indicating that testing rate exerts a substantial effect on the reported capacity values extracted from the literature. In general, higher current density tends to contribute negatively to predicted capacity, whereas lower current density is more often associated with positive SHAP values. This result is physically reasonable and further supports the decision to include current density in all feature configurations when learning from heterogeneous cross-study data.
Among the process-related variables, step carbonization, pyrolysis temperature (PT), activation, and residence time repeatedly appear as influential descriptors. In the P model, these process variables dominate the prediction almost entirely, indicating that, even without structural or precursor descriptors, the model is able to extract a meaningful but limited signal from synthesis history alone. When structural descriptors are introduced in the P + S model, variables such as SSA, d002, and ID/IG become prominent, confirming that measured structural characteristics provide additional predictive information. This is consistent with the physical expectation that sodium storage in hard carbon is ultimately rooted in structure.
The most informative comparison is between F + P + S and F + P. In the F + P + S model, the dominant features include a mixture of testing-condition, process, structural, and precursor-related descriptors. In addition to current density, process variables and structural metrics such as d002, SSA, and ID/IG contribute substantially, while precursor-related descriptors including lignin content, volatiles, synergy, and polysaccharides remain non-negligible. This indicates that precursor information contributes beyond what is captured by conventional structural descriptors alone.
When structural descriptors are removed, the F + P model shows a clear redistribution of importance. Process variables remain important, but precursor-related descriptors become more prominent, especially lignin content, volatiles, and synergy. In other words, when direct structural measurements are unavailable, the model shifts part of its reliance toward precursor descriptors that encode upstream formation information. This behavior provides a mechanistic rationale for the strong performance of the F + P setting observed in Figure 4. Although the F + P model does not fully match the predictive power of F + P + S, its performance remains close, suggesting that domain-knowledge-guided precursor descriptors can partially compensate for missing structural characterization.
Taken together, the SHAP analysis supports a coherent interpretation of the modeling results. Structure-related variables remain important whenever they are available, but precursor and process descriptors also carry substantial predictive information that is not redundant with conventional structural proxies. This explains why the full feature set (F + P + S) achieves the highest performance, while the low-characterization configuration (F + P) remains highly competitive. More broadly, these results support the central premise of this work: precursor identity and processing history encode meaningful formation-pathway information that can be leveraged for interpretable, screening-oriented prediction of sodium-storage capacity in sulfur-containing biomass-derived hard carbons. Because several precursor descriptors are coarse-grained ordinal proxies in a relatively small and heterogeneous literature-derived dataset, the SHAP results are interpreted here at the level of relative contribution and importance redistribution rather than as stable monotonic dependence relationships.
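One way to quantify the importance redistribution discussed above is to aggregate mean absolute SHAP values by descriptor block and normalize them to shares. The sketch below assumes a plain list-of-rows SHAP matrix; the feature names, block labels, and numbers are hypothetical placeholders rather than values from this study.

```python
def block_importance_shares(shap_rows, feature_names, feature_blocks):
    """Share of total mean |SHAP| carried by each descriptor block."""
    n_samples = len(shap_rows)
    totals = {}
    for j, name in enumerate(feature_names):
        # Global importance of one feature: mean absolute SHAP value over samples.
        mean_abs = sum(abs(row[j]) for row in shap_rows) / n_samples
        block = feature_blocks[name]
        totals[block] = totals.get(block, 0.0) + mean_abs
    grand_total = sum(totals.values())
    return {block: value / grand_total for block, value in totals.items()}


# Hypothetical SHAP matrix: rows are samples, columns follow feature_names.
feature_names = ["current_density", "pyrolysis_temp", "lignin_content"]
feature_blocks = {
    "current_density": "test",
    "pyrolysis_temp": "process",
    "lignin_content": "precursor",
}
shap_rows = [
    [-2.0, 1.0, 1.0],
    [2.0, -1.0, 3.0],
]
shares = block_importance_shares(shap_rows, feature_names, feature_blocks)
print(shares)
```

Comparing such block-level shares between two configurations (for example, with and without the structural block) gives a compact view of where importance moves, without implying any monotonic dependence for an individual descriptor.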
4. Discussion
The results of this study support a consistent interpretation of capacity prediction in sulfur-containing biomass-derived hard carbons: electrochemical behavior remains fundamentally structure-rooted, but in literature-derived datasets, the commonly reported structural descriptors are often incomplete, unevenly reported, and not always sufficiently transferable to serve as strong stand-alone predictors across studies. This is reflected in the present results, where the structure-only setting does not outperform the other feature configurations, and where the conventional P + S combination remains inferior to both F + P + S and F + P. In other words, the limited performance of the structure-based configurations should not be interpreted as evidence that structure is unimportant; rather, it indicates that the specific structural proxies typically available from the literature, such as SSA and the other commonly reported structural metrics, do not fully capture the sodium-hosting environments relevant to capacity when data are pooled across heterogeneous sources.
A central finding is that the inclusion of domain-knowledge-guided precursor descriptors leads to a clear improvement in predictive performance. The full-information model (F + P + S) achieves the highest cross-validated performance, but the low-characterization configuration (F + P) remains highly competitive, with only a small performance loss relative to the full model. At the same time, F + P outperforms the conventional P + S configuration. This comparison is particularly important because it suggests that precursor descriptors are not merely supplementary metadata. Instead, when constructed in a domain-guided and interpretable manner, they encode meaningful formation-pathway information that complements process variables and partially compensates for missing structural characterization.
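The feature-set comparison described above follows a generic ablation pattern: the same model and the same cross-validation splits are reused while only the descriptor blocks change. The sketch below shows that pattern under stated assumptions. The k-nearest-neighbour regressor is a simple stand-in (not the model family used in this study), the data are synthetic and pre-scaled to [0, 1], and the column names are hypothetical.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination on pooled out-of-fold predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, x, k):
    """Mean target of the k nearest training rows (squared Euclidean distance)."""
    pairs = sorted(
        ((sum((a - b) ** 2 for a, b in zip(row, x)), y) for row, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    nearest = pairs[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_r2(rows, target, columns, n_splits=5, k=3, seed=0):
    """K-fold cross-validated R^2 using only the given columns."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    y_true, y_pred = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [[rows[i][c] for c in columns] for i in train]
        train_y = [target[i] for i in train]
        for i in fold:
            y_true.append(target[i])
            y_pred.append(knn_predict(train_x, train_y, [rows[i][c] for c in columns], k))
    return r2_score(y_true, y_pred)


# Synthetic, pre-scaled data; column names and coefficients are hypothetical.
rng = random.Random(1)
rows = [
    {name: rng.random() for name in ("current_density", "lignin_level", "pyrolysis_temp", "ssa")}
    for _ in range(60)
]
target = [
    300.0 - 150.0 * r["current_density"] + 80.0 * r["lignin_level"]
    + 40.0 * r["pyrolysis_temp"] + 20.0 * r["ssa"]
    for r in rows
]

blocks = {"F": ["lignin_level"], "P": ["pyrolysis_temp"], "S": ["ssa"]}
configs = {"F + P + S": ["F", "P", "S"], "F + P": ["F", "P"], "P + S": ["P", "S"]}

scores = {}
for name, members in configs.items():
    # Current density is kept in every configuration, as in the text.
    columns = ["current_density"] + [c for b in members for c in blocks[b]]
    scores[name] = cv_r2(rows, target, columns)
print(scores)
```

Fixing the seed keeps the folds identical across configurations, so differences in the scores reflect the descriptor blocks rather than the split.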
The SHAP analysis further clarifies how this improvement arises. Across all feature configurations, current density remains one of the most influential variables, confirming that test conditions strongly affect the reported capacities extracted from the literature and should be explicitly accounted for when modeling heterogeneous literature-derived data. Process variables such as step carbonization, pyrolysis temperature, activation, and residence time also repeatedly appear among the most important descriptors, indicating that synthesis history carries substantial predictive information. When structural descriptors are included, SSA and the other measured structural metrics become important, which is consistent with the physical expectation that sodium storage is ultimately governed by carbon structure. However, when structural descriptors are removed, the model does not collapse. Instead, the importance of precursor-related descriptors, especially lignin content, volatiles, and synergy, increases substantially. This redistribution of feature importance provides a direct explanation for why the F + P model remains effective: precursor descriptors act as upstream indicators of how carbon structure is likely to evolve during processing, even when direct structural measurements are unavailable.
This interpretation is also consistent with the correlation analysis. Capacity does not show strong linear dependence on any single descriptor, and the measured structural descriptors themselves are only weakly correlated with capacity in the pooled dataset. Meanwhile, several precursor and process variables show stronger correlations with one another, suggesting that literature-derived hard-carbon datasets contain coupled patterns of precursor chemistry, conditioning history, and synthesis strategy rather than isolated, independently varying factors. Under such conditions, nonlinear models are better suited than simple linear comparisons, and feature-set ablation becomes necessary to determine which categories of information are truly useful for prediction.
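The point that a weak linear correlation does not rule out a strong dependence can be shown with a small worked example. The sketch below computes the Pearson coefficient for a purely quadratic and a purely linear relationship; the data are synthetic and chosen only to make the contrast visible.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_nonlinear = [v * v for v in x]        # fully determined by x, but not linearly
y_linear = [2.0 * v + 1.0 for v in x]   # fully determined by x, linearly

r_nonlinear = pearson(x, y_nonlinear)
r_linear = pearson(x, y_linear)
print(r_nonlinear, r_linear)
```

The quadratic target is completely determined by `x` yet has essentially zero linear correlation with it, which is the kind of situation in which nonlinear models and feature-set ablation are more informative than pairwise correlation coefficients.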
From a practical perspective, these results support a screening-oriented workflow that does not rely on complete structural characterization at the earliest stage. In many biomass-derived hard-carbon studies, precursor diversity is large, synthesis routes are numerous, and detailed structural characterization is time-consuming and not always consistently reported. The strong performance of the F + P configuration suggests that accessible upstream information from precursor identity and processing history can be used for first-pass ranking and prioritization, while more detailed structural characterization can be reserved for a smaller number of promising candidates. In this sense, the present framework is not intended to replace structural analysis, but rather to complement it by providing an interpretable and practically deployable prediction route under low-characterization conditions.
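The first-pass ranking step of such a workflow can be sketched in a few lines. The candidate records, field names, and the linear `toy_predict` function below are hypothetical placeholders for a trained F + P model; only the ranking-and-shortlisting logic is the point of the example.

```python
def shortlist(candidates, predict, top_k):
    """Rank candidates by predicted capacity and keep the top_k for characterization."""
    ranked = sorted(candidates, key=predict, reverse=True)
    return ranked[:top_k]


def toy_predict(candidate):
    # Hypothetical stand-in for a trained precursor-plus-process model.
    return 250.0 + 20.0 * candidate["lignin_level"] - 30.0 * candidate["current_density"]


candidates = [
    {"name": "A", "lignin_level": 3, "current_density": 0.1},
    {"name": "B", "lignin_level": 1, "current_density": 0.1},
    {"name": "C", "lignin_level": 2, "current_density": 0.1},
    {"name": "D", "lignin_level": 3, "current_density": 1.0},
]

selected = shortlist(candidates, toy_predict, top_k=2)
print([c["name"] for c in selected])
```

Only the shortlisted candidates would then proceed to detailed structural characterization, which is where the structural descriptors become available for a fuller model.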
Several limitations should nevertheless be acknowledged. First, the dataset is literature-derived and relatively small, and descriptor coverage remains uneven across studies. In addition, the precursor descriptors used here are coarse-grained, domain-guided proxy encodings derived from literature information, and some degree of assignment subjectivity is therefore unavoidable even with an explicit rubric. Moreover, the present descriptor system was designed for sulfur-containing biomass-derived hard carbons, and its transferability to non-sulfur-containing systems or to carbons derived from synthetic polymers or fossil precursors remains to be tested in future work. Second, sulfur-related details were not reported in a sufficiently consistent manner to be parameterized explicitly, so sulfur incorporation was treated as part of the dataset definition rather than as an independent modeling variable. Third, although current density was included to improve comparability, other experimental factors, such as electrolyte composition, electrode loading, and voltage window, may still contribute residual noise. Accordingly, the present results should be interpreted as model-learned statistical regularities within a curated literature-derived dataset rather than as universal causal rules. Even with these constraints, the consistency of the feature-ablation results and the SHAP-based redistribution of importance support the central conclusion of this work: domain-knowledge-guided precursor descriptors provide a practical and interpretable complement to conventional structural descriptors for predicting sodium-storage capacity in sulfur-containing biomass-derived hard carbons.
5. Conclusions
In this work, we curated a focused literature-derived dataset of sulfur-containing biomass-derived hard carbons and compared the predictive value of precursor, process, and structural descriptors. The results show that precursor-plus-process information remained highly competitive and outperformed the conventional process-plus-structure setting.
Within this curated literature space, these findings suggest that domain-knowledge-guided precursor descriptors can complement conventional structural descriptors and support low-characterization, screening-oriented prediction, rather than replace structural analysis.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16083706/s1, Table S1: Full list of datasets; Table S2: Operational rubric and representative examples for assigning domain-knowledge-guided precursor descriptors; Table S3: Summary of the machine-learning experimental design, descriptor blocks, evaluation framework, and model-specific hyperparameter search spaces.
Author Contributions
Conceptualization, S.W. (Shule Wang), M.F. and K.S.; methodology, S.W. (Shule Wang) and C.Y.; software, S.W. (Shule Wang); validation, C.Y., J.L. and Y.J.; formal analysis, S.W. (Shitao Wen); investigation, C.Y. and S.Q.; resources, M.F. and A.W.; data curation, Y.J., S.W. (Shitao Wen), J.L. and C.Y.; writing—original draft preparation, C.Y. and S.W. (Shule Wang); writing—review and editing, K.S., M.F. and S.W. (Shule Wang); visualization, C.Y. and S.W. (Shule Wang); supervision, K.S.; project administration, K.S.; funding acquisition, S.W. (Shule Wang) and K.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Fundamental Research Funds of CAF (No. CAFYBB2023ZA011). The article processing charge (APC) was supported by INRAE (Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement) through a Chaire de Professeur Junior (CPJ) funding of S.W. (Shule Wang), contract no. 168-2025.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
All the data used to generate the results in this study are explicitly documented within the main text, Table S1, and the references.
Acknowledgments
During manuscript preparation, the authors used ChatGPT 5.4 (OpenAI, San Francisco, CA, USA) for grammar correction and language enhancement. All outputs were reviewed and verified by the authors.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Phogat, P.; Dey, S.; Wan, M. Comprehensive review of Sodium-Ion Batteries: Principles, Materials, Performance, Challenges, and future Perspectives. Mater. Sci. Eng. B 2025, 312, 117870.
- Wu, C.; Yang, Y.; Zhang, Y.; Xu, H.; He, X.; Wu, X.; Chou, S. Hard carbon for sodium-ion batteries: Progress, strategies and future perspective. Chem. Sci. 2024, 15, 6244–6268.
- Zhong, B.; Liu, C.; Xiong, D.; Cai, J.; Li, J.; Li, D.; Cao, Z.; Song, B.; Deng, W.; Peng, H.; et al. Biomass-Derived Hard Carbon for Sodium-Ion Batteries: Basic Research and Industrial Application. ACS Nano 2024, 18, 16468–16488.
- Samanta, R.; Roy, S.; Barman, S. Sulfur and Nitrogen Codoped Hard Carbon with Expanded Interlayer Distance as an Effective Anode Material for Sodium-Ion Batteries. Energy Fuels 2024, 38, 19867–19877.
- Wan, B.; Zhang, H.; Tang, S.; Li, S.; Wang, Y.; Wen, D.; Zhang, M.; Li, Z. High-sulfur-doped hard carbon for sodium-ion battery anodes with large capacity and high initial coulombic efficiency. Sustain. Energy Fuels 2022, 6, 4338–4345.
- Xu, L.; Li, Y.; Xiang, Y.; Li, C.; Zhu, H.; Li, C.; Zou, G.; Hou, H.; Ji, X. Bridging Structure and Performance: Decoding Sodium Storage in Hard Carbon Anodes. ACS Nano 2025, 19, 14627–14651.
- Stevens, D.A.; Dahn, J.R. High Capacity Anode Materials for Rechargeable Sodium-Ion Batteries. J. Electrochem. Soc. 2000, 147, 1271.
- Stevens, D.A.; Dahn, J.R. The Mechanisms of Lithium and Sodium Insertion in Carbon Materials. J. Electrochem. Soc. 2001, 148, A803.
- Stratford, J.M.; Kleppe, A.K.; Keeble, D.S.; Chater, P.A.; Meysami, S.S.; Wright, C.J.; Barker, J.; Titirici, M.-M.; Allan, P.K.; Grey, C.P. Correlating Local Structure and Sodium Storage in Hard Carbon Anodes: Insights from Pair Distribution Function Analysis and Solid-State NMR. J. Am. Chem. Soc. 2021, 143, 14274–14286.
- Zeng, Y.; Yang, J.; Yang, H.; Yang, Y.; Zhao, J. Bridging Microstructure and Sodium-Ion Storage Mechanism in Hard Carbon for Sodium Ion Batteries. ACS Energy Lett. 2024, 9, 1184–1191.
- Kitchamsetti, N.; Kim, K.-h.; Han, H.; Mhin, S. Biomass-Derived Hard Carbon Anodes for Sodium-Ion Batteries: Recent Advances in Synthesis Strategies. Nanomaterials 2025, 15, 1554.
- Zhang, H.; Lin, S.; Shu, C.; Tang, Z.; Wang, X.; Wu, Y.; Tang, W. Advances and perspectives of hard carbon anode modulated by defect/hetero elemental engineering for sodium ion batteries. Mater. Today 2025, 85, 231–252.
- Li, J.; Jin, Y.; Sun, K.; Wang, A.; Zhang, G.; Zhou, L.; Yang, W.; Fan, M.; Jiang, J.; Wen, Y.; et al. Unveiling the role of lignin in biomass-derived hard carbon anodes via machine learning. J. Power Sources 2025, 631, 236323.
- He, X.-X.; Li, L.; Wu, X.; Chou, S.-L. Sustainable Hard Carbon for Sodium-Ion Batteries: Precursor Design and Scalable Production Roadmaps. Adv. Mater. 2025, 37, 2506066.
- del Mar Saavedra Rios, C.; Simonin, L.; Ghimbeu, C.M.; Vaulot, C.; da Silva Perez, D.; Dupont, C. Impact of the biomass precursor composition in the hard carbon properties and performance for application in a Na-ion battery. Fuel Process. Technol. 2022, 231, 107223.
- Himanen, L.; Geurts, A.; Foster, A.S.; Rinke, P. Data-Driven Materials Science: Status, Challenges, and Perspectives. Adv. Sci. 2019, 6, 1900808.
- Zhang, Y.; Ling, C. A strategy to apply machine learning to small datasets in materials science. npj Comput. Mater. 2018, 4, 25.
- Aristote, N.T.; Song, Z.; Deng, W.; Hou, H.; Zou, G.; Ji, X. Effect of double and triple-doping of sulfur, nitrogen and phosphorus on the initial coulombic efficiency and rate performance of the biomass derived hard carbon as anode for sodium-ion batteries. J. Power Sources 2023, 558, 232517.
- Wang, H.; Chen, H.; Chen, C.; Li, M.; Xie, Y.; Zhang, X.; Wu, X.; Zhang, Q.; Lu, C. Tea-derived carbon materials as anode for high-performance sodium ion batteries. Chin. Chem. Lett. 2023, 34, 107465.
- de Tomas, C.; Alabidun, S.; Chater, L.; Darby, M.T.; Raffone, F.; Restuccia, P.; Au, H.; Titirici, M.M.; Cucinotta, C.S.; Crespo-Ribadenyra, M. Doping carbon electrodes with sulfur achieves reversible sodium ion storage. J. Phys. Energy 2023, 5, 024006.
- Muruganantham, R.; Wang, F.-M.; Liu, W.-R. A green route N, S-doped hard carbon derived from fruit-peel biomass waste as an anode material for rechargeable sodium-ion storage applications. Electrochim. Acta 2022, 424, 140573.
- Aristote, N.T.; Liu, C.; Deng, X.; Liu, H.; Gao, J.; Deng, W.; Hou, H.; Ji, X. Sulfur-doping biomass based hard carbon as high performance anode material for sodium-ion batteries. J. Electroanal. Chem. 2022, 923, 116769.
- Li, X.; Yang, C.; Wang, S.; Mao, X.; Yu, K. Comprehensive study on improving the sodium storage performance of low-defect biomass-derived carbon through S or N doping. Diam. Relat. Mater. 2022, 129, 109382.
- Wan, H.; Shen, X.; Jiang, H.; Zhang, C.; Jiang, K.; Chen, T.; Shi, L.; Dong, L.; He, C.; Xu, Y.; et al. Biomass-derived N/S dual-doped porous hard-carbon as high-capacity anodes for lithium/sodium ions batteries. Energy 2021, 231, 121102.
- Zhao, G.; Yu, D.; Zhang, H.; Sun, F.; Li, J.; Zhu, L.; Sun, L.; Yu, M.; Besenbacher, F.; Sun, Y. Sulphur-doped carbon nanosheets derived from biomass as high-performance anode materials for sodium-ion batteries. Nano Energy 2020, 67, 104219.
- Cao, L.; Wang, Y.; Hu, H.; Huang, J.; Kou, L.; Xu, Z.; Li, J. A N/S-codoped disordered carbon with enlarged interlayer distance derived from cirsium setosum as high-performance anode for sodium ion batteries. J. Mater. Sci. Mater. Electron. 2019, 30, 21323–21331.
- Feng, P.; Wang, W.; Wang, K.; Cheng, S.; Jiang, K. A high-performance carbon with sulfur doped between interlayers and its sodium storage mechanism as anode material for sodium ion batteries. J. Alloys Compd. 2019, 795, 223–232.
- He, L.; Sun, Y.-r.; Wang, C.-l.; Guo, H.-y.; Guo, Y.-q.; Li, C.; Zhou, Y. High performance sulphur-doped pitch-based carbon materials as anode materials for sodium-ion batteries. New Carbon Mater. 2020, 35, 420–427.
- Jin, Q.; Li, W.; Wang, K.; Feng, P.; Li, H.; Gu, T.; Zhou, M.; Wang, W.; Cheng, S.; Jiang, K. Experimental design and theoretical calculation for sulfur-doped carbon nanofibers as a high performance sodium-ion battery anode. J. Mater. Chem. A 2019, 7, 10239–10245.
- Li, Q.; Zhang, Y.-N.; Feng, S.; Liu, D.; Wang, G.; Tan, Q.; Jiang, S.; Yuan, J. N, S self-doped porous carbon with enlarged interlayer distance as anode for high performance sodium ion batteries. Int. J. Energy Res. 2021, 45, 7082–7092.
- Sun, M.; Qu, Y.; Zeng, F.; Yang, Y.; Xu, K.; Yuan, C.; Lu, Z.-H. Hierarchical Porous and Sandwich-like Sulfur-Doped Carbon Nanosheets as High-Performance Anodes for Sodium-Ion Batteries. Ind. Eng. Chem. Res. 2022, 61, 2126–2135.
- Wang, Q.; Ge, X.; Xu, J.; Du, Y.; Zhao, X.; Si, L.; Zhou, X. Fabrication of Microporous Sulfur-Doped Carbon Microtubes for High-Performance Sodium-Ion Batteries. ACS Appl. Energy Mater. 2018, 1, 6638–6645.
- Zhao, G.; Zou, G.; Hou, H.; Ge, P.; Cao, X.; Ji, X. Sulfur-doped carbon employing biomass-activated carbon as a carrier with enhanced sodium storage behavior. J. Mater. Chem. A 2017, 5, 24353–24360.
- He, J.; Lan, N.; Yu, H.; Du, D.; He, H.; Zhang, C. Chemical crosslinking regulating microstructure of lignin-derived hard carbon for high-performance sodium storage. J. Polym. Sci. 2024, 62, 3216–3224.
- Zhang, G.; Chen, C.; Xu, C.; Li, J.; Ye, H.; Wang, A.; Cao, X.; Sun, K.; Jiang, J. Unraveling the Microcrystalline Carbon Evolution Mechanism of Biomass-Derived Hard Carbon for Sodium-Ion Batteries. Energy Fuels 2024, 38, 8326–8336.
- Wang, A.; Zhang, G.; Li, M.; Sun, Y.; Tang, Y.; Sun, K.; Lee, J.-M.; Fu, G.; Jiang, J. Lignin derived hard carbon for sodium ion batteries: Recent advances and future perspectives. Prog. Mater. Sci. 2025, 152, 101452.
- Jawerth, M.E.; Brett, C.J.; Terrier, C.; Larsson, P.T.; Lawoko, M.; Roth, S.V.; Lundmark, S.; Johansson, M. Mechanical and Morphological Properties of Lignin-Based Thermosets. ACS Appl. Polym. Mater. 2020, 2, 668–676.
- Poursorkhabi, V.; Abdelwahab, M.A.; Misra, M.; Khalil, H.; Gharabaghi, B.; Mohanty, A.K. Processing, Carbonization, and Characterization of Lignin Based Electrospun Carbon Fibers: A Review. Front. Energy Res. 2020, 8, 208.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777.
- Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.