You are currently viewing a new version of our website. To view the old version click .
Materials
  • Article
  • Open Access

10 December 2025

Interpreting Molecular Descriptors for Glass Transition Temperature Prediction and Design of Polyimides

,
,
and
School of Materials and Energy, Guangdong University of Technology, Guangzhou 510006, China
*
Author to whom correspondence should be addressed.
This article belongs to the Section Materials Simulation and Design

Highlights

What are the main findings?
  • A robust seven-descriptor QSPR model predicts polyimide Tg via GA-MLR.
  • Key descriptors (Chi0n, MinPartialCharge) govern chain rigidity and interactions.
  • Tg is controlled by modulation of free volume, unifying the structure-property link.
What are the implications of the main findings?
  • The model provides direct, interpretable physicochemical insights into Tg.
  • Free volume theory offers a unified mechanism for diverse molecular features.
  • Actionable design principles are given for tailoring Tg via molecular architecture.

Abstract

The rational design of polyimides (PIs) with targeted glass transition temperature (Tg) is crucial for advanced microelectronics applications. While data-driven approaches offer promise, there is a pressing need for models that are not only predictive but also physically interpretable, especially with limited datasets. Herein, we present a highly interpretable Quantitative Structure-Property Relationship (QSPR) model for accurate Tg prediction of PIs. Employing a Genetic Algorithm combined with Multiple Linear Regression (GA-MLR), we identified an optimal set of seven molecular descriptors from a curated dataset. The model demonstrates robust predictive performance and strong generalization ability, validated through rigorous statistical tests. Crucially, we provide a deep physicochemical interpretation of the descriptors, unifying their influence under the framework of free volume theory. We show that key descriptors govern Tg by modulating the fractional free volume through distinct mechanisms: descriptors like Chi0n increase free volume by introducing molecular branching that disrupts chain packing, while MinPartialCharge influences Tg through its effect on intermolecular interactions. This mechanistic understanding is translated into clear molecular design guidelines, distinguishing strategies for achieving high-Tg versus processable, low-Tg polymers. Our work establishes a reliable and transparent computational tool that bridges data-driven prediction with fundamental chemical insight for accelerating PIs development.

1. Introduction

Polyimides (PIs) represent a cornerstone of high-performance polymers, renowned for their exceptional thermal stability, mechanical strength, and excellent dielectric properties. These characteristics, coupled with vast structural diversity and molecular designability, have rendered them indispensable in demanding microelectronics applications, including organic light-emitting diode (OLED) displays, flexible printed circuit boards (FPCs), and advanced semiconductor packaging [1,2,3]. The glass transition temperature (Tg) is a critical parameter governing the upper service temperature of PIs, marking the transition from a glassy to a rubbery state where segmental chain motion commences. Rigid polymers exhibit higher Tg values due to restricted bond rotation, while flexible polymers have lower Tg values because of increased segmental mobility. Common flexible groups include methylene, ether, and sulfone bonds, among others; their presence in different positions and proportions leads to significant variations in polymer Tg values. Consequently, the accurate prediction and precise tuning of Tg are crucial for designing new PIs tailored for specific high-temperature operational environments.
Traditional experimental methods for determining Tg, such as differential scanning calorimetry (DSC) and dynamic mechanical analysis (DMA), are often resource-intensive and time-consuming, creating a bottleneck in the materials development cycle. While computational simulations like density functional theory (DFT) and molecular dynamics (MD) can provide valuable atomic insights [4,5,6], they face significant challenges. These include the high computational cost required to bridge the gap to experimental time and length scales, particularly for the slow segmental dynamics governing the glass transition in polymers. Furthermore, the results are sensitive to the chosen force fields and their parameterizations [7,8]. The growing demand for rapid polymer material design highlights the need for advanced computational evaluation and prediction capabilities. In this context, materials informatics—a data-driven paradigm leveraging machine learning (ML)—has emerged as a powerful complementary approach [9,10]. “Data-driven innovation” using artificial intelligence and materials genomics shows significant potential for breakthroughs in polymer science [11,12]. By establishing quantitative structure-property relationships (QSPR), ML models can rapidly predict properties like Tg directly from molecular structure, significantly accelerating the discovery and design of new polymers [13,14,15].
Over the past two decades, ML has provided predictive models validated by experiments, guiding precise material synthesis. Zhang et al. [16] developed an ANN-based QSPR model for the fast Tg prediction of PIs, with a 3.66% error rate verified experimentally. Li and colleagues [17] combined a QSPR model and experiments, finding that the NumRotatableBonds descriptor significantly influences the Tg of PIs. However, the prevailing focus in many QSPR studies has been on maximizing predictive accuracy. Models with high predictive power but low transparency fail to provide the crucial chemical and physical insights necessary for guiding molecular design. Many existing models, particularly those based on complex non-linear algorithms like deep neural networks, often function as “black boxes”, offering limited physicochemical interpretability. This lack of transparency hinders the extraction of fundamental design principles. Furthermore, the availability of high-quality, curated experimental Tg data for diverse PIs can be limited, posing a challenge for building robust and generalizable models [18,19]. Therefore, beyond mere prediction, there is a pressing need to develop QSPR frameworks that are not only predictive but also inherently interpretable, especially when working with modestly sized datasets. Such interpretability is crucial for transforming model predictions into actionable molecular-level understanding and reliable design guidelines.
In this work, we address this need by constructing a highly interpretable QSPR model for predicting the Tg of PIs. We employ a Genetic Algorithm (GA) for feature selection coupled with Multiple Linear Regression (MLR) for model building, a strategy specifically chosen for its robustness and transparency with limited data. Structural data and experimental Tg values of PIs are collected from public literature. The resulting parsimonious model utilizes only seven key molecular descriptors to achieve robust predictive performance. Beyond prediction, our primary objective is to provide deep physicochemical insight by meticulously interpreting these descriptors within the framework of polymer physics, unifying their influence under concepts such as free volume theory to elucidate their role in governing chain rigidity and intermolecular interactions. Ultimately, this study aims to deliver a reliable and transparent computational tool that not only predicts Tg but also provides clear, mechanistic guidance for the rational design of advanced polyimides. Workflow overview is shown in Figure 1.
Figure 1. Workflow overview. (a) Collection and compilation of experimental Tg values for diverse PIs. (b) Generation of molecular descriptors from chemical structures using RDKit. (c) Development and rigorous validation of the ML model for Tg prediction. (d) Extraction of physicochemical insights from the model coefficients to guide the rational design of polyimides.

2. Methodology

2.1. Data Collection and Processing

The experimental Tg values of 100 aromatic PIs were sourced from the comprehensive review by Ding [20]. This review provides a curated compilation of data from numerous primary literature sources, ensuring a consistent and reliable dataset for our modeling work. The full dataset, including chemical structures and original references, is provided in Table S1 of the Supporting Information. Aromatic PIs, characterized by a benzene ring conjugated within an imide pentacyclic structure, are inherently rigid, which underpins their utility in microelectronics. Our dataset encompasses a diverse range of dianhydride and diamine combinations, including commercially available formulations such as PMDA-ODA, systems with flexible segments (e.g., containing ether bonds or methylene groups), high Tg systems with rigid backbones (e.g., incorporating benzene and naphthalene rings), and low Tg systems with flexible chains.
To encode the molecular structures, the repeating unit of each PI was constructed and converted into a SMILES (Simplified Molecular Input Line Entry System) string using the MolToSmiles() function in Jupyter Notebook [21]. These SMILES strings were then transformed into .sdf files, a standard format that explicitly encodes 3D molecular coordinates and connectivity, which serves as input for subsequent descriptor calculation. A total of 208 molecular descriptors, encompassing structural, physicochemical, and topological features, were computed for each structure using the open-source RDKit cheminformatics toolkit (https://www.rdkit.org) [22]. Prior to model construction, all descriptors were standardized (mean-centered and scaled to unit variance) to ensure dimensional homogeneity and enable direct comparison of their regression coefficients.
The collected Tg values range from 377 K to 697 K, covering the entire range without major gaps. The distribution is approximately normal with a peak around 520 K (Figure S1). To ensure a representative data split, the dataset was divided into a training set (80%) and a test set (20%) using a systematic approach [23,24]. Specifically, the Tg values were first arranged in descending order. Then, every fifth data point was systematically selected to constitute the test set, with the remaining points used for training. This structured sampling prevents the test set from being clustered in a specific Tg interval and guarantees that the model is evaluated on data reflecting the overall variability of the dataset.

2.2. QSPR Modeling Based on Machine Learning

The QSPR methodology establishes a quantitative relationship between Tg values and molecular descriptors using machine learning (ML). Descriptor selection and model development were performed using a Genetic Algorithm (GA) [25] coupled with Multiple Linear Regression (MLR) [26]. This GA-MLR framework was strategically chosen for its superior performance with small datasets. Given our dataset size, complex non-linear models carry a high risk of overfitting. In contrast, the parsimonious nature of MLR enhances generalization capability by capturing robust, underlying trends rather than fitting noise [27]. Furthermore, the linear coefficients of the MLR model provide direct, quantitative insight into the influence and direction of each descriptor’s effect on Tg, ensuring superior interpretability compared to “black-box” non-linear alternatives. This combination achieves an optimal balance between predictive robustness and mechanistic interpretability for our Tg analysis of PIs [28].
We constructed QSPR models with 1 to 10 variables. Model performance was evaluated using the squared correlation coefficient (R2) and root mean square error (RMSE). The correlation coefficient of the training set ( R Train 2 ) and the corresponding root-mean-square error (RMSETrain) were calculated as measures of goodness-of-fit using Equations (1) and (2), respectively.
R T r a i n 2 = 1 i = 1 n y i y ^ i 2 i = 1 n y i y ¯ 2
RMSE Train = i = 1 n y i y ^ i 2 n
To ensure model reliability and robustness, the developed models (with 1–10 variables) were subjected to both internal and external validation. Given the small sample set, the optimal QSPR model was verified using Leave-One-Out Cross-Validation (LOOCV) technique [29]. The cross-validated correlation coefficient ( Q Train 2 ) and the corresponding root-mean-square error (RMSELOOCV) were calculated to assess predictive power and prevent overfitting (Equations (3) and (4)).
Q T r a i n 2 = 1 i = 1 n y i y ^ i ,   C V 2 i = 1 n y i y ¯ 2
R M S E L O O C V = i = 1 n y i y ^ i ,   C V 2 n
During the model verification process, it is important to perform external validation [30]. The model trained on the training set was used to predict the Tg of the held-out test set. The test accuracy ( R T e s t 2 ) and root mean square error (RMSETest) were calculated as follows (Equations (5) and (6)):
R T e s t 2 = 1 j = 1 k y j y ^ j 2 j = 1 k y j y ¯ 2
R M S E T e s t = j = 1 k y j y ^ j 2 k
Here, y i and y j are the experimental Tg values for the i-th training and j-th test molecule, respectively. y ^ i and y ^ i ,   C V are the predicted values of the training set and the cross-validation set, respectively. y ¯ and y ¯ are the average experimental Tg values of the training and test sets, respectively. The variables n and k denote the number of PIs in the training and test sets, respectively.

3. Results and Discussion

3.1. Development and Validation of the Optimal QSPR Model

We systematically developed QSPR models with 1 to 10 variables from the experimental Tg dataset to identify the most predictive yet parsimonious model. The complete statistical parameters and molecular descriptors for all candidate models are provided in Table S3, which substantiates the selection. As illustrated in Figure 2a, both the correlation coefficients of the training set ( R Train 2 ) and test set ( R T e s t 2 ) increase as the number of descriptors grows to 7. However, beyond this point, R T e s t 2 sharply declines while R Train 2 continues to improve marginally—a clear indicator of overfitting in the 8- to 10-variable models [31]. This divergence demonstrates that the 7-variable model achieves the optimal balance, leveraging the full predictive power of the descriptors without compromising generalizability. The model is defined by the following equation (see Table S4 for descriptor explanations):
Tg = −0.55Chi0n − 0.45PEOE_VSA7 − 0.38PEOE_VSA8 − 0.36SMR_VSA7 + 0.31MinPartialCharge +
0.072BCUT2D_CHGLO + 0.04SlogP_VSA10.
Figure 2. Using R2 and RMSE as a function of number of descriptors to evaluate the ML models. (a) Analysis of models with 1–10 descriptors by R2. (b) The best 7-variable model with the smallest RMSE (test set).
The reliability of the model depends on its robustness and predictive ability, which we assessed by evaluating the prediction accuracy and root mean square error (RMSE) for the training and test sets, along with the cross-validation coefficient and F-test results. As shown in Figure 2b, the 7-variable QSPR model, trained on 80 PI repeat-unit structures, accurately predicts the Tg of 20 unknown structures with an RMSETest of approximately 28.12 K (representing a relative error of about 5%). This RMSE is superior to those reported in previous research using larger datasets [17,32]. Given the complexity of Tg and the diversity of the datasets, the high level of correlation ( R Train 2 = 0.77 and R T e s t 2 = 0.74) between predicted and observed Tg demonstrates the strong predictive power of the 7-variable QSPR model. The R2 value indicates satisfactory predictive accuracy, for small datasets.
Figure 3a shows the correlation between experimental and predicted Tg using the 7-variable model. Figure 3b displays the Williams plot, which graphically identifies outliers and defines the application domain of the QSPR model. The Y-axis represents the standardized residual (σ), indicating the difference between experimental and predicted values, while the X-axis shows the leverage values (with a warning leverage value h* = 0.3). Observations with standardized residuals outside the range of −3σ to +3σ are considered anomalous. The leverage value (HAT) represents the extent to which a given structure influences the model. Structures with leverage value greater than h* (HAT > h*) signify greater influence on the model. Thus, the Williams plot in Figure 3b confirms all data points fall within the ±3σ limit, validating the reliability of the 7-variable model.
Figure 3. (a) Correlation plot of experimental and predicted Tg for PIs in the 7-variable QSPR model. (b) Williams diagram of standardized residuals (σ) and levers for training and test sets. Green dashed and solid black lines correspond to ±3σ and warning leverage value (h* = 0.3), respectively. (c) y-scrambling diagram: squares represent R2 and Q2 values of the 7-variable QSPR model, dots represent R2yscr and Q2yscr for a model based on random data. (d) Plot with deviations of selected molecular descriptors for PIs with highest error deviations.
We also note that some PIs (e.g., 1, 2, 7, 27, and 97) exhibit larger deviations than others. These discrepancies can be analyzed from multiple perspectives. On the one hand, given that the dataset division can affect model performance, we employed the repeatability measure from y-scrambling [33] to further validate model stability. The y-scrambling plots, generated by randomly permuting the experimental Tg values (i.e., the y-values), further verify the robustness and uniqueness of the optimal QSPR model (see Figure 3c). Each model underwent 2000 simulations, none of which exhibited satisfactory correlation. The best model (marked by a square in the plot) has significantly higher R2 and Q2 values than all other simulated models, confirming that the original model’s performance is not due to chance correlation. This reconfirms the robustness of the developed QSPR model. On the other hand, the estimation error of the descriptor coefficients for Tg may vary slightly among different PIs (see Figure 3d). Substantial deviations in the SlogP_VSA10, BCUT2D_CHGLO, and MinPartialCharge descriptors contribute significantly to the overall Tg residual value for PIs 1, 2, 7, 27, and 97. Therefore, it is essential to understand these key descriptors to address prediction bias and to effectively use the optimal QSPR model for Tg prediction.
Having established the statistical robustness of the 7-variable model, we now delve into the physicochemical significance of its descriptors to unravel the molecular mechanisms governing Tg.

3.2. Physicochemical Interpretation of the Key Molecular Descriptors

The developed model underscores that key molecular descriptors governing the Tg of PIs are primarily related to molecular topology, polarity, and van der Waals surface properties. Figure 4a quantifies the relative impact of each descriptor on Tg through the magnitude and sign (positive or negative) of its regression coefficient.
Figure 4. (a) Coefficients of different descriptors in the 7-variable model. (b) The mechanism of the molecular structures (No. 59 and No. 90) and descriptors in relation to Tg. The elements are carbon (grey), oxygen (red), nitrogen (dark blue), fluorine (light blue) and hydrogen (white).
MinPartialCharge [34] represents the minimum partial atomic charge in a molecule, reflecting its electronegativity and electrostatic character. A higher (less negative) value of MinPartialCharge is associated with an increased electrostatic potential, which strengthens intermolecular interactions. Consequently, this enhanced interaction impedes the movement of polymer segments, thereby elevating the Tg.
Chi0n [35,36] is a zero-order molecular connectivity index that quantifies the topological complexity arising from branching structures. The negative coefficient of this descriptor in our model establishes a clear structure-property relationship: an increase in Chi0n predicts a decrease in Tg. This correlation is rooted in how molecular topology governs the fractional free volume (FFV). Taking Structures No. 59 and No. 90 in the dataset as examples (Table S1), higher Chi0n value corresponds to a more highly branched and three-dimensional molecular architecture (see Figure 4b). These branched components serve as steric obstacles along the polymer backbone, which disrupt efficient chain packing and increase the system’s disordered free volume. According to free volume theory, the cooperative motion of chain segments requires sufficient space. The expanded FFV resulting from branched topology therefore provides more pathways for segmental rotation and movement, which directly lowers the energy barrier for the onset of this motion [37]. Macroscopically, this facilitation of chain mobility manifests as a reduction in the Tg. Furthermore, in the context of our dataset, the primary effect captured by Chi0n is steric. The introduction of multiple branches can dilute the density of strong intermolecular interaction sites (e.g., polar imide groups) per unit volume, which may further contribute to a lowering of the cohesive energy density [38]. In brief, for the PIs in our study, a high Chi0n value signifies a branched topology that increases free volume, reduces packing efficiency, and lowers the energy barrier for segmental motion, collectively resulting in a lower Tg.
PEOE_VSA7, PEOE_VSA8 and SMR_VSA7 are descriptors associated with the van der Waals surface area (VSA), each capturing distinct physicochemical attributes. In our QSPR model, their consistent negative coefficients indicate that increasing values of these descriptors correlate with a decrease in Tg, a trend substantiated by our structural analysis where high-descriptor-value structures exhibit lower Tg (497 K) and low-descriptor-value structures possess higher Tg (678 K) (see Table S1 and Figure 4b). SMR_VSA7 [39,40] quantifies the VSA contributions of atoms characterized by high molar refractivity (SMR), which corresponds to strong polarizability and dispersion forces. Contrary to an intuitive association with enhanced intermolecular cohesion, our data indicate that elevated SMR_VSA7 values often accompany structures with bulky, polarizable groups (e.g., -CF3, complex aromatic systems) that introduce significant steric hindrance. This hindrance primarily disrupts efficient chain packing, creating excess free volume which facilitates segmental motion and thereby lowers Tg. PEOE_VSA7 and PEOE_VSA8 [36] values are derived from the Gasteiger charge contribution to the atomic surface area within specific ranges in a molecule. Similarly, these descriptors represent the VSA of atoms with partial charges in the mildly negative (−0.05, 0) and mildly positive (0, 0.05) ranges, respectively. These regions typically correspond to low-polarity moieties. Higher values for these descriptors signify an expanded molecular surface area with these weakly polar characteristics. Rather than promoting strong, directed intermolecular interactions, such an expanded surface of low-polarity character appears to promote a looser, more disordered packing mode. This effect increases the fractional free volume, reducing the energy barrier for segmental motion and resulting in a lower Tg. In essence, these VSA-based descriptors serve as proxies for molecular features that sterically inhibit efficient packing. The resultant increase in free volume provides the fundamental link to the observed decrease in Tg.
Collectively, the molecular descriptors identified by our model converge on a unified physical mechanism that governs Tg: the modulation of FFV (Equation (7)). According to free volume theory, the total volume of a polymer is partitioned into the occupied volume V O and the free volume V ƒ , with the free volume fraction is described as follows:
F F V = V ƒ V O + V ƒ × 100 %
FFV represents the unoccupied space essential for the initiation of large-scale segmental motion. The key insight from our model is that structural features captured by the descriptors primarily increase the FFV. Specifically, highly branched topologies (high Chi0n) and bulky, polarizable surface groups (high SMR_VSA7, PEOE_VSA7/8) act as molecular-scale “spacers” that sterically disrupt efficient chain packing. This creates additional void space and a more open molecular architecture. Consequently, an increased FFV directly lowers the energy barrier that chain segments must overcome to initiate cooperative motion. This facilitated mobility macroscopically manifests as a decrease in the glass transition temperature. Thus, the amplification of FFV serves as the fundamental link between these molecular descriptors and the depression of Tg in our dataset of PIs.
Compared to the descriptors discussed above, BCUT2D_CHGLO and SlogP_VSA10 exhibit relatively lower absolute coefficients in the QSPR model, indicating a secondary, though still relevant, influence on Tg. BCUT2D_CHGLO [41] quantifies the polarity and heterogeneity of charge distribution within a molecule, reflecting the strength of intermolecular forces and chain rigidity. Higher values signify pronounced charge disparity, such as that found in strongly polar functional groups, whereas lower values correspond to non-polar molecules with homogeneous charge distribution [42]. SlogP_VSA10 [43] is a descriptor that integrates hydrophobicity and molecular surface area, quantifying the contribution of moderately hydrophobic regions. Hydrophobic groups can enhance inter-chain cohesion energy via van der Waals forces or hydrophobic interactions, thereby restricting chain mobility and consequently elevating the Tg. Additionally, an extensive hydrophobic surface area may also impede chain segment rotation, increasing molecular rigidity. However, for the targeted tuning of Tg, modifying the surface area of moderately hydrophobic groups appears to be less effective than modulating descriptors with larger coefficients, such as those governing backbone rigidity and strong polar interactions.
A closer examination of the prediction outliers (e.g., PI units 1, 2, and 7) helps to define the boundaries of our model’s applicability. These structures often possess unique steric effects or specific intra-molecular interactions (e.g., ortho-substitution, complex conformational isomerism) that our current set of global descriptors does not fully capture. In these cases, descriptors like BCUT2D_CHGLO and SlogP_VSA10 appear to exert a stronger influence, leading to their identification as outliers in the Williams plot (Figure 3b). Analyzing these outliers provides valuable direction for future model refinement, suggesting the potential inclusion of more specific descriptors for steric hindrance or conformational energy.

3.3. Mechanistic Interpretation, Molecular Design Guidance and Future Perspectives

The primary contribution of this work extends beyond the robust predictive performance of the QSPR model on a dataset of this size but, it lies in the model’s exceptional interpretability, which provides direct molecular-level insights. The parsimonious 7-variable model, derived from a rigorous GA-MLR framework, deciphers the key physicochemical factors governing Tg. This demonstrates that a strategically developed, interpretable model on a focused dataset can serve as a highly reliable and informative tool for guiding molecular design, even without an extensive, large training set.
By synthesizing the interpretations of the key descriptors, we can distill concrete design principles for PIs. To achieve a high-Tg material, priority should be given to structures that enhance chain rigidity and promote efficient packing. This involves incorporating linear, rigid aromatic and multi-cyclic monomers (e.g., pyromellitic dianhydride, naphthalene-based units) which are associated with low Chi0n values, and introducing strong polar groups (e.g., carbonyl) or moieties capable of forming hydrogen bonds, which will increase descriptors like MinPartialCharge. These strategies collectively strengthen the cohesive energy density and suppress the FFV, thereby raising the energy barrier for segmental motion. Conversely, to obtain a lower Tg for improved processability, the molecular design should introduce structural features that disrupt chain packing and increase free volume. This can be achieved by introducing flexible linkages (e.g., ether bonds, methylene chains) or branched, three-dimensional topologies, which lead to higher Chi0n values, and incorporating bulky, polarizable side groups (e.g., -CF3) that increase SMR_VSA7 and PEOE_VSA7/8 values, thereby introducing steric hindrance. These modifications increase the available free volume and enhance chain segment mobility. Thus, our model transitions from a predictive tool to a prescriptive guide, offering a quantitative and interpretable framework for the rational design of PIs.
It is also important to acknowledge the model’s current limitations and its potential for further enhancement. First, compared with some better prediction results [7,44], the small dataset we used limited the generalization and robustness of the model to some extent. To address this issue, we used all experimentally derived Tg values into dataset and adopted LOOCV technology to mitigate the impact of limited data on model generalization. Second, incorporating chemical structures and more physics-based descriptors during the creation of representative tests [17,45] can enhance system evaluation and facilitate the analysis of key features. Future work will focus on building a community-shared polyimide database to facilitate this. Beyond data volume, the model’s interpretability opens several exciting avenues: (1) Integrating descriptors from quantum mechanical calculations to capture more subtle electronic effects; (2) Employing inherently interpretable geometric deep learning models that operate directly on molecular graphs to automate feature extraction for complex structure; (3) Utilizing the established QSPR as a rapid screening tool in multi-objective optimization workflows, simultaneously balancing Tg with other critical properties like modulus or dielectric constant.

4. Conclusions

In this study, we have developed a robust and highly interpretable QSPR model for predicting the Tg of PIs. The parsimonious 7-variable model, established through a GA-MLR framework and based on only seven molecular descriptors, achieves an optimal balance between predictive accuracy ( R T e s t 2 = 0.74) and generalization ability. Its primary contribution lies in the deep physicochemical interpretation of these descriptors, which coherently demonstrates that Tg is governed by molecular features that modulate chain mobility through the control of fractional free volume. This occurs through two primary pathways: the introduction of topological branching (captured by descriptors like Chi0n) that increases free volume and reduces Tg, and the enhancement of intermolecular interactions (reflected by descriptors such as MinPartialCharge) that can restrict chain mobility. This mechanistic understanding translates into clear molecular design principles: incorporating rigid, linear structures for high-Tg applications, and introducing branched topologies or flexible linkages to enhance processability. Our work establishes a reliable and transparent computational tool for accelerating the development of advanced PIs, underscoring the critical value of interpretable ML in bridging data-driven prediction with fundamental materials science.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ma18245541/s1, Figure S1: Experimental Tg value distribution of PIs; Table S1: The chemical structures (represented by SMILES), the reported Tg value and descriptors of PIs from reference; Table S2: Comparison of common Tg determination methods in MD simulation; Table S3: Descriptors and statistical analysis parameters associated with each model; Table S4: 7-variable model descriptors and their interpretations.

Author Contributions

Conceptualization, methodology, software, formal analysis, resources, writing—original draft preparation, writing—review and editing, funding acquisition, T.C.; validation, investigation, data curation, writing—original draft preparation, H.L.; visualization, X.L.; supervision, project administration, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the Funding by Science and Technology Projects in Guangzhou (2025A04J3832), and National Natural Science Foundation of China (52001068).

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Y.; Sun, G.; Zhou, Y.; Liu, G.; Wang, J.; Han, S. Progress in low dielectric polyimide film—A review. Prog. Org. Coat. 2022, 172, 107103. [Google Scholar] [CrossRef]
  2. Rodriguez, K.M.; Lin, S.; Wu, A.X.; Storme, K.R.; Joo, T.; Grosz, A.F.; Roy, N.; Syar, D.; Benedetti, F.M.; Smith, Z.P. Penetrant-induced plasticization in microporous polymer membranes. Chem. Soc. Rev. 2024, 53, 2435–2529. [Google Scholar] [CrossRef] [PubMed]
  3. Liu, X.-J.; Zheng, M.-S.; Chen, G.; Dang, Z.-M.; Zha, J.-W. High-temperature polyimide dielectric materials for energy storage: Theory, design, preparation and properties. Energy Environ. Sci. 2022, 15, 56–81. [Google Scholar] [CrossRef]
  4. Zhang, X.; Li, X.; Li, L.; Shi, T. Effect of methyl trifluoride substitution on colorless transparency of polyimide: A DFT/TD-DFT study. J. Mol. Liq. 2024, 411, 125691. [Google Scholar] [CrossRef]
  5. Lei, H.; Bao, F.; Peng, W.; Qiu, L.; Zou, B.; Huang, M. Torsion effect of the imide ring on the performance of transparent polyimide films with methyl-substituted phenylenediamine. Polym. Chem. 2022, 13, 6606–6613. [Google Scholar] [CrossRef]
  6. Lin, D.; Li, R.; Liu, Y.; Qi, S.; Wu, D. Clarifying the effect of moisture absorption and high-temperature thermal aging on structure and properties of polyimide film at molecular dynamic level. Polymer 2021, 214, 123251. [Google Scholar] [CrossRef]
  7. Qiu, H.; Qiu, X.; Dai, X.; Sun, Z.Y. Design of polyimides with targeted glass transition temperature using a graph neural network. J. Mater. Chem. C 2023, 11, 2930–2940. [Google Scholar] [CrossRef]
  8. Zhou, S.W.; Yu, C.; Chen, M.; Shi, C.Y.; Cu, R.; Qu, D.H. Self-healing and shape-shifting polymers controlled by dynamic bonds. Smart Mol. 2023, 1, e20220009. [Google Scholar] [CrossRef]
  9. Jansen, S.A.H.; Vantomme, G.; Meijer, E.W. Evolving data-driven strategies for the characterization of supramolecular polymers and systems. Angew. Chem. Int. Ed. 2025, 64, e202509122. [Google Scholar] [CrossRef]
  10. Zhen, Z.; Potta, T.; Christensen, M.D.; Narayanan, E.; Kanagal, K.; Breneman, C.M.; Rege, K. Accelerated materials discovery using chemical informatics investigation of polymer physicochemical properties and transgene expression efficacy. ACS Biomater. Sci. Eng. 2019, 5, 654–669. [Google Scholar] [CrossRef]
  11. Ni, X.; Amamoto, Y.; Kikuchi, J. Simultaneous multimodal and multitask strategies for diverse biodegradable polymers powered by NMR data science. Sustain. Mater. Tech. 2025, 47, e01781. [Google Scholar] [CrossRef]
  12. Ohshida, T.; Wada, T.; Kaneuchi, T.; Ohno, E.; Taniike, T. Data-driven phase analysis of immiscible polymer blends via multi-parameter AFM imaging. Polymer 2026, 343, 129391. [Google Scholar] [CrossRef]
  13. Miccio, L.A.; Schwartz, G.A. From chemical structure to quantitative polymer properties prediction through convolutional neural networks. Polymer 2020, 193, 122341. [Google Scholar] [CrossRef]
  14. Yu, M.; Shi, Y.; Jia, Q.; Wang, Q.; Luo, Z.H.; Yan, F.; Zhou, Y.N. Ring repeating unit: An upgraded structure representation of linear condensation polymers for property prediction. J. Chem. Inf. Model. 2023, 63, 1177−1187. [Google Scholar] [CrossRef]
  15. Tao, L.; Varshney, V.; Li, Y. Benchmarking machine learning models for polymer informatics: An example of glass transition temperature. J. Chem. Inf. Model. 2021, 61, 5395−5413. [Google Scholar] [CrossRef] [PubMed]
  16. Zhang, S.; He, X.; Xia, X.; Xiao, P.; Wu, Q.; Zheng, F.; Lu, Q. Machine-learning-enabled framework in engineering plastics discovery: A case study of designing polyimides with desired glass-transition temperature. ACS Appl. Mater. Interfaces 2023, 15, 37893–37902. [Google Scholar] [CrossRef]
  17. Zhang, T.; Wang, S.; Chai, Y.; Yu, J.; Zhu, W.; Li, L.; Li, B. Prediction and interpretability study of the glass transition temperature of polyimide based on machine learning with quantitative structure−property relationship (Tg–QSPR). J. Phys. Chem. B 2024, 128, 8807–8817. [Google Scholar] [CrossRef]
  18. Xu, P.; Ji, X.; Li, M.; Lu, W. Small data machine learning in materials science. npj Comput. Mater. 2023, 9, 42. [Google Scholar] [CrossRef]
  19. Audus, D.J.; de Pablo, J.J. Polymer informatics: Opportunities and challenges. ACS Macro Lett. 2017, 6, 1078–1082. [Google Scholar] [CrossRef]
  20. Ding, M. Isomeric polyimides. Prog. Poly. Sci. 2007, 32, 623–668. [Google Scholar] [CrossRef]
  21. Weininger, D. SMILES, A chemical language and information system. J. Chem. Inf. Comp. Sci. 1988, 28, 31–36. [Google Scholar] [CrossRef]
  22. Gozalbes, R.; Doucet, J.P.; Derouin, F. Application of topological descriptors in QSAR and drug design: History and new trends. Curr. Drug Targets-Infect. Disord. 2002, 2, 93–102. [Google Scholar] [CrossRef] [PubMed]
  23. Wen, C.; Liu, B.; Wolfgang, J.; Long, T.E.; Odle, R.; Cheng, S. Determination of glass transition temperature of polyimides from atomistic molecular dynamics simulations and machine-learning algorithms. J. Polym. Sci. 2020, 58, 1521–1534. [Google Scholar] [CrossRef]
  24. Joseph, V.R.; Vakayil, A. SPlit: An optimal method for data splitting. Technometrics 2022, 64, 166–176. [Google Scholar] [CrossRef]
  25. Devillers, J. Genetic Algorithms in Molecular Modeling; Academic Press: Cambridge, MA, USA, 1996. [Google Scholar]
  26. Ahmed, L.; Rasulev, B.; Turabekova, M. Receptor- and ligand-based study of fullerene analogues: Comprehensive computational approach including quantum-chemical, QSAR, and molecular docking simulations. Org. Biomol. Chem. 2013, 11, 5798–5808. [Google Scholar] [CrossRef]
  27. Karuth, A.; Alesadi, A.; Xia, W.; Rasulev, B. Predicting glass transition of amorphous polymers by application of cheminformatics and molecular dynamics simulations. Polymer 2021, 218, 123495. [Google Scholar] [CrossRef]
  28. Rasulev, B.F.; Saidkhodzhaev, A.I.; Nazrullaev, S.S.; Akhmedkhodzhaeva, K.S.; Khushbaktova, Z.A.; Leszczynski, J. Molecular modelling and QSAR analysis of the estrogenic activity of terpenoids isolated from ferula plants. SAR QSAR Environ. Res. 2007, 18, 663–673. [Google Scholar] [CrossRef] [PubMed]
  29. Wong, T.-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015, 48, 2839–2846. [Google Scholar] [CrossRef]
  30. Ho, S.Y.; Phua, K.; Wong, L.; Goh, W.W.B. Extensions of the external validation for checking learned model interpretability and generalizability. Patterns 2020, 1, 100129. [Google Scholar] [CrossRef]
  31. Hawkins, D.M. The problem of overfitting. J. Chem. Inf. Comp. Sci. 2004, 44, 1–12. [Google Scholar] [CrossRef]
  32. Zhang, H.; Li, H.Y.; Xin, H.S.; Zhang, J.H. Property prediction and structural feature extraction of polyimide materials based on machine learning. J. Chem. Inf. Model. 2023, 63, 5473–5483. [Google Scholar] [CrossRef]
  33. Tropsha, A.; Gramatica, P.; Gombar, V.K. The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 2003, 22, 69–77. [Google Scholar] [CrossRef]
  34. Gasteiger, J.; Marsili, M. Iterative partial equalization of orbital electronegativity—A rapid access to atomic charges. Tetrahedron 1980, 36, 3219–3228. [Google Scholar] [CrossRef]
  35. Toots, K.M.; Sild, S.; Leis, J.; Acree, W.E., Jr.; Maran, U. The quantitative structure-property relationships for the gas-ionic liquid partition coefficient of a large variety of organic compounds in three ionic liquids. J. Mol. Liq. 2021, 343, 117573. [Google Scholar] [CrossRef]
  36. Yang, A.; Sun, S.; Mi, H.; Wang, W.; Liu, J.; Kong, Z.Y. Interpretable feedforward neural network and XGBoost-based algorithms to predict CO2 solubility in ionic liquids. Ind. Eng. Chem. Res. 2024, 63, 8293–8305. [Google Scholar] [CrossRef]
  37. Luo, S.; Li, Y.; Li, N.; Cao, Z.; Zhang, S.; Ocheje, M.U.; Gu, X.; Rondeau-Gagne, S.; Xue, G.; Wang, S.; et al. Real-time correlation of crystallization and segmental order in conjugated polymers. Mater. Horiz. 2024, 11, 196–206. [Google Scholar] [CrossRef]
  38. Yang, M.; Zhou, L.; Li, X.; Ren, W.; Shen, Y. Polyimides physically crosslinked by aromatic molecules exhibit ultrahigh energy density at 200 °C. Adv. Mater. 2023, 35, 2302392. [Google Scholar] [CrossRef] [PubMed]
  39. Labute, P. A widely applicable set of descriptors. J. Mol. Graph. Model. 2000, 18, 464–477. [Google Scholar] [CrossRef] [PubMed]
  40. Wan, J.; Zhang, L.; Yang, G.; Zhan, C.G. Quantitative structure−activity relationship for cyclic imide derivatives of protoporphyrinogen oxidase inhibitors: A study of quantum chemical descriptors from density functional theory. J. Chem. Inf. Comp. Sci. 2004, 44, 2099–2105. [Google Scholar] [CrossRef] [PubMed]
  41. Burden, F.R. Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 1989, 29, 225–227. [Google Scholar] [CrossRef]
  42. Du, H.; Wang, J.; Hui, J.; Zhang, L.; Wang, H. DenseGNN: Universal and scalable deeper graph neural networks for highperformance property prediction in crystals and molecules. npj Comput. Mater. 2024, 10, 292. [Google Scholar] [CrossRef]
  43. Krevelen, D.W.; Te Nijenhuis, K. Properties of Polymers: Their Correlation with Chemical Structure; Their Numerical Estimation and Prediction from Additive Group Contributions; Elsevier: Amsterdam, The Netherlands, 2009. [Google Scholar]
  44. Zhang, B.; Li, X.; Xu, X.; Cao, J.; Zeng, M.; Zhang, W. Multi-property prediction and high-throughput screening of polyimides: An application case for interpretable machine learning. Polymer 2024, 312, 127603. [Google Scholar] [CrossRef]
  45. Volgin, I.V.; Batyr, P.A.; Matseevich, A.V.; Dobrovskiy, A.Y.; Andreeva, M.V.; Nazarychev, V.M.; Larin, S.V.; Goikhman, M.Y.; Vizilter, Y.V.; Askadskii, A.A.; et al. Machine learning with enormous “Synthetic” data sets: Predicting glass transition temperature of polyimides using graph convolutional neural networks. ACS Omega 2022, 7, 43678–43691. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.