Article

Prediction of Chloride Diffusion Coefficient in Concrete by Micro-Structural Parameters Based on the MLP Method by Considering Data Missing and Small Sample in Database

1 College of Civil Engineering, Zhejiang University of Technology, Hangzhou 310014, China
2 Zhejiang Academy of Building Research & Design Co., Ltd., Hangzhou 310025, China
3 Laboratory of Green Construction and Intelligent Operation & Maintenance for Coastal Infrastructure, Zhejiang University of Technology, Hangzhou 310014, China
4 School of Transportation, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
* Author to whom correspondence should be addressed.
Buildings 2026, 16(3), 513; https://doi.org/10.3390/buildings16030513
Submission received: 16 December 2025 / Revised: 15 January 2026 / Accepted: 17 January 2026 / Published: 27 January 2026
(This article belongs to the Special Issue Geopolymers and Low Carbon Building Materials for Infrastructures)

Abstract

Chloride diffusivity of concrete is essentially determined by its microstructural parameters. Establishing a reliable and accurate prediction model for chloride diffusion has become a research hotspot. In this study, a database containing 144 sets of macro–micro property parameters of concrete is established to train a Multilayer Perceptron (MLP) model. Taking the originally collected data as a benchmark, entries are randomly removed to simulate data incompleteness, and the models are trained using data filled by the Lagrange, K-Nearest Neighbor (KNN), and Miceforest methods. Moreover, the original data are expanded by the virtual sample generation (VSG) algorithm, based on a Gaussian mixture model (GMM) that fits the joint probability distribution of the original data to generate virtual samples preserving statistical (mean, standard deviation) and physical (e.g., porosity range, pore size ratio) consistency, thus mitigating the randomness caused by small sample sizes. Results indicate that the MLP model demonstrates excellent predictive performance: among schemes handling missing data, the model preprocessed by normalization with KNN imputation yields the best results, with a testing R2 of 0.78; the baseline model (without missing value filling, normalized) achieves a testing R2 of 0.83, MAE of 0.572, and MSE of 0.424. VSG-expanded data significantly enhance the MLP model's prediction accuracy. When expanding to 3000 groups, the testing R2 reaches 0.85, a 2.4% increase compared to 1000 groups, with further improvements as the dataset expands, confirming the feasibility of the VSG algorithm for small-sample scenarios.

1. Introduction

Concrete is an inhomogeneous porous material with internal pores of varying shapes and sizes [1]. External media, such as water, CO2 and chloride ions, can penetrate the interior of concrete through microscopic cracks or pores. If the concrete is dense, external media penetrate slowly and the material exhibits low permeability, indicating that medium transport in concrete depends on its microstructure [2]. Recent studies have further demonstrated that cyclic mechanical loading can significantly modify the microstructural characteristics of concrete (such as pore coarsening and crack opening), which subsequently accelerates chloride-ion transport under service conditions [3]. Chloride attack is the main cause of structural deterioration in reinforced concrete structures. Establishing a nonlinear relationship between the microscopic pore structure and the macroscopic chloride diffusivity of concrete has therefore become a research hotspot. Microstructure-informed numerical models have also been developed to simulate chloride transport by explicitly reconstructing pore structures and computing effective diffusivity through multiscale simulation frameworks [4]. Previous researchers have pointed out that microstructural parameters such as porosity, pore size distribution, pore connectivity and tortuosity correlate well with the chloride diffusion coefficient of concrete [5,6,7], and some mathematical-physical models or empirical formulas have been established and verified to a certain extent in practice [8,9,10]. Porosity, the volume fraction of pores in concrete, directly determines the number of chloride transport channels: higher porosity provides more diffusion paths and thus higher chloride diffusivity. Pores are categorized by size: gel pores hinder chloride diffusion due to their small size and high tortuosity, while capillary pores and macropores accelerate diffusion by providing direct channels. Tortuosity is defined as the ratio of the actual chloride transport path length to the straight-line distance, reflecting pore connectivity; higher tortuosity lengthens transport paths, reducing the chloride diffusion coefficient. Notably, these parameters do not act independently; synergistic interactions (e.g., high porosity combined with low tortuosity vs. low porosity with high tortuosity) lead to complex nonlinear relationships between the microstructure and the chloride diffusion coefficient. Recent reviews have highlighted a growing research interest in incorporating environmental exposure factors into chloride transport modeling to capture time-varying surface chloride concentrations [11,12].
It is worth noting that idealized or simplified mathematical-physical models cannot adequately describe the relationship between the complex microstructure of concrete and its macroscopic properties. Moreover, most empirical formulas are only quantitatively valid for specific concrete mixtures, and universal results are not easy to achieve. In recent years, the rapid development of artificial intelligence technology has provided new ideas for predicting the durability of concrete, and machine learning methods have been widely used due to their high prediction accuracy and fast computation [13]. Contemporary machine learning strategies, especially ensemble learning and data-driven optimization, have been shown to markedly enhance prediction accuracy for chloride diffusion by capturing nonlinear deterioration patterns [14]. The Artificial Neural Network (ANN) is one of the most widely used machine learning methods. Recent work has extended ANN-based approaches to incorporate coupled deterioration mechanisms, such as mechanical loading and freeze–thaw cycles, exhibiting robust performance under variable environmental conditions [15]. Liu et al. [16] established a database containing 653 groups of chloride diffusion coefficients through a literature review, constructed a prediction model of the chloride diffusion coefficient based on the ANN method, and validated the model by statistical analysis. Boga et al. [17] established a database of concrete containing ground-granulated blast furnace slag (GGBFS) and calcium nitrite-based corrosion inhibitor (CNI) based on 162 groups of experimental data, and developed prediction models for concrete compressive strength, splitting tensile strength and electric flux based on the ANN and Adaptive Neuro-Fuzzy Inference System (ANFIS) methods. Ahmad et al. [18] predicted the surface chloride concentration of concrete by Gene Expression Programming (GEP), Decision Tree (DT), and ANN, based on data with 12 input parameters. Most of these studies selected concrete mix components as the main input variables, which limits the general applicability of the trained machine learning models. Additionally, an ANN with only one hidden layer performs well on simple or narrowly defined problems but may struggle with complex problems due to insufficient generalization ability [19]. Robust chloride diffusion prediction under mechanical and thermal actions requires deeper architectures, as shallow networks often struggle with interacting deterioration mechanisms [15]. Deep Neural Networks (DNN) can solve complex pattern classification problems with good generalization ability [20]. Typical DNN methods include the Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), and Deep Belief Networks (DBN). Among them, the MLP is the most basic DNN structure; adding convolutional layers yields CNNs, and training with backpropagation yields BP neural networks. With more layers and neurons, the MLP gains capacity to extract informative features and has been widely applied in civil engineering.
The size and quality of databases have significant influences on the prediction results of machine learning methods [21]. When predicting chloride diffusivity from the microstructural parameters of concrete by machine learning, a large number of experiments is required. Due to time and cost constraints, the availability of effective data is very limited. Microstructural parameters rely on precision instruments with strict operational requirements, and testing errors or instrument downtime often lead to missing data points. Chloride diffusivity modeling studies have similarly demonstrated that small datasets substantially constrain model reliability, rendering virtual sample generation indispensable [22]. Moreover, different scholars have different research focuses, which degrades the quality and completeness of the data and leads to insufficient accuracy of machine learning models. Cross-study data integration to expand the sample size is hindered by these diverse research focuses, as different studies prioritize different parameters, leaving critical microstructural data incomplete. Limited data lead to insufficient representation of the parameter space, causing models to overfit to noise or specific subgroups, while incomplete data disrupt the correlation between microstructural parameters and chloride diffusivity. Simple handling such as deleting missing records further reduces the sample size, while improper imputation such as mean filling distorts true relationships. These challenges highlight the need for targeted strategies (e.g., robust imputation, virtual sample generation) to improve model reliability in concrete durability research. Missing value filling methods can complete the incomplete data, helping machine learning models finish training [23]. Virtual data generation techniques can generate a sufficiently large, good-quality database at relatively low cost [24,25]. Chloride modeling reviews underscore the necessity of synthetic data for more accurately capturing the variability in marine surface chloride deposition [26]. The generated virtual database can be combined with machine learning methods to form a complete and general small-sample machine learning framework, improving the accuracy of models built on small samples. Data augmentation methods also assist ML-MICP integrated durability studies, where biological treatment mechanisms produce inherently small datasets [27].
Despite the progress in existing studies, several critical research gaps remain unaddressed, highlighting the need for the current work. First, most ML models for chloride diffusivity prediction rely on concrete mix proportions as inputs [16,18], which are project-specific and limit generalizability. In contrast, microstructural parameters (porosity, pore size distribution, tortuosity), the direct determinants of chloride transport [7], are underutilized as core inputs, especially in small-dataset scenarios. Second, small sample sizes and missing microstructural data are prevalent in concrete durability research [22] due to the high cost and long duration of exposure tests, but few studies integrate both imputation and virtual sample generation to mitigate these dual challenges. Third, shallow ANN or traditional ML methods (e.g., SVM, random forests) used in previous work struggle to capture the nonlinear interactions between multiple microstructural parameters [15], leading to insufficient prediction accuracy. Fourth, existing studies do not clearly justify why a specific ML model is better suited for microstructural parameter-based prediction, particularly in the context of small samples or missing data.
MLP is uniquely suited to the current study due to its superior nonlinear fitting of interacting microstructural parameters, its compatibility with small samples and VSG expansion, and its robustness to noise in imputed data [28,29]. To fill these gaps, this study proposes an MLP-based prediction model for concrete chloride diffusivity, with distinct advantages over previous methods. (1) It uses microstructural parameters (porosity, pore size distribution over different ranges, exposure time) as core inputs, ensuring broader generalizability across concrete types. (2) It integrates the Lagrange, KNN, and Miceforest imputation methods with VSG-based data expansion to simultaneously address small samples and missing data, an approach rarely reported in the existing literature. (3) It adopts an MLP with two hidden layers, which outperforms shallow ANNs and traditional ML methods in capturing complex transport-governing relationships. (4) MLP's flexibility, combined with VSG expansion, effectively reduces the randomness caused by small datasets, enhancing model robustness. This study thus aims to provide a more reliable and practical prediction framework for concrete chloride diffusivity, addressing key unmet needs in existing research. A database containing 144 sets of macro–micro property parameters of concrete was established based on exposure tests in a marine environment and simulated experiments in a laboratory environment [30,31,32,33,34]. Among these, 10 sets were randomly selected as testing samples and the remaining 134 sets were used as training samples. Taking the originally collected data as a benchmark, entries are randomly removed to simulate data incompleteness; this treatment aligns with recent chloride diffusivity ML studies that generated artificial missing-data scenarios to test model robustness [22]. The models are then trained on data filled by the different algorithms, and the effect of the missing value filling method on prediction accuracy is compared. Moreover, the original data are expanded by the virtual sample generation (VSG) algorithm, and the feasibility of the VSG algorithm is verified, with the overall goal of improving the prediction accuracy of chloride diffusivity in concrete under small-sample and missing-data conditions.

2. Methods

2.1. MLP

2.1.1. Basic Concepts

As a type of ANN, the MLP simulates the highly parallel connection and transmission structure between neurons in the biological brain [35]. Its architecture is flexible but generally consists of an input layer, one or more hidden layers, and an output layer. Each hidden layer can contain several neurons, with full connectivity between adjacent layers, and each connection carries its own weight. Figure 1 shows the schematic diagram of the MLP model. Contemporary machine learning-based durability research has revealed that rational optimization of MLP network structure and depth can substantially improve performance in capturing nonlinear chloride transport mechanisms, particularly when microstructure-related variables are adopted as model inputs [14,36].
According to the previous research results, a model containing two hidden layers with 16 neurons in each hidden layer is used in this paper. Supposing the input layer variables to be $x_1, x_2, \ldots, x_n$, the output of one neuron can be represented by Equation (1):

$$y = f\left(\sum_{i=1}^{n} \omega_i x_i + b\right) = f\left(\boldsymbol{W}^{T}\boldsymbol{X} + b\right) \tag{1}$$
where $y$ is the output of the neuron, $f(\cdot)$ is the nonlinear activation function, $\omega_i$ is the weight of the $i$th input variable $x_i$, and $b$ is the bias.
Prior to finalizing the MLP architecture, a systematic sensitivity analysis was conducted to evaluate the impact of network depth and width on prediction accuracy. The tested combinations covered 1–4 hidden layers and 8, 16, 32, or 64 neurons per layer. Performance was quantified using the comprehensive indicator OBJnew (introduced in Section 2.1.4) and 10-fold cross-validation. The results showed that (1) for one or two hidden layers, increasing the neurons from 8 to 16 improved OBJnew by 10.2–12.3%, but further increasing to 32 or 64 neurons caused overfitting; and (2) a single hidden layer yielded an R2 of 0.68, while three or four hidden layers led to overfitting. Thus, the two-hidden-layer (16 neurons each) architecture was selected for its optimal balance of complexity and generalization.
To mitigate overfitting under small-sample conditions, three measures were implemented: (1) 10-fold cross-validation to reduce sampling bias; (2) adoption of the SGD optimizer to avoid over-parameterization; (3) performance consistency checks across cross-validation folds, confirming that the selected architecture outperforms deeper networks in testing accuracy. This aligns with SSI-related studies, where shallow networks are preferred for small datasets to avoid overfitting to complex but sparse interaction patterns.
The Sigmoid, Tanh, and ReLU functions are the commonly used activation functions. The Sigmoid function is a popular and traditional nonlinear function that squashes the output into the range [0, 1]. Its output saturates for large positive and negative inputs, which leads to the vanishing gradient problem. The Tanh function is similar to the Sigmoid function while exhibiting the zero-centric property; it also squashes the inputs, but into the range [−1, 1]. It is worth noting that the Sigmoid and Tanh functions both suffer from vanishing gradients, which hinders model training. The ReLU function is a simple function that is the identity for positive inputs and zero for negative inputs, i.e., its range is [0, ∞). The ReLU function avoids the computational complexity of the Sigmoid and Tanh functions; its downside is that the gradient vanishes for negative inputs. On balance, the ReLU function is chosen as the activation function in this paper.
The propagation process of the MLP includes two stages: feedforward and backpropagation. In the feedforward stage, starting from the input layer, the outputs of all neurons in the previous layer are weighted, summed, and used as the input of each neuron in the next layer, finally propagating to the output layer. After the propagation from the input layer to the output layer is complete, the weights of each neuron are updated through the loss function and optimizer so as to minimize the deviation between the predicted and actual values. Batch Gradient Descent (BGD), Stochastic Gradient Descent (SGD), and Adaptive Moment Estimation (Adam) are widely used optimizers. Due to its simplicity, lower computational cost, and fewer parameter adjustments, SGD is chosen as the optimizer in this paper.
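For illustration, a minimal sketch of this architecture using scikit-learn's MLPRegressor is given below; the paper does not name its implementation library, so the library choice and the learning rate are assumptions, while the layer sizes, activation function, and optimizer follow the text.

```python
# Sketch of the MLP described above: two hidden layers of 16 neurons,
# ReLU activation, SGD optimizer; sklearn minimizes squared error for
# regression. max_iter and learning_rate_init are illustrative values.
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(16, 16),  # two hidden layers, 16 neurons each
    activation="relu",            # ReLU, chosen to avoid vanishing gradients
    solver="sgd",                 # stochastic gradient descent optimizer
    learning_rate_init=0.01,      # assumed learning rate
    max_iter=3000,                # cap on training epochs
    random_state=0,               # fixed seed for reproducibility
)
```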

2.1.2. Variable Identification and Data Preprocessing

According to the test data results, the input variables for the MLP model are set to be (1) microstructural parameters of concrete, including porosity, contributing porosity of different pore sizes (<20 nm, 20–50 nm, 50–200 nm, >200 nm); (2) control variables, i.e., exposure time, and chloride diffusion coefficient is set as the output variable. Microstructure-resolved numerical chloride models likewise emphasize selecting pore-scale descriptors such as porosity distribution, connectivity, and ITZ width to improve predictive accuracy [4].
Exposure time is treated as a control variable in the current study, primarily due to the discrete and uneven time intervals (3 months, 6 months, 1 year, 3 years, 5 years) in the original dataset. The sparse sequential data is insufficient to train time-aware learning architectures that require continuous time-step observations to capture temporal dependencies. However, the MLP model still effectively captures the time-dependent trend of chloride diffusivity, with a Pearson correlation coefficient of 0.76 between predicted diffusivity and exposure time in the testing set.
As recommended by recent studies [37], time-aware architectures hold significant potential for improving dynamic prediction accuracy by explicitly modeling the sequential evolution of chloride diffusion. Future work will expand the dataset with high-frequency sequential measurements to support LSTM/temporal CNN training, enabling the capture of cumulative environmental effects and microstructural degradation over time.
To avoid large gaps among different parameter values and improve training accuracy, the input and output variables should be preprocessed, which can be realized by standardization, interval scaling, or normalization. In this paper, the data are processed in batches of one-tenth of the training samples, using standardization and normalization to avoid numerical problems caused by large order-of-magnitude differences in the data, improve the convergence rate, and prevent saturation of neurons [38]. The normalized processing is shown in Equation (2):
$$y_n = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{2}$$
where $y_n$ is the normalization result, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values in the sample, respectively, and $x$ is the sample value to be normalized.
After inputting the normalization results into the model, the obtained output needs to be inversely normalized with the formula shown in Equation (3):
$$y'_n = y_n\left(x_{\max} - x_{\min}\right) + x_{\min} \tag{3}$$
where $y_n$ is the model output in the normalized space, and $y'_n$ is the actual value after the inverse normalization.
The standardized processing is shown in Equation (4):
$$y_s = \frac{x - \mu}{\sigma} \tag{4}$$
where $y_s$ is the standardization result, $x$ is the sample value to be standardized, $\mu$ is the mean value of the sample, and $\sigma$ is the standard deviation of the sample.
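As a sketch, the two preprocessing options in Equations (2)–(4) map directly onto scikit-learn's scalers; the array shape and values below are placeholders.

```python
# Normalization (Eq. (2)), inverse normalization (Eq. (3)), and
# standardization (Eq. (4)) on placeholder data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.rand(134, 6)               # stand-in for the training data

norm = MinMaxScaler()                     # (x - x_min) / (x_max - x_min)
X_norm = norm.fit_transform(X)
X_back = norm.inverse_transform(X_norm)   # y_n * (x_max - x_min) + x_min

std = StandardScaler()                    # (x - mu) / sigma
X_std = std.fit_transform(X)
```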

2.1.3. Data Division

The training and testing sets are partitioned using a 90-10 ratio, with the original 144 samples divided into 134 training samples (90%) and 10 testing samples (10%). This ratio is determined by the small dataset characteristics: a 90% training ratio ensures sufficient data for the MLP model to learn nonlinear microstructural–diffusivity relationships, while the 10% testing set (stratified randomly) preserves the proportional distribution of key variables to reliably evaluate generalization [39].
To reduce overfitting and minimize sampling bias, the 134 training samples are further validated using 10-fold cross-validation, implemented as follows (a code sketch is given after the list):
(1)
The training set is randomly split into 10 mutually exclusive folds, maintaining the same proportion of exposure time and microstructural types as the full training set.
(2)
For each iteration, 9 folds are used to train the MLP model, and 1 fold serves as the validation subset to evaluate performance. This process is repeated 10 times, with each fold used as the validation subset exactly once.
(3)
Model hyperparameters are optimized based on average validation performance across all folds, ensuring the model balances complexity and generalization.
(4)
The final training performance is reported as the average of the metrics across all 10 validation subsets, reducing randomness from a single split and improving result robustness [40,41].
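A minimal sketch of this procedure, with placeholder arrays standing in for the 134 training samples; the fold count, architecture, and R2 metric follow the text, and everything else is illustrative.

```python
# 10-fold cross-validation of the MLP on the training set.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

X_train = np.random.rand(134, 6)   # placeholder inputs
y_train = np.random.rand(134)      # placeholder chloride diffusivities

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for tr_idx, val_idx in kf.split(X_train):
    model = MLPRegressor(hidden_layer_sizes=(16, 16), activation="relu",
                         solver="sgd", max_iter=3000, random_state=0)
    model.fit(X_train[tr_idx], y_train[tr_idx])     # train on 9 folds
    pred = model.predict(X_train[val_idx])          # validate on the held-out fold
    scores.append(r2_score(y_train[val_idx], pred))

print(f"mean validation R2: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```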

2.1.4. Model Performance Evaluation Indicators

The validity of the model predictions is verified by different statistical parameters, such as the coefficient of determination R2, mean absolute error MAE, and mean square error MSE; the corresponding formulas are as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2} \tag{5}$$
$$MAE = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right| \tag{6}$$
$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2 \tag{7}$$
where $m$ is the number of samples, and $\hat{y}_i$, $y_i$, and $\bar{y}$ represent the model prediction, the true value, and the sample mean, respectively. Generally speaking, the smaller the MSE and MAE and the closer R2 is to 1, the better the predictive performance of the model. However, single traditional metrics have inherent limitations: R2 only reflects correlation without considering error magnitude, MSE overemphasizes extreme errors, and MAE ignores model fit, often leading to conflicting conclusions when comparing multiple models. For example, a model with a higher R2 may have a larger MAE, making it difficult to determine overall superiority. Additionally, traditional metrics rarely integrate training and testing performance, risking overestimation of generalization ability.
To compare the performance of different models and avoid contradictory conclusions from individual indicators, this paper proposes an improved comprehensive evaluation indicator based on the OBJ indicator of Golafshani et al. [42], as shown in Equation (8). The proposed indicator OBJnew uses MSE instead of the RMSE used in Ref. [42] and assigns the same weight to training and testing samples.
$$OBJ_{\mathrm{new}} = \frac{1}{2}\cdot\frac{MAE_{\mathrm{tra}} + MSE_{\mathrm{tra}}}{1 + R^2_{\mathrm{tra}}} + \frac{1}{2}\cdot\frac{MAE_{\mathrm{val}} + MSE_{\mathrm{val}}}{1 + R^2_{\mathrm{val}}} \tag{8}$$
where the subscripts tra and val represent the training sample and testing sample, respectively. From Equation (8), it can be observed that the decreasing MAE and MSE and the increasing R2 can both lead to a decreasing value of the OBJnew indicator. Therefore, a smaller value of OBJnew refers to a better model prediction performance, balancing fit, error control, and generalization ability, making it more effective than traditional single metrics for comparing the complex schemes (imputation, preprocessing, data expansion) in this study.
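Equation (8) transcribes directly into code; the function and variable names below are our own.

```python
# Comprehensive indicator OBJ_new (Eq. (8)); smaller values are better.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def obj_new(y_tra, p_tra, y_val, p_val):
    """y_*: true values, p_*: predictions, for the training/testing sets."""
    term_tra = (mean_absolute_error(y_tra, p_tra)
                + mean_squared_error(y_tra, p_tra)) / (1 + r2_score(y_tra, p_tra))
    term_val = (mean_absolute_error(y_val, p_val)
                + mean_squared_error(y_val, p_val)) / (1 + r2_score(y_val, p_val))
    return 0.5 * term_tra + 0.5 * term_val
```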

2.2. Missing Value Filling

Due to improper storage of experimental data, accidents during the experimental process, and other reasons, the collected data may be missing, which leads to an incomplete database and poor data quality. To simulate this scenario, the microstructural parameters within the 134 training sample groups were randomly set as missing at a rate of 40%. The intact data (without missing values) were retained as the training set. The complete dataset was then reconstructed using different imputation methods under two data arrangement schemes: exposure-time arrangement (sorted from shortest to longest exposure time) and random arrangement (sorted randomly). Finally, the corresponding models were trained to analyze their prediction performance.
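As a sketch, the 40% random-masking step could be implemented as follows; the DataFrame and its column names are assumptions for illustration.

```python
# Randomly mask 40% of the microstructural entries to simulate missing data.
import numpy as np
import pandas as pd

micro_cols = ["porosity", "p_lt20", "p_20_50", "p_50_200", "p_gt200"]  # assumed names
df = pd.DataFrame(np.random.rand(134, 5), columns=micro_cols)          # placeholder data

rng = np.random.default_rng(0)
mask = rng.random(df[micro_cols].shape) < 0.40      # True marks a missing cell
df_missing = df.copy()
df_missing[micro_cols] = df[micro_cols].mask(mask)  # masked cells become NaN
```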

2.2.1. Lagrange Filling

The Lagrange polynomial interpolation method constructs, from a set of known points in a two-dimensional plane, the lowest-degree polynomial whose curve passes exactly through all of them.
For n known points in the plane (no two sharing the same x-coordinate), a polynomial of degree n − 1 can be found:
$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} \tag{9}$$
This polynomial is designed so that the corresponding curve passes through these n points, i.e., substituting the coordinates of the n points (x1, y1), (x2, y2) … (xn, yn) into the polynomial:
$$\begin{cases} y_1 = a_0 + a_1 x_1 + a_2 x_1^2 + \cdots + a_{n-1} x_1^{n-1} \\ y_2 = a_0 + a_1 x_2 + a_2 x_2^2 + \cdots + a_{n-1} x_2^{n-1} \\ \quad\vdots \\ y_n = a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_{n-1} x_n^{n-1} \end{cases} \tag{10}$$
Then, the Lagrange polynomial can be solved as follows:
$$L(x) = \sum_{i=0}^{n} y_i \prod_{\substack{j=0 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j} \tag{11}$$
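A minimal sketch of Lagrange filling with SciPy's ready-made interpolator; the known points below are illustrative.

```python
# Build the Lagrange polynomial (Eq. (11)) through known neighboring
# points and evaluate it at the missing position.
import numpy as np
from scipy.interpolate import lagrange

x_known = np.array([1.0, 2.0, 4.0, 5.0])      # e.g., exposure times with data
y_known = np.array([10.2, 11.5, 13.1, 13.8])  # known parameter values

poly = lagrange(x_known, y_known)  # degree n-1 polynomial through the points
y_filled = poly(3.0)               # estimate at the missing position
```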

2.2.2. K-Nearest Neighbor (KNN) Filling

The basic idea of KNN is that, according to the Euclidean distances between samples, the k training samples closest to the test sample in Euclidean space are selected, and the regression value is the mean or weighted mean of these k nearest neighbors. Supposing that there are n training samples, each represented as $X_i = (x_{i1}, x_{i2}, \ldots, x_{id}, y_i)$, $i \le n$, the Euclidean distance $d$ between the training sample $X_i$ and the test sample $X_t = (x_{t1}, x_{t2}, \ldots, x_{td}, y_t)$ can be described as
$$d(X_i, X_t) = \sqrt{\sum_{m=1}^{d}\left(x_{im} - x_{tm}\right)^2 + \left(y_i - y_t\right)^2} \tag{12}$$
Calculate the Euclidean distance between all training samples and the test sample according to Equation (12), and find the first K neighboring samples $X'_j = (x'_{j1}, x'_{j2}, \ldots, x'_{jd}, y'_j)$, $j \le K$, of $X_t$; then, the regression value of the test sample is
$$\hat{y}_t = \frac{1}{K}\sum_{j=1}^{K} y'_j \tag{13}$$
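KNN filling corresponds closely to scikit-learn's KNNImputer, which averages the k nearest complete neighbors as in Equation (13); k = 5 and the placeholder data are illustrative.

```python
# Fill NaNs with the mean of the k nearest neighbors (Euclidean distance).
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df_missing = pd.DataFrame(np.random.rand(134, 5))
df_missing = df_missing.mask(np.random.rand(134, 5) < 0.4)  # placeholder NaNs

imputer = KNNImputer(n_neighbors=5)
filled = imputer.fit_transform(df_missing)  # NaNs replaced by neighbor means
```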

2.2.3. Miceforest Filling

Miceforest is a multiple imputation method based on chained equations with random forests. The basic process of RF regression is to generate P mutually independent sample subsets Li from the training set L by randomized sampling with replacement (bootstrap sampling), as shown in Equation (14):
$$L = \{(X_k, Y_k)\}_{k=1}^{N}, \qquad L_i = \{(X_k, Y_k)\}_{k=1}^{N_b} \quad (i = 1, 2, 3, \ldots, P) \tag{14}$$
where N represents the amount of data in the training set and Nb represents the number of sample subsets.
The Nb (Nb < N) samples in any subset Li are independent and identically distributed, and a regression tree sub-model is built separately on each subset. Unlike the RF classification model, the dependent variable of the RF regression model is numerical. Therefore, for any sample X, the P sub-models produce P predicted values, whose simple average gives the model output $\hat{Y}_E$:
$$\hat{Y}_E = \frac{1}{P}\sum_{k=1}^{P} \hat{Y}_k \tag{15}$$
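A minimal sketch of Miceforest filling, assuming the miceforest Python package; its argument names vary across versions, so this is indicative rather than definitive.

```python
# Multiple imputation by chained equations with random forests.
import numpy as np
import pandas as pd
import miceforest as mf  # assumed to be installed

df_missing = pd.DataFrame(np.random.rand(134, 5),
                          columns=[f"f{i}" for i in range(5)])
df_missing = df_missing.mask(np.random.rand(134, 5) < 0.4)  # placeholder NaNs

kernel = mf.ImputationKernel(df_missing, random_state=0)
kernel.mice(5)                       # run 5 iterations of chained equations
df_filled = kernel.complete_data()   # retrieve a completed dataset
```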

2.2.4. Physical Constraints for Imputed Microstructural Parameters

To ensure that imputed microstructural parameters comply with physical constraints related to pore connectivity, continuity, and tortuosity, the following measures were implemented for all filling methods:
(1)
Pore continuity constraint: The sum of contributive porosities was forced to equal the total porosity of each sample, as total porosity is the aggregate of pores across all size ranges.
(2)
Parameter range constraint: Imputed values were clipped to the physically feasible ranges observed in the original experimental dataset.
(3)
Pore connectivity and tortuosity constraints: Pore connectivity was constrained to 0.13–0.45 (the range of the original dataset), and tortuosity was verified to fall within 1.17–1.35.
For the Miceforest method, a physical constraint layer was integrated into the iterative chained equations:
(1)
After each imputation iteration, values exceeding feasible ranges were clipped;
(2)
The sum of contributive porosities was checked against total porosity, with proportional adjustment of >200 nm porosity if deviation exceeded 2%;
(3)
Pore connectivity ratio was included as a hidden variable in the random forest model, ensuring that imputed values inherit the connectivity–total porosity correlation from the original data.
Post-imputation validation with 20 intentionally missing samples showed that the relative error of total porosity was ≤5%, pore-size distribution ≤ 8%, and pore connectivity ratio deviation ≤ 3%, confirming the physical consistency of the filled data.
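A sketch of how the range, connectivity/tortuosity, and pore-continuity constraints above might be enforced after imputation; the column names and the additive correction of the >200 nm fraction are assumptions for illustration.

```python
# Post-imputation enforcement of the physical constraints from Section 2.2.4.
import pandas as pd

PORE_COLS = ["p_lt20", "p_20_50", "p_50_200", "p_gt200"]  # assumed names

def enforce_constraints(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # clip imputed values to the observed feasible ranges
    df["connectivity"] = df["connectivity"].clip(0.13, 0.45)
    df["tortuosity"] = df["tortuosity"].clip(1.17, 1.35)
    # force contributive porosities to sum to total porosity, correcting
    # the >200 nm fraction when the deviation exceeds 2%
    residual = df["porosity"] - df[PORE_COLS].sum(axis=1)
    needs_fix = residual.abs() / df["porosity"] > 0.02
    df.loc[needs_fix, "p_gt200"] += residual[needs_fix]
    return df
```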

2.3. VSG

To address the limitation of valid data, the MLP model is trained by using the data after expanding to 1000 and 3000 groups by the virtual sample generation (VSG) algorithm.
The VSG algorithm [43] is a density estimation algorithm based on the Gaussian mixture model, which can be used to fit an arbitrarily shaped nonlinear function by adjusting the weights to change the probability density function curve of the mixture model. Its algorithmic steps are shown in Figure 2.
Assuming that the observations X = {X1, …, Xn} of the n data points are generated by a Gaussian mixture distribution P, each vector Xi is p-dimensional and P consists of G components, the maximum mixed likelihood function for this distribution is as follows:
$$L_M(\theta_1, \ldots, \theta_G; \gamma_1, \ldots, \gamma_n \mid x) = \prod_{i=1}^{n}\sum_{k=1}^{G} \pi_k f_k(x_i \mid \theta_k) \qquad \left(\pi_k \ge 0;\ \sum_{k=1}^{G}\pi_k = 1\right) \tag{16}$$
If $f_k(x_i \mid \theta_k)$ follows a multivariate normal distribution, where $\theta_k$ consists of a mean $\mu_k$ and a covariance matrix $\Sigma_k$, the density function $f_k(x_i \mid \theta_k)$ can be expressed as follows:
$$f_k(x_i \mid \mu_k, \Sigma_k) = \frac{\exp\left\{-\frac{1}{2}\left(x_i - \mu_k\right)^{T}\Sigma_k^{-1}\left(x_i - \mu_k\right)\right\}}{(2\pi)^{p/2}\left|\Sigma_k\right|^{1/2}} \tag{17}$$
The Gaussian mixture distribution can be described by a probability density function expressed as a weighted mean of G Gaussian density functions, as shown in Equation (18):
$$P(x \mid \theta) = \sum_{k=1}^{G}\pi_k f_k(x \mid \mu_k, \Sigma_k) \tag{18}$$
When generating the virtual database, assume that the complete data are $y_i = (x_i, z_i)$, where $x_i$ is an observable variable and $z_i = (z_{i1}, \ldots, z_{iG})$ is a hidden variable, with $z_{ik} = 1$ if $x_i$ belongs to component $k$ and $z_{ik} = 0$ otherwise. Let the $z_i$ be independently and identically distributed over the components with probabilities $\{\pi_1, \ldots, \pi_G\}$; the density of $x_i$ given $z_i$ is $\prod_{k=1}^{G} f_k(x_i \mid \theta_k)^{z_{ik}}$. The log-likelihood function for the complete data is shown in Equation (19):
$$L(\theta_k, \pi_k, z_{ik} \mid x) = \sum_{i=1}^{n}\sum_{k=1}^{G} z_{ik}\,\log\left[\pi_k f_k(x_i \mid \theta_k)\right] \tag{19}$$
The parameters in Equation (19) can be solved by the Expectation Maximization (EM) algorithm, which firstly estimates the parameters by observing the data and the existing model, and uses the value of this estimated parameter to calculate the expected value of the likelihood function, as shown in Equation (20). EM-based iterative updating has also been recommended in chloride modeling reviews for its effectiveness in capturing exposure-dependent environmental variability [12]. Then, the parameters corresponding to the maximization of the likelihood function can be found according to Equation (21). The EM algorithm guarantees that the likelihood function increases with each iteration, ensuring eventual convergence.
$$Q\left(\theta, \theta^{(i-1)}\right) = E\left[\log L(\theta \mid X, Z)\right] = \int \log L(\theta \mid X, Z)\, f\left(Z \mid \theta, \theta^{(i-1)}\right) dZ \tag{20}$$
$$\theta^{*} = \theta^{(i)} = \arg\max_{\theta}\, Q\left(\theta, \theta^{(i-1)}\right) \tag{21}$$
To ensure that generated virtual samples are both statistically consistent with the original dataset and physically admissible in terms of concrete microstructure and chloride transport mechanisms, the following constraints are integrated into the VSG process:
First, for statistical consistency, the GMM is trained to fit the joint probability distribution of all input and output variables of the original 144 samples, with the number of GMM components (G = 5) optimized via the Bayesian Information Criterion (BIC) to avoid distribution overfitting. Virtual samples are generated by sampling from the fitted GMM while strictly adhering to the covariance structure of the original dataset, and statistical validation confirms that key metrics of 1000 and 3000 groups of virtual samples deviate by ≤5% from the original data (Table 1).
Second, for physical admissibility, the following applies: (1) Microstructure constraints: Total porosity is bounded within 8.2–17.6%, which is consistent with the marine concrete experimental result [31], the sum of contributive porosities of different pore sizes is forced to equal total porosity, pore size distribution follows realistic ranges, pore connectivity is limited to 0.13–0.45, and tortuosity falls within 1.17–1.35; (2) Transport mechanism constraints: Chloride diffusivity (D) complies with Archie’s law and is bounded within 10−13–10−11 m2/s, avoiding unrealistic diffusivity values.
Additionally, a subset of 50 virtual samples was randomly selected and validated against independent experimental data, showing a relative error of D ≤ 10%, confirming that virtual samples align with real-world chloride transport behavior [44,45].
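A condensed sketch of this VSG pipeline using scikit-learn's GaussianMixture, with BIC-based component selection and a post-sampling porosity filter; the placeholder data, column layout, and candidate component range are illustrative.

```python
# Fit a GMM to the original records, sample virtual rows, and keep
# only physically admissible samples (total-porosity bound from the text).
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.uniform(8.2, 17.6, size=(144, 6))  # stand-in for the 144 records

candidates = [GaussianMixture(n_components=g, covariance_type="full",
                              random_state=0).fit(data) for g in range(1, 9)]
best = min(candidates, key=lambda m: m.bic(data))   # BIC picks G (text: G = 5)

virtual, _ = best.sample(3000)                      # draw 3000 virtual samples

porosity = virtual[:, 0]                            # assumed column layout
keep = (porosity >= 8.2) & (porosity <= 17.6)       # total-porosity bound
virtual = virtual[keep]                             # retain admissible rows
```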

3. Results and Discussion

3.1. Statistical Characteristics of Data

The distribution of the original data may change after different preprocessing methods, which affects the training and prediction performance of the MLP model. Therefore, a statistical characterization of the preprocessed data was conducted; the results are shown in Table 1, where Time and Rand indicate that the data were arranged by exposure time and randomly, respectively, when filling the missing data.
Notably, 98.1% of KNN-imputed samples, 97.5% of Miceforest-imputed samples, and 96.3% of Lagrange-imputed samples comply with all predefined physical constraints as mentioned in Section 2.2.4, confirming that statistical consistency is maintained without violating concrete microstructure physics.
The impact of data arrangement on the KNN imputation method is relatively minor, with the statistical parameters remaining largely unchanged compared to the original dataset after imputation. However, the effect of data arrangement on the Lagrange and Miceforest filling methods is more obvious. When filled by the Lagrange method under random arrangement, the standard deviation and skewness generally become larger, but the opposite is observed under exposure-time arrangement. For the Miceforest filling method, the statistical parameters of the data arranged by exposure time remain close to those of the original data, while the kurtosis and skewness deviations of some variables (e.g., contributive porosity with pore size less than 20 nm) become larger under random arrangement. The statistical parameters of data filled by KNN are generally smaller than those of the original data, except for the kurtosis of contributive porosity for <20 nm pores, which is larger. After expanding the data by the VSG method, the statistical parameters are basically consistent with those of the original data, except that some variables (e.g., contributive porosity with pore sizes of 20–50 nm and 50–200 nm) show larger kurtosis deviations.

3.2. Model Prediction Results

The data processed by the different preprocessing methods and missing-value filling treatments are fed into the MLP model, which is trained to obtain predictions for the training and testing samples.

3.2.1. Predictive Effect of the MLP Model Without Missing Value Filling Treatment

Figure 3 gives the prediction results of the MLP model with training samples without missing value filling treatment and testing samples. Table 2 gives the results of performance evaluation indicators of the MLP model after using different data preprocessing methods.
As seen in Figure 3, the MLP model has good prediction results, i.e., the predicted values are close to the measured ones, with most predicted values fitting within the ±20% bound lines. In general, the prediction effect of standardization is slightly better than that of normalization. For the training samples, the single-point maximum relative error (2.966) and average relative error (1.198) after standardization are smaller than those of normalization (3.323 and 1.271), while the single-point minimum relative error (0.704) is slightly larger than that of normalization (0.305). For the testing samples, the single-point maximum relative error (2.199) and average relative error (1.183) after normalization are smaller than those of standardization, but the single-point minimum relative error after normalization (0.680) is smaller than that of standardization (0.823).
As can be seen from Table 2, the OBJnew value of the model using standardization (0.973) is smaller than that of the model using normalization, indicating that the MLP model with standardization is more accurate. In addition, when the MLP models predict on the testing samples, whether using standardization or normalization, they achieve higher R2 values and lower MSE and MAE values than on the training samples. The reasons may be the following: (1) due to the small sample size, the distribution of the randomly selected testing samples differs considerably from that of the training samples, and the larger variance within the training samples increases the training prediction error. (2) Because the models are trained on the same training samples, the random seed leads to different weight initializations in each training run, and the MLP may converge to a local minimum [46]. Similar instability issues were also observed in robustness-oriented chloride-diffusion prediction frameworks under mechanical loading and freeze–thaw cycles, where stochastic weight initialization and complex deterioration interactions increased the likelihood of convergence to suboptimal minima [15].
Results from repeated 10-fold cross-validation show that the standard deviations of all performance metrics for both standardized and normalized data are ≤0.1. Specifically, the coefficient of variation for the testing set R2 is only 3.4% for normalization, indicating that sampling bias under small-sample conditions has been effectively mitigated. Combined with the observation that most predicted values fall within the ±20% error bounds in Figure 3, it is confirmed that the MLP model without missing value filling not only achieves reliable prediction accuracy but also meets the stability requirements for engineering applications.

3.2.2. Predictive Effect of MLP Model After Missing Value Filling Treatment

The training samples obtained with the different missing-data filling methods are fitted; the prediction results on the training samples are shown in Figure 4. For the Lagrange and KNN filling methods, the model with missing values filled under exposure-time arrangement has the better prediction effect, with most predicted values falling within the ±20% bound lines, while for the Miceforest filling method the model with missing values filled under random arrangement performs better. In addition, for the Lagrange filling method, data preprocessed by normalization yields higher R2 values of 0.92 and 0.61 for exposure-time and random arrangement, compared with standardization, whose R2 values are 0.83 and 0.54, respectively. For the KNN filling method with exposure-time arrangement, normalization improves the prediction effect, with an R2 value of 0.85; for random arrangement, however, the preprocessing method has little influence on the R2 value, which is 0.79 for normalization and 0.77 for standardization. For the Miceforest filling method, data arrangement and preprocessing method both have little influence on the model prediction effect, and the R2 values are close to each other.
Similarly, the prediction results on the testing samples are shown in Figure 5. For the Lagrange and Miceforest filling methods, the model with missing values filled under random arrangement has the better prediction effect, with most predicted values falling within the ±20% bound lines, while for the KNN filling method the model with missing values filled under exposure-time arrangement performs better. In addition, for the Lagrange filling method, data preprocessed by normalization yields higher R2 values of 0.57 and 0.61 for exposure-time and random arrangement, compared with standardization, whose R2 values are 0.33 and 0.51, respectively. For the KNN and Miceforest filling methods, data arrangement has an obvious influence on the model prediction effect. When filling missing values under exposure-time arrangement, the model preprocessed by normalization performs better, with higher R2 values of 0.78 and 0.41 for the KNN and Miceforest filling methods, compared with standardization, whose R2 values are only 0.66 and 0.28, respectively. When filling missing values under random arrangement, the model preprocessed by standardization performs better for the KNN filling method, but the preprocessing method has little influence on the Miceforest filling method. Moreover, the prediction accuracy of some MLP models (e.g., those with missing values filled by the Lagrange and Miceforest methods under exposure-time arrangement and by the KNN method under random arrangement) decreased more on the testing samples than on the training samples; the reason may be overfitting caused by the small size of the testing set.
Synthesizing the model prediction effects of training samples and testing samples, the comprehensive evaluation indicator OBJnew is calculated by considering the influences of data arrangement, preprocessing method and filling method, and the corresponding results are shown in Figure 6.
As shown in Figure 6, the exposure-time arrangement yields better prediction performance for the Lagrange and KNN imputation methods compared to random arrangement, whereas the opposite trend is observed for the Miceforest method. This phenomenon may be due to the following reasons. (1) The Lagrange and KNN filling methods predict missing entries from values in their vicinity, so under exposure-time arrangement the imputed data stay closer to the original data, resulting in a better prediction effect. (2) Miceforest is well-suited to data with skewed characteristics, hence performing better under random arrangement.
Miceforest maintains consistent performance across exposure-time and random arrangements due to an algorithmic design that reduces reliance on sample order. First, it adopts Multiple Imputation by Chained Equations (MICE) integrated with random forests to iteratively model each variable as a function of all other microstructural parameters, capturing global intrinsic correlations rather than local or sequential relationships. Second, it generates multiple imputed datasets with controlled randomness in RF sampling and averages the results to mitigate uncertainty from disordered data, an advantage lacking in KNN and Lagrange, which depend on order-dependent neighbor or adjacent-point selection. Third, it leverages RF's robustness to skewed microstructural data (e.g., the >200 nm pore ratio with skewness = 2.67, Table 1) via impurity-based feature splitting, avoiding the limitations of linearity assumptions. Finally, it integrates physical constraints during iteration, ensuring that imputed values comply with concrete physics regardless of data order. These traits make Miceforest well-suited to scenarios where data arrangement cannot be controlled, as it maintains stable prediction accuracy without relying on ordered, physically similar samples.
When using Lagrange filling, normalization consistently yields higher prediction accuracy than standardization, regardless of arrangement. The main reason may be that the normalization preprocessing can mitigate the overfitting phenomenon with a small sample size [47]. When the Miceforest and KNN methods are used for filling missing value, normalization preprocessing can lead to a better prediction effect when data is arranged by exposure time, but standardization preprocessing can lead to a better prediction effect when data is arranged by random, since random arrangement will produce more outlier data, which has a more significant impact on normalization.
To further clarify the mechanisms underlying the performance differences between imputation schemes, we analyze the link between imputation method characteristics, data arrangement, and model accuracy. The KNN imputation method paired with exposure-time arrangement and normalization yields the best comprehensive performance, attributed to three synergistic factors. (1) KNN relies on Euclidean distance to select neighboring samples for imputation. Concrete microstructural parameters evolve dynamically with exposure time, exposure-time arrangement groups samples with similar microstructural states, ensuring KNN’s neighbors are physically meaningful. Imputed values thus inherit time-dependent microstructural trends, avoiding the irrationality of random neighbor selection. (2) Chloride diffusivity is inherently time-dependent, and microstructural parameters have a strong positive correlation with exposure time. This arrangement maintains the causal chain of exposure time–microstructural evolution–chloride diffusivity, enabling KNN to impute values consistent with concrete durability physics. In contrast, random arrangement breaks this correlation, leading to dissimilar neighbors and unreliable imputation. (3) Normalization maps data to [0, 1], preserving the relative relationships between microstructural parameters that are critical for KNN’s distance metric. For exposure-time arranged data with consistent trends, normalization ensures distance reflects true physical similarity, rather than being dominated by absolute value differences.
The performance of each imputation method is determined by its algorithmic design and compatibility with data characteristics. (1) Lagrange filling relies on linear polynomial interpolation between adjacent samples, assuming continuous linear parameter relationships. This works for exposure-time arranged data but fails to capture nonlinear interactions. For random arranged data, linear interpolation between dissimilar samples causes large imputation errors, leading to low model accuracy. (2) KNN filling is non-parametric and similarity-based, and its performance is highly dependent on data arrangement. Exposure-time arrangement provides physically similar neighbors, maximizing KNN’s strengths; random arrangement leads to dissimilar neighbors, reducing reliability. KNN is also sensitive to feature scaling, normalization outperforms standardization for arranged data, while standardization is better for random data. (3) Miceforest filling uses random forest-based chained equations to model multi-feature interactions iteratively. It is insensitive to data arrangement because it learns global parameter correlations rather than relying on sample order. For random arranged data, this robustness ensures stable performance, but its probabilistic nature introduces minor noise, making it less accurate than KNN for well-arranged data.

3.2.3. Predictive Effect of MLP Model with Data Expanding by VSG

The 1000 and 3000 groups of data expanded by the VSG method are fitted; the prediction results on the training and testing samples are shown in Figure 7.
As shown in Figure 7, the MLP model with VSG-expanded data has higher prediction accuracy, with most predicted values falling within the ±20% bound lines regardless of the preprocessing method. Overall, the model prediction effect is better after expanding to 3000 groups of data, but the corresponding single-point relative error is larger for normalized data, with the maximum single-point error increasing from 2.448 (1000 groups) to 3.863 (3000 groups) (Figure 7(2,5)). This phenomenon arises from three key factors: the intrinsic characteristics of VSG generation, the data-driven nature of the MLP model, and the mathematical properties of normalization:
First, expanding from 1000 to 3000 groups enhances prediction accuracy by filling gaps in the parameter space. The original 144 samples have limited coverage of microstructural variability. VSG-generated samples preserve the original data’s statistical and physical consistency, and 3000 groups provide more comprehensive representation of parameter interactions. This enables the MLP model to learn more universal nonlinear relationships between microstructural parameters and chloride diffusivity, reducing overfitting to small-sample noise and stabilizing weight optimization via backpropagation, evidenced by testing R2 increasing from 0.84 to 0.85 for standardization and 0.78 to 0.85 for normalization (as shown in Table 3).
Second, normalized data introduces more noise with 3000-group expansion due to its sensitivity to extreme values. Normalization scales data to the [0, 1] range using the original dataset’s min/max values, compressing the data range and amplifying small absolute differences in extreme values. Expanding to 3000 groups increases the probability of generating virtual samples near the distribution tails via GMM sampling. These tail samples act as outliers in the normalized space, and the MLP’s ReLU activation function further amplifies these input variations, leading to larger single-point errors. In contrast, standardization uses mean and standard deviation for scaling, mitigating the impact of outliers and resulting in more stable error performance. Additionally, minor statistical uncertainty from iterative EM algorithm optimization accumulates with more samples, and normalization’s range-dependent scaling exacerbates this uncertainty, whereas standardization’s distribution-based scaling suppresses it.
For the training samples, the model preprocessed with standardization yields better results for both expansion scales, as standardization’s robustness to noise aligns with MLP’s need for stable training signals. For the testing samples, standardization outperforms normalization at 1000 groups due to fewer outliers, while the two methods show comparable performance at 3000 groups, since the expanded data provides sufficient reliable training signals to offset the noise introduced in normalized data.
Table 3 shows the results of performance evaluation indicators of the MLP model after using different data preprocessing methods.
Comparing Table 3 with Table 2, it can be observed that the performance evaluation indicators improve markedly when the data are expanded by the VSG method. When increasing the data from 1000 to 3000 groups, the OBJnew values decrease for both standardization and normalization preprocessing, by 17.25% and 27.26%, respectively, indicating that data expansion has a more pronounced effect on normalization.
Repeated 10-fold cross-validation results show that after expanding to 3000 groups via VSG, the standard deviations of the testing set R2 for standardized and normalized preprocessing decrease to 0.022 and 0.024, respectively (Table 3), which are further reduced compared to 1000-group data (0.027 and 0.032). This indicates that the combination of increased data volume and the enhanced validation strategy significantly improves the model’s stability against sampling variability. Even though normalization is sensitive to outliers, the standard deviation of the testing set MAE under 3000-group data is still controlled within 0.033, confirming that data expansion effectively offsets the negative impact of extreme values.
Moreover, the MSE is used as the loss function to analyze the deviation between the true value of a single sample and the value predicted by the MLP model; a smaller loss function value indicates a better model fitting ability [48]. Figure 8 shows the relationship between the loss function value and the number of epochs for the MLP model with data expanded by the VSG method.
As seen in Figure 8, the loss function value drops sharply at first and then stabilizes as the number of epochs increases. This stabilization indicates that the model has reached optimal convergence rather than overfitting, supported by multiple lines of evidence:
First, the stabilization stems from the SGD optimizer’s weight convergence mechanism. In early epochs, randomly initialized weights lead to large prediction errors, and SGD rapidly adjusts weights to minimize MSE loss. As epochs increase, the gradient of the loss function diminishes, and weights approach the global minimum of the loss surface, further updates cause negligible weight changes, resulting in stable loss values. This aligns with SGD’s inherent characteristic of converging to near-optimal solutions for regression tasks with convex loss landscapes (e.g., MSE).
Second, overfitting is effectively mitigated by targeted training strategies. Overfitting would manifest as diverging training/testing performance (e.g., decreasing training loss but increasing testing loss), but this study avoids this via (1) 10-fold cross-validation to reduce sampling bias; (2) an optimized MLP architecture (two hidden layers, 16 neurons each) balanced for complexity (sensitivity analysis showed deeper/wider networks overfit); and (3) data preprocessing and VSG expansion to suppress noise. As shown in Table 3, testing R2 (0.85 for 3000-group normalization) is consistent with training R2 (0.93), and OBJnew decreases with data expansion—confirming no overfitting.
Third, physical constraints of chloride transport limit further loss reduction [49]. Chloride diffusivity is governed by inherent physical laws (e.g., porosity–tortuosity relationships), which define a lower bound for prediction error. Once the model captures these core nonlinear mechanisms, additional epochs cannot reduce loss beyond this physically meaningful limit, leading to stable loss values.
Moreover, for the same expanding group, the loss function of data preprocessed by the normalization method decreases at a faster rate than that by standardization. When expanding data to 1000 groups, the model reaches stability within about 1500 epochs and 600 epochs for standardization and normalization preprocessing, respectively. When expanding data to 3000 groups, the model stabilizes within about 600 epochs for normalization preprocessing, but it is still not stable after 3000 epochs for standardization preprocessing. This reflects the larger dataset requiring more iterations for SGD to converge, not overfitting, as testing R2 remains stable (0.85) and OBJnew is the lowest (0.334). No obvious jumps and oscillations are observed in the loss curves of the MLP model with normalization, and a small loss function value (less than 0.009) is achieved when the number of epochs exceeds 1500, indicating the MLP model with normalization is more stable.
Furthermore, the prediction performance of the MLP model improves significantly when the data are expanded by the VSG method, mainly because the distributions of the training and testing samples become more uniform after expansion, reducing chance effects. Table 3 shows that expanding the data from 1000 to 3000 groups decreases the OBJnew value (by 17.25% for standardization and 27.26% for normalization), indicating higher prediction accuracy with a larger data volume. This is attributable to the data-driven nature of the MLP, which requires sufficient data for training. As the number of expanded groups increases, the impact of noisy data on normalization is reduced, narrowing the performance gap between normalization and standardization. In addition, a larger database increases the risk of neuron saturation, and normalization helps prevent saturation and maintain the convergence rate, resulting in a more stable model [50].
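For transparency, the quoted reductions follow directly from the mean OBJnew values in Table 3 (the small offsets from 17.25% and 27.26% arise because the tabulated means are rounded):

(0.404 − 0.334) / 0.404 × 100% ≈ 17.3% (standardization)
(0.604 − 0.439) / 0.604 × 100% ≈ 27.3% (normalization)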
Combined with the repeated 10-fold cross-validation results, the stable loss value for the 3000-group standardized data (0.05 ± 0.008) is consistent with the testing-set R2 (0.85 ± 0.022), with no overfitting signature (decreasing training loss accompanied by increasing testing loss). This further verifies the stability of the two-hidden-layer MLP architecture under expanded data.

4. Conclusions and Future Research Directions

4.1. Conclusions

Establishing a reliable and accurate prediction model for chloride diffusion is crucial for saving time and cost in durability assessment. In this paper, a database comprising 144 sets of chloride diffusion coefficient data and corresponding microstructural test results was established from long-term marine-environment exposure experiments and laboratory simulations to train the MLP model. To overcome the influence of data incompleteness, the Lagrange, KNN and Miceforest methods were used to fill the missing data. Moreover, to address the issue of limited effective data, the VSG method was adopted to improve the accuracy of the constructed MLP model. The main conclusions are as follows.
(1)
The developed MLP model exhibits excellent predictive performance, with most predicted values falling within the ±20% error bounds. However, the small sample size introduces inherent randomness that affects prediction stability: the uneven distribution of the 10 testing samples (randomly selected from 144 in total) and the stochastic weight initialization of the MLP may lead to fluctuating errors (e.g., inconsistencies between training and testing R2). Without data filling or expansion, the MLP model preprocessed by normalization is less accurate than that preprocessed by standardization, because normalization is sensitive to outliers and extreme values in small datasets, while standardization preserves the relative data distribution and mitigates such interference.
(2)
For databases with missing values (a 40% missing rate was simulated), the Lagrange, KNN and Miceforest imputation methods all enable the MLP model to achieve reliable predictive performance, but the data arrangement during imputation significantly influences the results owing to the algorithmic characteristics of each method. Specifically, Lagrange and KNN (local similarity-based methods) perform better with the exposure-time arrangement: this arrangement groups samples at similar microstructural evolution stages, so the imputed values inherit time-dependent physical trends and avoid the distortions caused by dissimilar neighbors or interpolation points under the random arrangement. Miceforest (a global correlation-based method) is superior with the random arrangement: it learns interactions among all microstructural parameters via random-forest-chained equations rather than relying on local order, and its robustness to skewed data further stabilizes its performance on unordered datasets. Overall, the model preprocessed by normalization, with KNN imputation under the exposure-time arrangement, achieved the best prediction accuracy.
(3)
Training the MLP model with VSG-expanded data significantly improves the prediction accuracy, with the 3000-group expanded dataset outperforming the 1000-group dataset. This is because VSG fills gaps in the parameter space of the original small sample, enabling the model to learn more universal nonlinear relationships between the microstructural parameters and chloride diffusivity. However, VSG introduces subtle outliers when expanding to 3000 groups, which affect normalization more strongly: normalization scales the data to the [0, 1] range using the original dataset's minimum and maximum values, compressing value ranges and amplifying outlier-induced noise. In contrast, standardization mitigates this issue via mean-std scaling, resulting in more stable error performance. Notably, normalization enhances the model's generalization ability and stability but is more susceptible to outliers, while the influence of such outliers diminishes with larger expanded datasets, narrowing the performance gap between normalization and standardization. A combined sketch of the imputation and expansion steps is given after this list.
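As referenced above, the following is a minimal end-to-end sketch of the two data-handling steps summarized in conclusions (2) and (3), assuming scikit-learn's KNNImputer and GaussianMixture as stand-ins for the study's implementations; the neighbor count, the number of mixture components, and the simple min-max bounding filter (a proxy for the paper's physical-consistency checks) are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((144, 5))                       # placeholder 144-sample database
X[:, 1:][rng.random((144, 4)) < 0.4] = np.nan  # ~40% missing; column 0 (time) kept

# Step 1: KNN imputation fills each missing entry from the k most similar
# rows; sorting rows by exposure time beforehand would mimic the
# exposure-time arrangement discussed in conclusion (2).
X_filled = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: GMM-based virtual sample generation fits the joint distribution
# and keeps only sampled points inside the observed per-feature ranges.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_filled)
lo, hi = X_filled.min(axis=0), X_filled.max(axis=0)
virtual, n_kept = [], 0
while n_kept < 3000:
    cand, _ = gmm.sample(3000)
    cand = cand[np.all((cand >= lo) & (cand <= hi), axis=1)]
    virtual.append(cand)
    n_kept += len(cand)
X_expanded = np.vstack(virtual)[:3000]
print(X_expanded.shape)  # (3000, 5)
```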

4.2. Future Research Directions

Although the proposed MLP model demonstrates excellent predictive performance for concrete chloride diffusivity, several limitations remain, providing opportunities for further optimization and expansion:

4.2.1. Current Limitations

The model is trained on a foundational dataset of 144 samples. Although VSG-generated virtual samples adhere to strict physical constraints and statistical consistency with the original data, they cannot fully replicate the inherent complexity of real-world concrete microstructures. For instance, stochastic crack formation, interfacial transition zone (ITZ) inhomogeneity, and localized pore coarsening under actual service conditions are not fully captured in virtual samples.
Exposure time is treated as a control variable rather than a dynamic evolutionary factor. The model lacks time-aware architectures to capture the continuous, sequential degradation of concrete microstructures over long-term service, limiting its ability to predict time-dependent chloride diffusivity trends.

4.2.2. Prospective Research Directions

(1)
Collect monthly or quarterly microstructural data from long-term exposure experiments. This sequential dataset will support the training of time-aware models such as Long Short-Term Memory (LSTM) or temporal Convolutional Neural Networks (temporal CNN), enabling dynamic prediction of chloride diffusivity that reflects real-time microstructural evolution.
(2)
Incorporate factors such as cyclic mechanical loading, freeze–thaw cycles, marine tidal fluctuations, and temperature variations into the input variables. This integration will enhance the model’s generalizability across complex service scenarios, addressing the coupled deterioration mechanisms that affect concrete chloride transport in practical engineering.
(3)
Refine the GMM parameterization in the VSG algorithm to minimize the generation of outlier virtual samples. This improvement will mitigate the sensitivity of normalization to extreme values, further narrowing the performance gap between normalization and standardization in large-scale virtual datasets.
(4)
Conduct model validation using concrete samples collected from diverse marine regions with varying environmental conditions and concrete mix designs. This will verify the model's engineering applicability and robustness, ensuring its reliability for practical durability assessment. Additionally, Monte Carlo cross-validation (MCCV) will be implemented to enhance the evaluation of model robustness [51]; a minimal sketch of this scheme is given after this list. The MCCV results will be compared with those of the current 10-fold cross-validation to quantitatively verify the reliability of the observed high R2 values and to assess the model's resistance to sampling variability.
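Monte Carlo cross-validation corresponds to repeated random train/test splitting; a minimal sketch with scikit-learn's ShuffleSplit follows (the split count, test fraction, and placeholder data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neural_network import MLPRegressor

# Placeholder data standing in for the 144-sample database.
rng = np.random.default_rng(2)
X = rng.random((144, 5))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(144)

# 100 random 90/10 splits instead of 10 fixed folds; the spread of the
# scores quantifies the model's sensitivity to sampling variability.
mccv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
scores = cross_val_score(model, X, y, cv=mccv, scoring="r2")
print(f"R2 = {scores.mean():.3f} ± {scores.std():.3f}")
```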

Author Contributions

Conceptualization, R.F.; Methodology, R.F.; Software, J.Z. and S.M.; Validation, Q.L. and S.M.; Investigation, J.Z.; Data curation, R.F.; Writing—original draft, Q.L.; Writing—review and editing, Q.L., Z.G. and S.M.; Visualization, J.Z.; Project administration, Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Laboratory of Green Construction and Intelligent Operation & Maintenance for Coastal Infrastructure (Zhejiang University of Technology, Hangzhou 310014, China), grant number ZKLCDF230201.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

Author Jiaming Zhu was employed by the company Zhejiang Academy of Building Research & Design Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Djenaoucine, L.; Argiz, C.; Picazo, A.; Gálvez, J.C. The corrosion-inhibitory influence of graphene oxide on steel reinforcement embedded in concrete exposed to a 3.5M NaCl solution. Cem. Concr. Compos. 2025, 155, 105835. [Google Scholar]
  2. Djenaoucine, L.; Picazo, Á.; de la Rubia, M.Á.; Moragues, A.; Gálvez, J.C. Influence of Graphene Oxide on Mechanical Properties and Durability of Cement Mortar. Materials 2024, 17, 1445. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, Q.Z.; Zhao, M.Z.; Song, L.; Yang, Y.H.; He, J.M. The evolution of concrete microstructure and chloride-ion diffusion coefficient under cyclic axial compression. Int. J. Concr. Struct. Mater. 2025, 19, 56. [Google Scholar] [CrossRef]
  4. Tong, L.Y.; Liu, Q.F.; Xiong, Q.X.; Meng, Z.Z.; Amiri, O.; Zhang, M.Z. Modeling the chloride transport in concrete from microstructure generation to chloride diffusivity prediction. Comput. Aided Civ. Infrastruct. Eng. 2025, 40, 1129–1149. [Google Scholar]
  5. Luo, R.; Cai, Y.; Wang, C.; Huang, X.M. Study of chloride binding and diffusion in GGBS concrete. Cem. Concr. Res. 2003, 33, 1–7. [Google Scholar] [CrossRef]
  6. Moon, H.; Kim, H.; Choi, D. Relationship between average pore diameter and chloride diffusivity in various concretes. Constr. Build. Mater. 2006, 20, 725–732. [Google Scholar] [CrossRef]
  7. Zhang, M.; Li, H. Pore structure and chloride permeability of concrete containing nano-particles for pavement. Constr. Build. Mater. 2011, 25, 608–616. [Google Scholar] [CrossRef]
  8. Jin, L.B.; Yu, H.L.; Wang, Z.; Wang, Z.Q.; Fan, T. Developing a model for chloride transport through concrete considering the key factors. Case Stud. Constr. Mater. 2022, 17, e01168. [Google Scholar] [CrossRef]
  9. Qi, D.; Zheng, H.; Zhang, L.; Sun, G.W.; Yang, H.T.; Li, Y.F. Numerical simulation on diffusion reaction behavior of concrete under sulfate chloride coupled attack. Constr. Build. Mater. 2023, 405, 133237. [Google Scholar] [CrossRef]
  10. Shazali, M.A.; Rahman, M.K.; Al-Gadhib, A.H.; Baluch, M.H. Transport modeling of chlorides with binding in concrete. Arab. J. Sci. Eng. 2012, 37, 469–479. [Google Scholar] [CrossRef]
  11. Djenaoucine, L.; Picazo, A.; Rubia, M.A.; Gálvez, J.C.; Moragues, A. Effect of graphene oxide on the hydration process and macro-mechanical properties of cement. J. Span. Ceram. Glass Soc. 2024, 63, 294–303. [Google Scholar] [CrossRef]
  12. Zhao, R.Q.; Li, C.F.; Guan, X.M. Advances in modeling surface chloride concentrations in concrete serving in the marine environment: A mini review. Buildings 2024, 14, 1879. [Google Scholar] [CrossRef]
  13. Cai, R.; Han, T.H.; Liao, W.Y.; Li, D.W.; Kumar, A.; Ma, H.Y. Prediction of surface chloride concentration of marine concrete using ensemble machine learning. Cem. Concr. Res. 2020, 136, 106164. [Google Scholar] [CrossRef]
  14. Zhang, H.P.; Li, X.C.; Amin, M.N.; Al-Naghi, A.A.A.; Ul Arifeen, S.; Althoey, F.; Ahmad, A. Analyzing chloride diffusion for durability predictions of concrete using contemporary machine learning strategies. Mater. Today Commun. 2024, 38, 108543. [Google Scholar] [CrossRef]
  15. Li, Y.F.; Xiao, H.G.; Liu, J.L. Robust machine learning framework for predicting chloride ion diffusion in concrete under load and freeze-thaw conditions. Case Stud. Constr. Mater. 2025, 23, e05249. [Google Scholar] [CrossRef]
  16. Liu, Q.F.; Iqbal, M.F.; Yang, J.; Lu, X.Y.; Zhang, P.; Rauf, M. Prediction of chloride diffusivity in concrete using artificial neural network: Modelling and performance evaluation. Constr. Build. Mater. 2021, 268, 121082. [Google Scholar] [CrossRef]
  17. Boğa, A.R.; Öztürk, M.; Topçu, İ.B. Using ANN and ANFIS to predict the mechanical and chloride permeability properties of concrete containing GGBFS and CNI. Compos. Part B Eng. 2013, 45, 688–696. [Google Scholar] [CrossRef]
  18. Ahmad, A.; Farooq, F.; Ostrowski, K.A.; Sliwa-Wieczorek, K.; Czarnecki, S. Application of novel machine learning techniques for predicting the surface chloride concentration in concrete containing waste material. Materials 2021, 14, 2297. [Google Scholar] [CrossRef] [PubMed]
  19. Deng, L.; Yu, D. Deep Learning: Methods and Applications; China Machine Press: Beijing, China, 2016; pp. 32–64. [Google Scholar]
  20. Abdellatief, M.; Abd-Elmaboud, M.E.; Saqr, A.M. A convolutional neural network-based deep learning approach for predicting surface chloride concentration of concrete in marine tidal zones. Sci. Rep. 2025, 15, 27611. [Google Scholar] [CrossRef]
  21. Polyzotis, N.; Roy, S.; Whang, S.E.; Zinkevich, M. Data management challenges in production machine learning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 1723–1726. [Google Scholar]
  22. Zhou, F.Y.; Tao, N.J.; Zhang, Y.R.; Yuan, W.B. Prediction of chloride diffusion coefficient in concrete based on machine learning and virtual sample algorithm. Sustainability 2023, 15, 16896. [Google Scholar] [CrossRef]
  23. Kim, W.; Cho, W.; Choi, J.; Kim, J.; Park, C.; Choo, J. A comparison of the effects of data imputation methods on model performance. In Proceedings of the 21st International Conference on Advanced Communication Technology (ICACT), PyeongChang, Republic of Korea, 17–20 February 2019; pp. 592–599. [Google Scholar]
  24. Gong, H.; Chen, Z.; Zhu, Q.; He, Y.L. A Monte Carlo and PSO based virtual sample generation method for enhancing the energy prediction and energy optimization on small data problem: An empirical study of petrochemical industries. Appl. Energy 2017, 197, 405–415. [Google Scholar] [CrossRef]
  25. Yang, J.; Yu, X.; Xie, Z.Q.; Zhang, J.P. A novel virtual sample generation method based on Gaussian distribution. Knowl.-Based Syst. 2011, 24, 740–748. [Google Scholar] [CrossRef]
  26. Zhang, Y.R.; Yu, W.L.; Wang, L.L.; Tang, K. Prediction of chloride concentration in fly ash concrete based on one-dimensional convolutional neural network. J. Zhejiang Univ. Technol. 2024, 52, 156–163. (In Chinese) [Google Scholar]
  27. Li, L.Q.; Su, L.; Guo, B.C.; Cai, R.J.; Wang, X.; Zhang, T. Prediction and prevention of concrete chloride penetration: Machine learning and MICP techniques. Front. Mater. 2024, 11, 1445547. [Google Scholar] [CrossRef]
  28. Zhang, Y.R.; Yu, W.L.; Ma, X.Q.; Luo, T.Y.; Wang, J.J. Prediction of chloride concentration in fly ash concrete based on deep learning. J. Beijing Univ. Technol. 2023, 49, 205–212. [Google Scholar]
  29. Zhang, Y.R.; Zhu, T.F.; Yu, W.L.; Fu, C.Q.; Liu, X.J.; Wan-Wendner, L. Prediction of free chloride concentration in fly ash concrete by machine learning methods. Mag. Concr. Res. 2024, 76, 1279–1289. [Google Scholar] [CrossRef]
  30. Gao, Y.H.; Shao, X.J.; Zhang, Y.R.; Fang, R.H.; Zhang, J.Z. Permeability dependency of fly ash concrete in natural tidal environment. J. Hydroelectr. Eng. 2021, 40, 214–222. (In Chinese) [Google Scholar]
  31. Gao, Y.H.; Guo, B.L.; Wang, M.; Zhang, Y.R.; Zhang, J.Z. Stable time and mechanism of concrete permeability in natural tidal environment. J. Hydroelectr. Eng. 2022, 41, 50–62. (In Chinese) [Google Scholar]
  32. Zhang, J.; Zhou, X.; Zhao, J.; Wang, M.; Gao, Y.H.; Zhang, Y.R. Similarity of chloride diffusivity of concrete exposed to different environments. ACI Mater. J. 2020, 117, 27–37. [Google Scholar] [CrossRef]
  33. Zhang, J.; Wu, J.; Zhang, Y.; Gao, Y.H.; Wang, J.D. Time-varying relationship between pore structures and chloride diffusivity of concrete under the simulated tidal environment. Eur. J. Environ. Civ. Eng. 2022, 26, 501–518. [Google Scholar] [CrossRef]
  34. Zhang, Y.R.; Wu, S.Y.; Ma, X.Q.; Fang, L.C.; Zhang, J.Z. Effects of additives on water permeability and chloride diffusivity of concrete under marine tidal environment. Constr. Build. Mater. 2022, 320, 126217. [Google Scholar] [CrossRef]
  35. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958, 65, 386–408. [Google Scholar] [CrossRef]
  36. Dang, Q.L.; Luo, R.H.; Xie, L.L.; Gao, X.C.; Bai, W.T. Multilayer perceptron-based offspring prediction model for constrained multi-objective optimization. Eng. Appl. Artif. Intell. 2025, 162, 112428. [Google Scholar] [CrossRef]
  37. Patrício, J.D.; Gusmão, A.D.; Ferreira, S.R.M.; Silva, F.A.N.; Kafshgarkolaei, H.J.; Azevedo, A.C.; Delgado, J.M.P.Q. Settlement analysis of concrete-walled buildings using soil-structure interactions and finite element modeling. Buildings 2024, 14, 746. [Google Scholar] [CrossRef]
  38. Liu, X. Study on Data Normalization in BP Neural Network. Mech. Eng. Autom. 2010, 160, 122–123+126. (In Chinese) [Google Scholar]
  39. Vu, H.L.; Ng, K.T.W.; Richter, A.; An, C.J. Analysis of input set characteristics and variances on k-fold cross validation for a Recurrent Neural Network model on waste disposal rate estimation. J. Environ. Manag. 2022, 311, 114869. [Google Scholar] [CrossRef]
  40. Nguyen, X.C.; Nguyen, T.T.H.; La, D.D.; Kumar, G.; Rene, E.R.; Nguyen, D.D.; Chang, S.W.; Chung, W.J.; Nguyen, X.H.; Nguyen, V.K. Development of machine learning-based models to forecast solid waste generation in residential areas: A case study from Vietnam. Resour. Conserv. Recycl. 2021, 167, 105381. [Google Scholar]
  41. Garre, A.; Ruiz, M.C.; Hontoria, E. Application of machine learning to support production planning of a food industry in the context of waste generation under uncertainty. Oper. Res. Perspect. 2020, 7, 100147. [Google Scholar] [CrossRef]
  42. Golafshani, E.M.; Behnood, A.; Arashpour, M. Predicting the compressive strength of normal and high-performance concretes using ANN and ANFIS hybridized with Grey Wolf Optimizer. Constr. Build. Mater. 2020, 232, 117266. [Google Scholar] [CrossRef]
  43. Shen, L.; Qian, Q. A virtual sample generation algorithm supporting machine learning with a small-sample dataset: A case study for rubber materials. Comput. Mater. Sci. 2022, 211, 111475. [Google Scholar] [CrossRef]
  44. Razmi, A.; Bennett, T.; Xie, T.; Visintin, P. A phenomenological model for chloride diffusion coefficient in concretes with traditional and blended binders and alternative fillers. Constr. Build. Mater. 2025, 461, 139783. [Google Scholar] [CrossRef]
  45. Liu, W.; Yu, H.; Ma, H.; Yang, H.; Wang, W. Experimental investigation into the influence of sulfate ions on the chloride diffusion coefficient and binding capacity in concrete. Constr. Build. Mater. 2025, 500, 144215. [Google Scholar] [CrossRef]
  46. Brady, M.; Raghavan, R.; Slawny, J. Gradient descent fails to separate. In Proceedings of the IEEE International Conference on Neural Networks, San Diego, CA, USA, 24–27 July 1988; pp. 649–656. [Google Scholar]
  47. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; Volume 1, pp. 448–456. [Google Scholar]
  48. Deng, J.G.; Zhang, S.L.; Zhang, J.F.; Xun, Y.L.; Liu, A.Q. Loss function and application research in supervised learning. Big Data 2020, 6, 60–80. (In Chinese) [Google Scholar]
  49. Gu, G.H.; Ma, T.; Qian, R.S.; Ye, H.L.; Wan-Wendner, L.; Fu, C.Q. Evolution of chloride binding and mechanical behavior in metakaolin-based geopolymer: Role of MgO-induced phase changes. Constr. Build. Mater. 2025, 478, 141424. [Google Scholar]
  50. Ma, X.; Huang, H.; Wang, Y.; Romano, S.; Erfani, S.; Bailey, J. Normalized loss functions for deep learning with noisy labels. In Proceedings of the 37th International Conference on Machine Learning (ICML), Online, 13–18 July 2020; Volume 9, pp. 6499–6509. [Google Scholar]
  51. Naeim, B.; Javadzade Khiavi, A.; Khajavi, E.; Taghavi Khanghah, A.R.; Asgari, A.; Taghipour, R.; Bagheri, M. Machine Learning Approaches for Fatigue Life Prediction of Steel and Feature Importance Analyses. Infrastructures 2025, 10, 295. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the MLP model.
Figure 2. Steps of the VSG algorithm.
Figure 3. Prediction results of data without missing treatment.
Figure 4. Prediction results of the MLP model with different filling methods: training samples.
Figure 5. Prediction results of the MLP model with different filling methods: testing samples.
Figure 6. OBJnew calculation results.
Figure 7. Prediction results of samples by the VSG expanding method: (1) 1000 data with standardization for training samples; (2) 1000 data with normalization for training samples; (3) 1000 data for testing samples; (4) 3000 data with standardization for training samples; (5) 3000 data with normalization for training samples; and (6) 3000 data for testing samples.
Figure 8. Relationship between loss function value and number of epochs.
Table 1. Statistical characteristics of input data.

| Variables | Parameters | Original Data | Lagrange (Time) | Lagrange (Rand) | KNN (Time) | KNN (Rand) | Miceforest (Time) | Miceforest (Rand) | VSG (1000) | VSG (3000) |
|---|---|---|---|---|---|---|---|---|---|---|
| Porosity | Stdev. | 1.38 | 1.48 | 1.55 | 1.41 | 1.41 | 1.40 | 1.39 | 1.34 | 1.35 |
| | Kurt. | 2.20 | 1.68 | 1.95 | 2.18 | 2.18 | 2.12 | 2.11 | 1.96 | 2.14 |
| | Skew. | 1.24 | 1.25 | 1.43 | 1.36 | 1.36 | 1.23 | 1.17 | 1.16 | 1.26 |
| <20 nm | Stdev. | 0.45 | 0.48 | 0.60 | 0.42 | 0.42 | 0.44 | 0.46 | 0.46 | 0.44 |
| | Kurt. | 1.43 | 1.11 | 2.76 | 2.37 | 2.37 | 1.59 | 1.64 | 1.38 | 1.27 |
| | Skew. | 0.75 | 0.58 | 1.35 | 0.73 | 0.73 | 0.74 | 0.84 | 0.88 | 0.72 |
| 20–50 nm | Stdev. | 0.38 | 0.38 | 0.46 | 0.38 | 0.38 | 0.36 | 0.39 | 0.38 | 0.37 |
| | Kurt. | 1.83 | 0.31 | 0.22 | 0.59 | 0.59 | 0.65 | 0.36 | 2.36 | 1.95 |
| | Skew. | −0.10 | −0.09 | 0.44 | −0.07 | −0.07 | 0.04 | −0.18 | −0.06 | −0.14 |
| 50–200 nm | Stdev. | 0.90 | 0.82 | 0.86 | 0.81 | 0.81 | 0.93 | 0.89 | 0.88 | 0.90 |
| | Kurt. | 3.62 | 3.43 | 6.58 | 3.34 | 3.34 | 3.19 | 2.84 | 2.89 | 3.65 |
| | Skew. | 1.69 | 1.53 | 2.16 | 1.56 | 1.56 | 1.65 | 1.53 | 1.55 | 1.74 |
| >200 nm | Stdev. | 0.62 | 0.64 | 0.50 | 0.59 | 0.59 | 0.55 | 0.58 | 0.62 | 0.60 |
| | Kurt. | 8.61 | 6.20 | 10.42 | 6.61 | 6.61 | 8.42 | 7.18 | 9.10 | 8.44 |
| | Skew. | 2.67 | 2.35 | 2.80 | 2.40 | 2.40 | 2.62 | 2.47 | 2.78 | 2.66 |

Note: Stdev.—standard deviation; Kurt.—kurtosis; Skew.—skewness. "Time" and "Rand" denote the exposure-time and random data arrangements, respectively. Meanwhile, <20 nm, 20–50 nm, 50–200 nm and >200 nm represent the contributive porosity of the corresponding pore-size range.
Table 2. Performance indicators of the original data set (expressed as mean ± std).

| Indicator | Standardization (Training) | Standardization (Testing) | Normalization (Training) | Normalization (Testing) |
|---|---|---|---|---|
| MAE | 0.770 ± 0.042 | 0.591 ± 0.053 | 0.928 ± 0.067 | 0.572 ± 0.049 |
| MSE | 1.186 ± 0.089 | 0.489 ± 0.061 | 1.523 ± 0.112 | 0.424 ± 0.057 |
| R2 | 0.78 ± 0.021 | 0.80 ± 0.034 | 0.72 ± 0.028 | 0.83 ± 0.029 |
| OBJnew | 0.937 ± 0.051 | | 1.102 ± 0.073 | |

Note: OBJnew is computed once per preprocessing scheme (over training and testing jointly).
Table 3. Performance evaluation indicators of the MLP model with data expansion (expressed as mean ± std).

| Groups | Scheme | Set | MAE | MSE | R2 | OBJnew |
|---|---|---|---|---|---|---|
| 1000 | Standardization | Training | 0.285 ± 0.023 | 0.180 ± 0.021 | 0.96 ± 0.012 | 0.404 ± 0.035 |
| 1000 | Standardization | Testing | 0.571 ± 0.041 | 0.394 ± 0.038 | 0.84 ± 0.027 | |
| 1000 | Normalization | Training | 0.489 ± 0.035 | 0.442 ± 0.042 | 0.89 ± 0.015 | 0.604 ± 0.048 |
| 1000 | Normalization | Testing | 0.549 ± 0.038 | 0.558 ± 0.045 | 0.78 ± 0.032 | |
| 3000 | Standardization | Training | 0.203 ± 0.018 | 0.096 ± 0.013 | 0.98 ± 0.009 | 0.334 ± 0.028 |
| 3000 | Standardization | Testing | 0.511 ± 0.035 | 0.377 ± 0.031 | 0.85 ± 0.022 | |
| 3000 | Normalization | Training | 0.418 ± 0.029 | 0.313 ± 0.034 | 0.93 ± 0.011 | 0.439 ± 0.039 |
| 3000 | Normalization | Testing | 0.465 ± 0.033 | 0.372 ± 0.036 | 0.85 ± 0.024 | |

Note: OBJnew is computed once per expansion size and preprocessing scheme (over training and testing jointly).