Non-Destructive Micromagnetic Determination of Hardness and Case Hardening Depth Using Linear Regression Analysis and Artificial Neural Networks

Abstract: Non-destructive determination of workpiece properties after heat treatment is of great interest for quality control in production, but also for preventing damage in the subsequent grinding process. Micromagnetic methods offer good possibilities, but must first be calibrated with reference analyses on known states. This work compares the accuracy and reliability of different calibration methods for the non-destructive evaluation of carburizing depth and surface hardness of carburized steel. Linear regression analysis is compared with new methods based on artificial neural networks. The comparison shows a slight advantage for the neural network method and potential for further optimization of both approaches. The quality of the results can be influenced, among other factors, by the number of teaching steps for the neural network, although more teaching steps do not always improve the accuracy for conditions not included in the initial calibration.


Introduction
The heat treatment state of a case-hardened steel workpiece, especially the surface hardness and the case hardening depth, which often also correlates with the surface oxidation depth, is decisive for the final service properties of high-performance parts and therefore has to be controlled continuously in industrial production. Furthermore, these surface properties influence the grindability of the components as well as the micromagnetic detectability of grinding damage [1,2]. Knowledge of these properties therefore allows optimal grinding parameters to be determined as a function of the component's initial state and can enable reliable in-process micromagnetic monitoring of grinding processes [3]. Besides the established destructive testing methods, which are precise but time-consuming, the heat treatment condition can also be assessed non-destructively using micromagnetic methods such as Barkhausen noise analysis or the 3MA-technique [4].
Using micromagnetic methods, hardness is in most cases determined by calibration using linear regression [4,5]. To determine the case hardening depth by means of Barkhausen noise analysis, several additional approaches exist that exploit the different properties of the soft core and the hard surface layer. For example, Send et al. observed additional peaks and asymmetries of the Barkhausen peak generated by the hardened case, which could be described with different parameters and correlated with the case hardening depth [6]. In [7], Santa-aho et al. employed sweeps of the magnetization voltage to determine the maximum slope of the Barkhausen noise for two magnetizing frequencies (and therefore different penetration depths) and correlated the ratio with the case hardening depth. The 3MA-technique combines four micromagnetic methods with different penetration depths. Besides Barkhausen noise, it includes, among others, the harmonic analysis of the tangential magnetic field. Here, the case hardening depth could be correlated with the frequency limit at which the distortion factor approaches an asymptote [8]. Further influences of case depth on the frequency characteristics of various micromagnetic signals were observed in [9], but could not be fully explained.
In [10] and [11], among others, artificial neural networks were used to determine hardness from hysteresis loop and eddy current measurements and from Barkhausen noise and tangential magnetic field analysis, respectively. Besides hardness, Liu et al. determined residual stresses and case depth from a combination of Barkhausen noise, tangential magnetic field, and hysteresis measurements by use of artificial neural networks [12]. Sorsa et al. compared the suitability of linear regression, artificial neural networks, and fuzzy models for the determination of residual stresses from Barkhausen noise measurements. They found better performance indices, RMSE (root mean square error) and R² (coefficient of determination), for artificial neural networks than for linear regression, but also pointed out an advantage of linear regression: this method has a low risk of overfitting and is therefore well suited for small calibration data sets and extrapolation. To achieve good regression results, a preselection of significant features was performed, as too high a number of inputs led to overfitting of the network [13]. Furthermore, in [10] only a single-digit number of measured variables was used for hardness determination. The selection criterion was the ratio of standard deviation to mean value. The performance could be improved by additional measured variables. The influence of the size of the data set is described in [11]: an extension of the calibration data set leads to a decrease of the standard error. For the given example of 17 measurement points over a Jominy sample with three hardness zones, the standard error was below 5% of the maximum hardness once nine or more data points were used for calibration.
Further recent studies deal with the correlation of magnetic properties with hardness, residual stresses, and other properties. In general, decreasing hardness [14] and increasing (tensile) residual stresses [15,16] lead to an increase in the Barkhausen noise level. While the increase of Barkhausen noise with residual stresses is in some cases not linear due to poor magnetization, the averaged permeability derived from the hysteresis loop shows a much better linear correlation [16]. Other works describe a linear increase with rising tensile stresses for Barkhausen noise as well as permeability [17,18]. Besides the maximum and average values of Barkhausen noise and permeability, various other measured variables are determined from the raw signal depending on the measuring device. The coercivity develops roughly opposite to the Barkhausen noise or permeability level and correlates well with hardness as well as residual stresses [17,19]. The peak width decreases with increasing tensile stresses, and the remanence decreases with increasing hardness [19].
The aim of the present work is the non-destructive evaluation of the heat treatment state of case-hardened steel workpieces by means of micromagnetic 3MA measurements with variation of magnetization and analysis frequencies. Instead of the often very complex approaches for the evaluation of frequency sweeps mentioned above, different calibration strategies were developed. In order to exploit the possibilities of frequency-dependent analyses for the determination of gradient properties, a large number of measured variables were included, in contrast to previous works. After calibration using samples with known properties, the resulting calibration can simply be applied to subsequent measurements. For this purpose, classical regression analysis was compared with calibration using artificial neural networks. A sample set with three-level variation of three variables and two samples per state was prepared for calibration and analysis. Special attention was therefore paid to the challenges of calibration data sets that are small compared to the number of measured variables and to the number of influences on the heat treatment state. Standard errors of calibration data and unknown test data were compared. Possible influences of different parameters on the quality of the result were evaluated and compared.

Micromagnetic Measurements and Data Evaluation
The 3MA-II technology (Micromagnetic Multiparametric Microstructure and Stress Analysis) developed by the Fraunhofer Institute for Nondestructive Testing (IZFP) combines four micromagnetic methods: Barkhausen noise (BN), harmonic analysis of the tangential magnetic field strength (HA), incremental permeability (IP), and multi-frequency eddy current analysis (EC). Together, these methods provide 41 measured variables with different sensitivities to microstructure and stress state. Owing to their frequency ranges, the methods have different analyzing depths. The combination of measuring parameters allows different influences such as residual stresses and hardness to be separated and disturbances to be compensated [20]. An additional software tool, the so-called sweep module, enables the automated and fast variation of measurement parameters such as frequency and magnetization amplitude [4,9].
BN results from the stepwise shift of Bloch walls, which separate magnetic domains, during the magnetization of ferromagnetic materials. If no further Bloch wall shifts are possible, the domains are aligned in the direction of the field by rotation processes as the magnetization continues to increase [21]. In this region the magnetic hysteresis loop B(H) becomes flatter until it becomes horizontal in the saturation state [22]. A schematic hysteresis curve is shown in Figure 1. After amplifying, filtering, and rectifying the recorded BN signal, its envelope or profile curve is plotted over the magnetic field strength H. The shape of the hysteresis curve, the profile curve, and thus the characteristic parameters are influenced by the microstructure, the mechanical properties, and the residual stress state [9,20]. Mechanically hard materials are magnetically hard, with high coercivity H_CM, low remanence M_R, and low maximum Barkhausen noise amplitude M_MAX, because Bloch wall shifts interact with the microstructure analogously to dislocation movements. Other measured variables are the amplitude averaged over one magnetization cycle, M_MEAN, and the curve widths at 25%, 50%, and 75% of M_MAX, called DH_25M, DH_50M, and DH_75M, respectively. Compressive stresses cause a high coercivity and a low amplitude, tensile stresses a low coercivity and a high amplitude [4,23-25]. Depending on the frequency, different depth ranges can be investigated. According to (1), the exponential damping of the field defines the penetration depth as the depth at which the amplitude of the magnetic field has dropped to 1/e of its value at the surface. The relative permeability µ_r and the electrical conductivity σ are material- and heat treatment-dependent parameters [26]. The high-pass frequency of the Barkhausen noise signal can be varied stepwise between no limitation and 1 MHz, and the low-pass frequency between 500 kHz and no limitation.
Due to the base magnetization frequency between 10 Hz and 1000 Hz, harmonic analysis has the largest analyzing depth of the four testing methods, with a probing depth of up to several mm [4]. The analyzing depth of the Barkhausen noise depends on the analyzing frequency range and lies between about 10 µm and a few 100 µm [9]. In the particular case of hardened surface layers, an analyzing depth of up to 50 µm can be assumed [27]. The analyzing depth of the incremental permeability is in a similar range [9]. Depending on the set frequencies, the eddy current analysis can cover a wide range of depths from about 10 µm to a few hundred µm.
In the harmonic analysis of the tangential magnetic field strength, the sample volume to be examined is magnetized by a sinusoidal alternating field, similar to Barkhausen noise analysis. The time course of the tangential field strength during the hysteresis loop is recorded by a Hall probe and processed by Fourier analysis. In this way, the odd higher harmonics can be determined. The measured variables derived from this are the amplitudes A_3 ... A_7 and phases P_3 ... P_7 of the harmonics, the distortion factor K (2), the sum of all upper harmonics UHS, the coercive field strength H_CO of the harmonic analysis (which increases with hardness), the harmonic content of the magnetic field strength at zero crossing H_r0, and the final stage voltage of the electromagnet V_mag [4]. Soft magnetic materials generally lead to a high distortion factor and vice versa [4].
To determine the incremental permeability µ_Δ = dB/dH, the hysteresis loop is superimposed by a further, higher-frequency hysteresis loop. The signal picked up in this process corresponds to the incremental permeability, which would otherwise have to be determined by alternating field strength changes at individual points of the hysteresis loop and would require an enormous amount of time in practice. Plotted against the magnetic field strength, the incremental permeability µ_Δ results in a curve shape similar to that of the Barkhausen noise amplitude M, and the measured variables are determined in the same way. While the Barkhausen noise is based on irreversible Bloch wall jumps, the incremental permeability reflects reversible processes [8].
In eddy current analysis, a weak high-frequency alternating magnetic field is generated. This primary field generates electric eddy currents, which are accompanied by a secondary field in the opposite direction to the primary field. A receiver coil measures the magnetic field as an induced voltage. The eddy currents and thus the induced voltage are influenced by both the conductivity and the permeability of the sample. The 3MA-II technology allows the simultaneous use of four eddy current frequencies. The real parts Re_1 ... Re_4 and imaginary parts Im_1 ... Im_4 as well as the magnitudes Mag_1 ... Mag_4 and phase angles Ph_1 ... Ph_4 of the voltage are output as measured variables [4].
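The frequency dependence of the analyzing depth can be illustrated with a short calculation. The sketch below assumes that the penetration depth follows the standard skin-depth relation δ = 1/√(π f µ_0 µ_r σ); the function name and the material values µ_r = 100 and σ = 4·10^6 S/m are hypothetical placeholders chosen only for illustration, not values from this work.

```python
import math

MU_0 = 4.0e-7 * math.pi  # vacuum permeability in H/m


def penetration_depth(frequency_hz, mu_r, sigma_s_per_m):
    """Standard skin depth in metres: depth at which the field amplitude falls to 1/e."""
    return 1.0 / math.sqrt(math.pi * frequency_hz * MU_0 * mu_r * sigma_s_per_m)


# Hypothetical material values, for illustration only (not measured in this work).
MU_R_EXAMPLE = 100.0
SIGMA_EXAMPLE = 4.0e6  # S/m

for f in (10.0, 1000.0, 1.0e6):
    depth_um = penetration_depth(f, MU_R_EXAMPLE, SIGMA_EXAMPLE) * 1e6
    print(f"{f:>10.0f} Hz -> {depth_um:10.1f} µm")
```

Because the depth scales with 1/√f, a hundredfold increase in frequency reduces the penetration depth by a factor of ten, which is why low magnetization frequencies probe the millimetre range and high analyzing frequencies only the near-surface layer.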
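As an illustration of how such harmonic quantities can be obtained, the following sketch applies a Fourier transform to one period of a synthetic tangential field signal. It is not the 3MA implementation: the signal is artificial, the function names are hypothetical, and the distortion factor is assumed here to be the root sum of squares of the 3rd, 5th, and 7th harmonic amplitudes relative to the fundamental, in percent.

```python
import numpy as np


def odd_harmonics(signal, samples_per_period):
    """Amplitudes and phases of harmonics 1, 3, 5, 7 from one sampled period."""
    spectrum = np.fft.rfft(signal[:samples_per_period]) / samples_per_period
    orders = (1, 3, 5, 7)
    amplitudes = {k: 2.0 * np.abs(spectrum[k]) for k in orders}
    phases = {k: np.angle(spectrum[k]) for k in orders}
    return amplitudes, phases


def distortion_factor(amplitudes):
    """Assumed definition: sqrt(A3^2 + A5^2 + A7^2) / A1 in percent."""
    upper = sum(amplitudes[k] ** 2 for k in (3, 5, 7))
    return 100.0 * np.sqrt(upper) / amplitudes[1]


# Synthetic signal: fundamental plus 3rd and 5th harmonic (illustrative values).
n = 1024
t = np.arange(n) / n
h_t = (1.00 * np.sin(2 * np.pi * t)
       + 0.10 * np.sin(2 * np.pi * 3 * t)
       + 0.05 * np.sin(2 * np.pi * 5 * t))

amps, phases = odd_harmonics(h_t, n)
k_percent = distortion_factor(amps)
print({k: round(float(a), 4) for k, a in amps.items()}, round(float(k_percent), 2))
```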

Regression Analysis
For the quantitative determination of a target quantity (e.g., hardness), it is necessary to find a calibration function that describes the relationship between the target quantity and the measured variables. Because a theoretical calculation of this relationship with physical models is possible at best for simple materials, 3MA is usually calibrated based on empirical data [4]. For this purpose, the target values are determined with a reference method (e.g., hardness testing) and stored in a database together with the micromagnetic measurements in order to analyze correlations by use of regression analysis or pattern recognition. Possible terms of the regression are the measured variables of 3MA themselves, their squares, and their square roots. The coefficients are determined by the method of least squares [4,8,28]. Measures for the goodness of a regression are the coefficient of determination R² and the root mean square error RMSE [28].
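A least-squares calibration of this kind can be sketched in a few lines. The example below is an illustrative assumption rather than the procedure of the 3MA software: it fits all candidate terms at once instead of selecting a limited number of them, the two input variables and the target values are synthetic, and all function names are hypothetical.

```python
import numpy as np


def build_terms(x):
    """Candidate terms per measured variable: value, square, and square root."""
    return np.hstack([x, x ** 2, np.sqrt(x)])


def design_matrix(x):
    """Constant term plus all candidate terms."""
    return np.hstack([np.ones((x.shape[0], 1)), build_terms(x)])


def fit_least_squares(x, y):
    coefficients, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
    return coefficients


def predict(x, coefficients):
    return design_matrix(x) @ coefficients


def goodness_of_fit(y, y_hat):
    """Coefficient of determination R^2 and root mean square error RMSE."""
    residuals = y - y_hat
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    r2 = float(1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2))
    return r2, rmse


# Synthetic calibration data: two positive "measured variables", one target.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 4.0, size=(40, 2))
y = 700.0 - 30.0 * x[:, 0] + 5.0 * x[:, 1] ** 2 + rng.normal(0.0, 2.0, size=40)

coef = fit_least_squares(x, y)
r2, rmse = goodness_of_fit(y, predict(x, coef))
print(f"R2 = {r2:.4f}, RMSE = {rmse:.2f}")
```

Note that both figures are computed on the calibration data themselves; they say nothing yet about the accuracy for samples that were not part of the calibration.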
An adjustment of the regression algorithm makes it possible to minimize the effect of drifts of measured variables on the regression result. This is of special interest for low-frequency and systematic stochastic errors, for example due to ageing or wear of the sensor, that cannot be reduced by a larger number of measurements in the calculation of averages. The range of values W gives the minimum and maximum value of a partial expression in the calibration data set.
(3) gives the modulation M and (4) the 1% error effect F_1. If the measured value changes by 1% of W_max, the regression result changes by the value F_1. The larger the error effect, the more reproducible the measured variables have to be. The calibration wizard of the 3MA-MMS software offers the option to set the highest tolerated F_1. This limitation reduces the possible expressions to those that fulfill the condition F_1 < F_1,max. Too strong a limitation of the 1% error effect, however, also leads to a decreasing coefficient of determination R² and a rising standard error RMSE, as the remaining terms have not only a small error effect but also only a small effect overall [8].
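The idea behind the 1% error effect can be illustrated numerically. The sketch below does not reproduce Equations (3) and (4); it follows only the verbal definition and assumes that one measured variable is shifted by 1% of its largest absolute value in the calibration data set, after which the largest resulting change of the regression output is reported. The calibration function and data are hypothetical.

```python
import numpy as np


def one_percent_error_effect(predict, x_calibration, variable_index):
    """Largest change of the prediction when one variable shifts by 1% of its maximum."""
    shift = 0.01 * np.max(np.abs(x_calibration[:, variable_index]))
    x_shifted = x_calibration.copy()
    x_shifted[:, variable_index] += shift
    return float(np.max(np.abs(predict(x_shifted) - predict(x_calibration))))


# Hypothetical calibration function with one linear and one squared term.
def example_predict(x):
    return 700.0 - 30.0 * x[:, 0] + 5.0 * x[:, 1] ** 2


x_cal = np.array([[1.0, 1.0], [2.0, 3.0], [4.0, 2.0]])
f1_var0 = one_percent_error_effect(example_predict, x_cal, 0)
f1_var1 = one_percent_error_effect(example_predict, x_cal, 1)
print(f1_var0, f1_var1)
```

Under this reading, a limit F_1,max would exclude candidate terms whose effect exceeds the tolerated value, at the cost of discarding terms that also carry much of the useful signal.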

Artificial Neural Networks
Artificial neural networks (ANNs) consist of several interconnected processing units called neurons, which can be divided into input, hidden, and output units and are arranged in layers (Figure 2a). The input units of the first layer distribute the components of the net input vector to the units of the second layer. The second to the second-to-last layers are hidden layers [29,30]. Every neuron comprises data collection (neuron input), processing, and the sending of results (neuron output), as shown in Figure 2b. To understand the process, it is important to differentiate between the net input or output (Figure 2a) and the neuron input or output (Figure 2b). The weights of the links control the effect of the inputs on a neuron. They are changed during training of the ANN to optimize the relation between input and output, and they carry the information of the ANN. The weighted neuron inputs are summed, and the transfer function determines the neuron output. Several linear and nonlinear transfer functions are available when constructing the network [31,32]. For the training of an ANN (i.e., the optimization of the weights), there are three general learning strategies: supervised, unsupervised, and reinforcement learning. In supervised learning, the training set consists of (net) inputs and the related outputs. In unsupervised learning, only (net) inputs are given and the network learns without intervention of the trainer. Reinforcement learning means that the trainer only indicates whether the output of the network is correct or not. To evaluate the quality of the trained network in a test phase with unknown input, the pattern set is split into a teaching set and a test set before the teaching phase [29]. Furthermore, it is necessary to normalize the net input and output values to the range 0-1 before teaching in order to obtain similar ranges for all variables [31,32].
A commonly used training algorithm for nets with an input layer, an output layer, and hidden layers is the backpropagation algorithm. Every teaching step consists of a forward pass and a backward pass. The forward pass starts with the input layer; the output of every unit is transmitted to the next layer until the output layer is reached. The estimated net outputs ŷ_k are compared with the correct outputs y_k. In the backward pass, the differences are passed back stepwise from the output layer to the first hidden layer, and the weights are updated to minimize the error of the next teaching step [31,32].
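The forward and backward pass can be made concrete with a very small network. The following sketch is a generic illustration with one hidden layer, logistic transfer functions, and inputs and outputs normalized to the range 0-1; it is not the MemBrain configuration used in this work, and the network size, learning rate, number of teaching steps, and synthetic data are assumptions chosen only to keep the example short.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def normalize(values):
    """Scale each column to the range 0-1."""
    lo, hi = values.min(axis=0), values.max(axis=0)
    return (values - lo) / (hi - lo)


def train(x, y, hidden_units=4, teaching_steps=2000, learning_rate=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, size=(x.shape[1], hidden_units))
    b1 = np.zeros(hidden_units)
    w2 = rng.normal(0.0, 0.5, size=(hidden_units, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    errors = []
    for _ in range(teaching_steps):
        # Forward pass: input layer -> hidden layer -> output layer.
        hidden = logistic(x @ w1 + b1)
        output = logistic(hidden @ w2 + b2)
        diff = output - y
        errors.append(float(np.sqrt(np.mean(diff ** 2))))
        # Backward pass: propagate the differences back and update the weights.
        delta_out = diff * output * (1.0 - output)
        delta_hidden = (delta_out @ w2.T) * hidden * (1.0 - hidden)
        w2 -= learning_rate * hidden.T @ delta_out / len(x)
        b2 -= learning_rate * delta_out.mean(axis=0)
        w1 -= learning_rate * x.T @ delta_hidden / len(x)
        b1 -= learning_rate * delta_hidden.mean(axis=0)
    return (w1, b1, w2, b2), errors


# Synthetic data: three inputs, one target (illustrative only).
rng = np.random.default_rng(1)
raw_inputs = rng.uniform(0.0, 10.0, size=(30, 3))
raw_targets = 2.0 * raw_inputs[:, :1] + raw_inputs[:, 1:2]

weights, rmse_history = train(normalize(raw_inputs), normalize(raw_targets))
print(f"RMSE first step: {rmse_history[0]:.3f}, last step: {rmse_history[-1]:.3f}")
```

The recorded error refers to the teaching data only; whether additional teaching steps also help for unknown samples has to be checked on a separate test set.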

Materials and Methods
The sample set consists of 54 discs made from one batch of steel AISI 4820 (DIN 18CrNiMo7-6) with a diameter of 68 mm and a thickness of 20 mm. The chemical composition of the steel batch was analyzed with an optical emission spectrometer ARL 3460 (Thermo Fisher Scientific, Waltham, MA, USA) and is given in Table 1. The samples were gas carburized and oil quenched in the 27 variations given in Table 2. The heat treatment was performed in a chamber furnace (Aichelin Holding GmbH, Mödling, Austria). Details of the case hardening are shown in Figure 3 and Table 3. After oil quenching, the samples were tempered for two hours at the temperatures given in Table 2. Surface hardness and carburization depth, as an approximation of the case hardening depth (CHD), were determined on smaller coupon samples treated in the same heat treatment batches as the samples for the investigations. Surface hardness was measured according to Vickers with an LV-700AT (LECO Instrumente GmbH, Mönchengladbach, Germany) with a test load of 9.807 N (HV1). To determine the carburization depth, carbon depth profiles were recorded using spark optical emission spectroscopy (ARL 3460, Thermo Fisher Scientific, Waltham, MA, USA). The carburization depth is defined as the depth with a carbon content of 0.3 wt.%.
Micromagnetic measurements were performed with a 3MA-II device from Fraunhofer IZFP, Saarbrücken, Germany, and a standard sensor with convex pole shoes and a spring-mounted transducer unit. The measurement settings are given in Table 4. The magnetization frequency, the high-pass frequency, and the eddy current frequency of the incremental permeability IP were varied with the sweep module to record a total of 425 measurement quantities with different analyzing depths. All samples were measured ten times at one position of the circumference with magnetization in the tangential direction. After a first check of the data set and removal of samples with obvious outliers, calibration was carried out using the calibration module of the 3MA software and also by artificial neural networks. For calibration with linear regression analysis, the data obtained with a magnetization frequency of 20 Hz were removed from the data set. At this frequency, instability of the magnetization meant that several measurement parameters of the IP could not be determined consistently. Because only complete measurements with all parameters are used for regression, keeping this frequency would have reduced the information content of the data set; it was therefore removed from all data sets.
Linear regression with the "calibration wizard" of the 3MA software (Fraunhofer IZFP, Saarbrücken, Germany) [4,8] was carried out with carburization depth and hardness as target values. The maximum number of terms of the polynomial was set to 10. As mentioned in Chapter 1, the limitation of the maximum error effect leads to an improvement especially for low-frequency stochastic errors (e.g., due to wear or ageing of the sensor). Nevertheless, the maximum error effect was varied here to identify possible effects, e.g., in the context of deviations due to the contact between sensor and sample. The maximum error effect was reduced stepwise to find a setting with reduced error effect without too strong a deterioration of the regression quality. For both the unlimited and the optimized error effect, R² and RMSE were evaluated not only for the calibration data set but also for test samples that were not included in the calibration/training.
When dividing the data into a test and a calibration data set, too small a calibration data set would lead to a poor regression result. In contrast, a large calibration data set reduces the test data set and thus the reliability of the validation. Therefore, to maximize both the calibration and the test data set, the autorecognition test described in [8] was used. Each of the k samples (10 measurements each) is taken out of the calibration data set and used for validation once. After k calibrations with k − 1 samples each, every sample has been used for validation exactly once.
For the artificial neural network analysis, 425 input neurons and the same number of hidden neurons were employed, together with output neurons for carburization depth and hardness as net outputs. The net was constructed in the software MemBrain [33,34] (free version for non-commercial and educational use, Thomas Jetter, Mainz, Germany). The activation function (transfer function) is a logistic function. The net was trained with the backpropagation method over 30 and 60 repetitions of the teaching lesson. The results were evaluated in the same way as described for the regression analysis.
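The validation scheme described here corresponds to leaving out one sample at a time together with all of its repeat measurements. The sketch below illustrates this with a simple linear calibration on synthetic data; the function names, the number of samples, and the data are hypothetical and serve only to show the bookkeeping.

```python
import numpy as np


def autorecognition_rmse(x, y, sample_ids, fit, predict):
    """Leave each sample (with all its repeat measurements) out once and predict it."""
    squared_errors = []
    for sample in np.unique(sample_ids):
        held_out = sample_ids == sample
        model = fit(x[~held_out], y[~held_out])
        squared_errors.append((predict(model, x[held_out]) - y[held_out]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))


def fit_linear(x, y):
    design = np.hstack([np.ones((len(x), 1)), x])
    return np.linalg.lstsq(design, y, rcond=None)[0]


def predict_linear(coefficients, x):
    return np.hstack([np.ones((len(x), 1)), x]) @ coefficients


# Synthetic data: 8 samples with 10 repeat measurements each (illustrative only).
rng = np.random.default_rng(0)
n_samples, repeats = 8, 10
sample_ids = np.repeat(np.arange(n_samples), repeats)
true_level = rng.uniform(1.0, 3.0, size=n_samples)[sample_ids]
x = (true_level + rng.normal(0.0, 0.02, size=true_level.size)).reshape(-1, 1)
y = 650.0 + 40.0 * true_level

test_rmse = autorecognition_rmse(x, y, sample_ids, fit_linear, predict_linear)
print(f"RMSE of held-out samples: {test_rmse:.2f}")
```

Holding out all repeat measurements of a sample together matters: if single measurements were left out instead, the remaining repeats of the same sample would stay in the calibration set and the test error would look better than it is.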
To get an idea about advantages and disadvantages with the use of sweeps, additional calibrations were performed with only the measured variables of the standard configuration. Linear regression was performed without limitation of the maximum error effect and network training with 30 repetitions.

Results and Discussion
As a result of the heat treatment variations described in Table 2, material states with carburization depth of approximately 0.55 mm, 0.9 mm, and 1.9 mm and surface hardness between 640 HV1 and 760 HV1 were generated. All measured data are presented in Table A1.

Calibration by Use of Regression Analysis
Linear regression of the hardness without limitation of the maximum error effect leads to an error effect of F_1 = 9.274 HV1. The coefficient of determination of this regression is R² = 0.8709 and the standard error is RMSE = 12.919. Table 5 shows the change of the coefficient of determination and the standard error due to limitation of the error effect. A limitation to F_1 = 3 HV1 was chosen as the optimum, since a stronger limitation would cause too strong a decrease of the coefficient of determination and increase of the standard error. The micromagnetically determined hardness of the calibration data set plotted over the measured hardness is shown in Figure 7. There is no pronounced difference due to the limitation of the error effect. Each plotted data point represents the average of the 10 micromagnetic measurements per sample and the associated standard deviation. This averaging is the reason for the difference between the standard errors given in Table 5 and those of the plotted data.
The terms of the regression are summarized in Table 6. No measured variable from eddy current analysis is part of the calibration function; instead, it contains values from harmonic analysis (A_3-A_7, P_3-P_7, V_mag, H_CO), Barkhausen noise (M_MAX, H_CM, M_R), and incremental permeability (DH_25µ, µ_r) at different frequencies. As the values for the regression analysis are not normalized, the regression coefficients do not allow any conclusion about the relevance of the individual terms. It is noticeable that some measured variables appear with both a positive and a negative sign. For example, M_MAX is part of the regression result as a negative term, which corresponds to the state of the art, but later M_MAX² appears as a positive term. H_CM² is part of the result as a negative term, whereas H_CM typically increases with hardness. Plotting the hardness over the various measured variables shows the best correlation with the remanence of the incremental permeability measurement (Figure 9).
However, no obvious relationship between hardness and the measured variables from harmonic analysis could be observed. The different suitability of the measurement methods for determining hardness can be explained by the different penetration depths depending on the frequency. An overview of the correlation of different measured variables with hardness and carburization depth is given in Table A2. The comparison with Table 6 and Table 9 shows that not all terms of the regression result are visibly correlated with hardness or carburization depth, respectively. On the one hand, this indicates that the calibration would also be possible with a lower number of terms. On the other hand, due to the variation of surface carbon content, case hardening depth, and tempering temperature, the signals are affected by multiple influences. Measured variables with a coefficient of determination of more than 0.25 are the remanence of Barkhausen noise and of incremental permeability, both of which correlate negatively with hardness, and the peak width at 25% of the maximum of the incremental permeability. This is consistent with the known general correlations. To make sure that only relevant measurement parameters are included in the calibration, the number of possible terms was limited to less than 10 (Table 7). A stronger limitation of the number of regression terms leads to a decrease of the coefficient of determination and an increasing standard error, but at the same time to a clear decrease of the error effect. A regression with a high number of terms fits the teaching data set very well but does not necessarily improve the result for unknown test data (shown by way of example for 20, 10, and 6 terms). A limitation of the number of terms thus has a similar effect to the limitation of the maximum error effect. The most important measured variable, which is used to calculate hardness with only two terms, is the remanence of the incremental permeability µ_r.
Additionally, a comparison of Figure 9a,b shows no difference in the qualitative evolution for different frequencies. A limitation to fewer frequency variants or a single frequency should therefore be possible without negative effects on the calibration and would reduce the measurement and calibration effort. Furthermore, this is in agreement with the results of previous studies, where calibration was carried out with only a few preselected variables.
Similar to the hardness, the measurements were calibrated for carburization depth. Table 8 again shows the regression quality as a function of the maximum error effect. A maximum error effect of 0.06 mm was chosen as the optimum. Figure 10a shows the carburization depth calculated without limitation of the error effect plotted over the measured carburization depth. For this calibration data set, there is good agreement between measured and calculated values with a standard error of 0.074 mm, which is about 6% of the range of target values. The limitation of the error effect (Figure 10b) affects the goodness of the prediction only slightly, with only a small increase in RMSE. The same calibration types with and without limitation of the error effect are illustrated for the test data set in Figure 11. The standard errors are again clearly higher than those of the calibration data set but are still within an acceptable range of reliability, with errors around 10% of the target values. Without taking into account the outlier at 1.87 mm, the RMSE is 0.105 mm, i.e., only 8% of the range of target values. As for the calibration data set, no pronounced effect due to the limitation of the maximum error effect can be observed, and the standard error increases slightly. The regression terms for the determination of carburization depth are shown in Table 9. Again, the function contains measured variables from different methods and frequencies.
Compared to the hardness determination, the importance of harmonic analysis and low frequencies (larger penetration depth) increases. The measured variables with the best coefficients of determination (see Table A2) are K, H_CM, and H_CO, which all correlate positively with carburization depth. This corresponds to the general relationship that these measured variables increase with hardness. Again, a stronger limitation of the number of regression terms leads to lower coefficients of determination and lower error effects (Table 10). To calculate the carburization depth of unknown samples, the function with 10 terms gives good results, but a limitation to six or eight regression terms could also be used as a good compromise between the standard error and the maximum error effect. Similar to the observations in Table 9, regression results with six or fewer terms are based on measured variables from harmonic analysis and the coercivity H_CM (80 Hz).

Calibration by Use of Artificial Neural Networks
The output of the artificial neural network plotted over the measured hardness is shown in Figure 12 after 30 (a) and 60 (b) iterations of the teaching lesson. The standard error is much lower than after calibration with linear regression and decreases significantly when the number of teaching lessons is doubled. Both the RMSE of 7.1 HV1 after 30 teaching lessons and the RMSE of 3.7 HV1 after 60 lessons are in the same range as the standard deviation of Vickers hardness measurements. For the unknown test data (autorecognition test), the standard error between measured hardness and net output is much higher than for the training data (see Figure 13). It is slightly lower than after regression analysis, but the difference between teaching and test data set is significantly higher. Doubling the number of training lessons does not lead to a pronounced change in accuracy; the standard error increases slightly. This illustrates the risk of overfitting: if the network adapts too much to the training data, this is at the expense of the quality of the prediction for unknown test data. For optimization, the number of teaching lessons should therefore be increased stepwise. While the standard error of the training data set continues to decrease or approaches an asymptote, the standard error of the test data set will rise if the net overfits the training data. With the size of the network, the ability to fit complex solutions increases, as does the risk of overfitting [35]. Therefore, besides the number of training steps, the number of input variables also offers optimization potential. Since doubling the training steps did not improve the results for the calculation of hardness, Figure 14 shows only the network output for carburization depth after 30 repetitions of the training lesson. Nevertheless, for the mentioned optimization, the influence of the number of training steps on both target values should be examined in more detail.
Despite some outliers at a carburizing depth of 0.9 mm, this is the lowest standard error. At the same time, the standard error increases more than threefold from the training to the test data set. Possible reasons for this and potential improvements have already been mentioned for hardness. In addition, an extension of the sample set by further carburizing depths between 1 mm and 2 mm or of more than 2 mm could be useful to extend the range of the training data.

Calibration with Standard Configuration and Variation of Measurement Parameters
To compare the calibration results for measurements with the standard configuration (41 measured variables) and with sweeps (425 measured variables), the validation strategy was changed. To reduce the validation effort, the autorecognition test was performed for only 14 samples (one sample of each of the states marked in bold in Table A1). The deviation between the standard errors in Figure 15b, Figure 7a, and Figure 8a illustrates the problem of small test data sets: some samples that previously led to outliers are not part of the reduced test data set. Therefore, only standard errors based on the same test data set (Figure 15; Table 11) should be compared with each other. With the frequency sweeps, the standard error for hardness determined by linear regression decreases for both the calibration and the test data set.
Table 11 summarizes the standard errors of hardness and carburization depth for calibration with linear regression and ANN, with the standard configuration and with sweeps. In general, varying the measurement parameters (sweep) leads to a significant reduction of the standard error. This reduction is particularly pronounced for the determination of the carburization depth by ANN. For the determination of the hardness by ANN, only the standard error of the calibration data set decreases, while that of the test data set increases. This is attributed to the already mentioned problem of overfitting. Table 12 gives an overview of the different types of calibration (with sweeps) for the prediction of the hardness and the case hardening depth. While the standard errors of the calibration data sets differ widely, the quality of calibration for the test data set is slightly better for the ANN method than for linear regression.
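The requirement that standard errors be compared only on an identical test data set can be made explicit in code. The following sketch is illustrative only: the function name `rmse_on_common_samples`, the sample labels, and all hardness values are hypothetical and are not taken from Table 11. It restricts the RMSE of two calibration methods to the samples for which both methods provide a prediction.

```python
import math


def rmse_on_common_samples(measured, predicted_a, predicted_b):
    """Compute the RMSE of two methods on their shared test samples only.

    All arguments are dicts mapping a sample label to a value.
    Returns (sorted list of common labels, RMSE of method A, RMSE of method B).
    """
    common = sorted(set(measured) & set(predicted_a) & set(predicted_b))
    if not common:
        raise ValueError("the two methods share no test samples")

    def _rmse(predicted):
        squared = [(predicted[s] - measured[s]) ** 2 for s in common]
        return math.sqrt(sum(squared) / len(common))

    return common, _rmse(predicted_a), _rmse(predicted_b)


# Hypothetical hardness values in HV1 (made up for illustration).
measured = {"S1": 700.0, "S2": 650.0, "S3": 600.0}
regression = {"S1": 703.0, "S2": 646.0, "S3": 600.0}  # tested on 3 samples
network = {"S1": 706.0, "S2": 642.0}                  # tested on 2 samples

common, rmse_regression, rmse_network = rmse_on_common_samples(
    measured, regression, network
)
print("compared on samples:", common)
print(f"regression RMSE: {rmse_regression:.2f} HV1")
print(f"network RMSE:    {rmse_network:.2f} HV1")
```

Sample S3, which only one method was tested on, is excluded from both error values, so an outlier present in only one test set cannot distort the comparison.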
It is noticeable that, relative to the range of target values, the accuracy of the hardness determination is generally lower than that of the carburization depth, which suggests that unaccounted interfering variations of material properties may influence the calculated results. In contrast to linear regression analysis, the ANN does not reveal how the measured variables enter the result; it is a black box. For the mentioned reduction of the input variables, the knowledge from the regression analysis can provide a starting point. For the analysis of the hardness, parameters with low penetration depths, as in incremental permeability, were observed to be of great importance. To determine the carburization depth, methods and parameters with greater penetration depths, such as harmonic analysis with low magnetization frequency, are required. Additionally, the comparison in Chapter 3.3 has shown that frequency sweeps improve the calibration results. Therefore, it seems useful to use more than one magnetization frequency for the determination of carburization or case hardening depth, as data related to the property gradients are then recorded and analyzed. For the determination of surface hardness, this brings no significant benefit. To avoid overfitting and to reduce the measurement and calibration effort, the very large number of frequency variants should also be reduced for carburization depth by choosing a few relevant frequencies over the whole range. Apart from the selection of suitable measurement methods and frequencies depending on the target, the evaluation of eddy current results could be skipped for this application. Furthermore, a preselection of measured variables as described in previous work is recommended to reduce the calibration time and the risk of overfitting of the ANN.
Possible criteria are the correlation of the single terms with the target value and low standard deviations, especially for similar measured variables such as those from Barkhausen noise and incremental permeability. Another way to reduce the complexity of linear regression would be to exclude roots and squares of the values, but the goodness of fit is expected to decrease as well. Nevertheless, with the strategies used, hardness standard errors under 3% of the maximum target value could be achieved. Unlike most previous studies, the dataset used here varied three variables (surface carbon content, case hardening depth, and tempering temperature). Limits on the maximum error effect and the number of terms for linear regression, and on the number of training steps for ANN, were identified as options for optimization. For practical use, a stronger limitation seems useful in order to reduce the influence of small deviations, for example, due to inconsistent sensor contact. The chosen values for number of terms, maximum error effect, and regression steps show opportunities but have to be optimized for each specific application.
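A preselection of measured variables by their correlation with the target value, as suggested above, could look like the following sketch. It is an illustration under assumptions, not the procedure of the cited previous work: the function name `preselect_by_correlation`, the variable names, and the cut-off `k` are hypothetical, the data are synthetic, and only the correlation criterion is shown (the standard deviation criterion is not modelled).

```python
import numpy as np


def preselect_by_correlation(X, y, names, k):
    """Rank measured variables by |Pearson r| with the target; keep the top k.

    X: array of shape (samples, variables); y: target values; names: labels.
    Variables with zero variance get r = 0 and are therefore ranked last.
    Returns a list of (name, r) tuples, strongest correlation first.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, (Xc.T @ yc) / denom, 0.0)
    order = np.argsort(-np.abs(r))[:k]
    return [(names[i], float(r[i])) for i in order]


# Synthetic stand-in for measured variables (NOT data from this work):
# the target depends on x0 and x3 only; x5 is a constant column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 5] = 5.0
y = X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=200)
names = [f"x{i}" for i in range(6)]

ranking = preselect_by_correlation(X, y, names, k=6)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
selected = [name for name, _ in ranking[:2]]
print("selected variables:", selected)
```

A simple linear correlation only captures monotonic, roughly linear relationships; variables that matter only in combination with others would be missed by such a filter.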

Conclusions
This work has shown and compared the opportunities of calibration with linear regression and artificial neural networks for the non-destructive determination of hardness and carburization depth by micromagnetic measurements. The standard error (RMSE) of the calibration and test data sets was used as an indicator of the quality of the calibration. Even though the standard errors of the calibration data sets differed considerably, the standard errors of the test data sets were generally higher than those of the calibration sets but comparable for all calibration strategies.
The best results for the investigation of unknown samples were achieved with an artificial neural network and a moderate number of teaching steps, which limited the effect of overfitting to the calibration data set. Further potential for improvement lies in the optimal selection of the number of teaching steps and input variables. The latter also applies to the calibration with linear regression, as does the extension of the sample set. Increasing or decreasing the number of terms in the calibration function obtained by linear regression can improve or degrade the standard error, but also affects the maximum error effect. Therefore, the optimum has to be determined based on the available data set. Great potential also lies in the targeted selection of measured variables. While 425 variables were used here, a limitation can reduce the measurement and calibration effort as well as the risk of overfitting. For further studies, the sample set should also be extended to more than one steel batch in order to include the batch influence in the calibration.
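How the number of terms affects the calibration and test errors of a linear regression can be explored with a sketch like the one below. It rests on several assumptions: the function name `rmse_vs_number_of_terms` is hypothetical, the candidate terms are assumed to be supplied already ranked, the fit is an ordinary least-squares fit with intercept, the data are synthetic, and the maximum error effect is not modelled.

```python
import numpy as np


def rmse_vs_number_of_terms(T_cal, y_cal, T_test, y_test):
    """Fit least-squares models on the first m term columns, m = 1..p.

    The columns of T_cal and T_test are assumed to be pre-ranked terms.
    Returns a list of (m, calibration RMSE, test RMSE).
    """
    T_cal = np.asarray(T_cal, dtype=float)
    T_test = np.asarray(T_test, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    results = []
    for m in range(1, T_cal.shape[1] + 1):
        # Design matrices with an intercept column and the first m terms.
        A_cal = np.column_stack([np.ones(len(y_cal)), T_cal[:, :m]])
        A_test = np.column_stack([np.ones(len(y_test)), T_test[:, :m]])
        coef, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)
        cal_err = np.sqrt(np.mean((A_cal @ coef - y_cal) ** 2))
        test_err = np.sqrt(np.mean((A_test @ coef - y_test) ** 2))
        results.append((m, float(cal_err), float(test_err)))
    return results


# Synthetic stand-in data (NOT the measurements of this work):
# only the first two of eight candidate terms carry information.
rng = np.random.default_rng(2)
T = rng.normal(size=(30, 8))
y = T[:, 0] + 0.5 * T[:, 1] + 0.3 * rng.normal(size=30)

results = rmse_vs_number_of_terms(T[:20], y[:20], T[20:], y[20:])
for m, cal_err, test_err in results:
    print(f"{m} terms: calibration RMSE {cal_err:.3f}, test RMSE {test_err:.3f}")
```

Because the term sets are nested, the calibration RMSE of a least-squares fit cannot increase when a term is added, whereas the test RMSE may do so; this is why the optimum number of terms has to be judged on data not used for the fit.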
Author Contributions: Conceptualization, methodology, formal analysis, investigation, writing-original draft preparation, and visualization, R.J.; writing-review and editing, supervision, and project administration, J.E. All authors have read and agreed to the published version of the manuscript.
Acknowledgments: The scientific work has been supported by the German Research Foundation (DFG) within the research priority program (SPP) 2086 for project EP128/3-1. The authors thank the DFG for this funding and intensive technical support.

Conflicts of Interest: The authors declare no conflict of interest.

Table A1. Measured surface carbon content, carburization depth and hardness of the 27 heat treatment variations; bold printed results were used for the validation in Figure 15 and Table 11.