This section presents the implementation of the two proposed methods.
3.2. Feature Selection for the Grid Search Method
To facilitate the selection of relevant features from a dataset, a Python 3.9.13-based analytical script was developed in ref. [9]. The primary objective of this script is to identify and retain only those features that contribute meaningfully to the classification task, while discarding those that are predefined or irrelevant, such as object age or the number of radar cycles. The script begins by reading the dataset and applying a normalization technique using the MinMaxScaler. This transformation scales all feature values to a range between 0 and 1, thereby ensuring that the neural network does not disproportionately favor features with inherently larger numerical values.
Subsequently, the script evaluates feature importance by analyzing the statistical distribution of each feature across different object classes. For each class, the mean value of every feature is computed. The absolute differences between the mean of a given class and the means of the other classes are then calculated. Given the presence of three classes: pedestrian (class 1), cyclist (class 3), and car (class 5), three pairwise comparisons are performed: 1–3, 1–5, and 3–5. The mean of these three absolute differences is then computed for each feature, providing a scalar measure of its discriminative power across all classes.
For interpretability, the resulting mean values are scaled to a range of 0 to 100, yielding a normalized feature importance score. Additionally, a ranking mechanism is applied based on these scores to further elucidate the relative contribution of each feature to the classification task.
For instance, the feature “width” has a low absolute mean difference of 5.18 between classes 1 and 3, indicating limited utility in distinguishing pedestrians from cyclists. However, the same feature shows significantly higher differences in the 1–5 (142.05) and 3–5 (136.87) comparisons, suggesting its effectiveness in differentiating cars from the other two classes. Despite this, the overall rank of “width” is relatively low (rank 44), implying that other features, such as “length” (rank 8), offer more consistent discriminative power across all class comparisons. An example can be observed in Table 1, which is an excerpt from the resulting table, shown for illustrative purposes.
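The scoring procedure described above can be sketched as follows. This is a minimal illustration rather than the script from ref. [9]: the function name, the choice to compute differences on the min-max scaled values, and the min-max rescaling of the scores to 0 to 100 are assumptions made here for the example.

```python
from itertools import combinations

import numpy as np


def class_separation_scores(X, y):
    """Score each feature by the mean absolute difference of its class means."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Min-max scale every column to [0, 1]; constant columns are mapped to 0.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    X_scaled = (X - X.min(axis=0)) / span
    # Mean of every feature within each class.
    classes = np.unique(y)
    means = {c: X_scaled[y == c].mean(axis=0) for c in classes}
    # One absolute difference per class pair (1-3, 1-5, 3-5 for three classes).
    diffs = [np.abs(means[a] - means[b]) for a, b in combinations(classes, 2)]
    raw = np.mean(diffs, axis=0)
    # Rescale to 0..100 (assumption: min-max over the raw scores).
    rng = raw.max() - raw.min()
    score = 100.0 * (raw - raw.min()) / rng if rng > 0 else np.zeros_like(raw)
    # Rank 1 is the most discriminative feature.
    rank = np.empty(len(raw), dtype=int)
    rank[np.argsort(-raw)] = np.arange(1, len(raw) + 1)
    return score, rank
```

A feature whose class means are far apart for every pair of classes receives a high score, whereas a feature that separates only one class from the other two is penalized by the averaging step.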
After applying the above method, the following 29 features were found:
- corrXAccelerationAbsYAccelerationAbs: the correlation between the absolute accelerations along the X and Y axes. It provides information about coordinated movement patterns of the object in the horizontal plane.
- corrXPosXAccelerationAbs: the correlation between the X-position and the absolute X-acceleration of the object. It is useful for understanding how positional changes relate to acceleration behavior along the longitudinal axis.
- corrXPosXVelocityAbs: the correlation between the X-position and the absolute X-velocity. It helps assess the consistency of motion along the X-axis.
- corrXPosYPos: the correlation between the X and Y positions of the object, offering insights into the trajectory and directional movement.
- corrXVelocityAbsXAccelerationAbs: the correlation between absolute X-velocity and absolute X-acceleration, which is important for analyzing acceleration trends relative to speed.
- corrXVelocityAbsYVelocityAbs: the correlation between absolute velocities along the X and Y axes, indicating how the object’s longitudinal and lateral movements are related.
- corrYPosYAccelerationAbs: the correlation between Y-position and absolute Y-acceleration, providing information on lateral motion dynamics.
- corrYPosYVelocityAbs: the correlation between Y-position and absolute Y-velocity, which is useful for evaluating lateral movement consistency.
- corrYVelocityAbsYAccelerationAbs: the correlation between absolute Y-velocity and absolute Y-acceleration, offering information about lateral acceleration behavior.
- height: the estimated height of the object, contributing to its geometric characterization.
- stdDevHeight: the standard deviation of the object’s height, which indicates variability in vertical dimensions and potential object instability.
- stdDevLength: the standard deviation of the object’s length, which reflects changes in perceived object size due to motion.
- stdDevWidth: the standard deviation of the object’s width, which provides information about changes in the object’s geometric configuration.
- stdDevXAccelerationAbs: the variability in absolute X-acceleration, which is important for assessing dynamic behavior along the longitudinal axis.
- stdDevXPos: the standard deviation of the X-position, indicating the spread or uncertainty in longitudinal positioning.
- stdDevXVelocityAbs: the standard deviation of absolute X-velocity, offering information about speed consistency along the X-axis.
- stdDevYAccelerationAbs: the variability in absolute Y-acceleration, which is relevant for analyzing lateral dynamic behavior.
- stdDevYPos: the standard deviation of the Y-position, indicating lateral positional dispersion.
- stdDevYVelocityAbs: the standard deviation of absolute Y-velocity, providing information on lateral speed variability.
- xAccelerationAbs: the absolute acceleration along the X-axis, which is useful for understanding the object’s manner of motion.
- xVelocityAbs: the absolute velocity along the X-axis, representing the object’s longitudinal speed.
- yAccelerationAbs: the absolute acceleration along the Y-axis, which is important for lateral movement analysis.
- yVelocityAbs: the absolute velocity along the Y-axis, indicating the object’s lateral speed.
- length: the estimated length of the object, contributing to its geometric characterization.
- width: the estimated width of the object, contributing to its geometric characterization.
- xyAccAbs: the combined absolute acceleration in the X and Y directions, providing a view of the object’s dynamic behavior.
- xyVelAbs: the combined absolute velocity in the X and Y directions, offering a measure of object speed.
- stdDevxyAccAbs: the standard deviation of the combined absolute acceleration, indicating variability in overall dynamic behavior.
- stdDevxyVelAbs: the standard deviation of the combined absolute velocity, providing insights into the consistency of the object’s overall motion.
Features such as the “Object Classification Hypothesis” and “Object ID Hypothesis” were excluded because they do not provide informative value for the analysis but are instead used as identifiers for the object ID and the object class.
3.3. Feature Selection for the Genetic Algorithm Method
Reducing the number of input features is an important consideration in the design of neural network models, as it enables the use of deeper and more complex architectures while maintaining computational efficiency. A more compact feature set decreases the dimensionality of the input space, thereby reducing the number of operations required per inference and minimizing memory consumption. For the genetic algorithm-based method, we use a reduced set of 16 features, selected based on their ability to effectively characterize and discriminate between object classes. This approach improves the computational performance of the model and can also improve classification accuracy by eliminating redundant or non-informative inputs. Consequently, the network can allocate more representational capacity to learning salient patterns relevant to the classification task.
For the feature selection process, we analyzed the correlation matrix using Pearson’s correlation coefficient. Pearson’s technique measures the linear relationship between two variables, providing a value between −1 and 1. A value of 1 indicates a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. This method is instrumental in identifying strongly correlated, and therefore redundant, features, so that a compact and informative subset can be retained.
Figure 4 represents the correlation matrix for all features in our dataset. As in the first method, features such as the “Object Classification Hypothesis” and “Object ID Hypothesis” were excluded. It is evident that some features exhibit strong correlations, indicating redundancy. Redundant features do not contribute new information to the model, as they are highly correlated with other features. In this phase of the analysis, feature selection was conducted by identifying and retaining the features that have minimal redundancy with other features, with emphasis on interpretability and predictive value. As a result, the following 16 features were selected, 3 of which correspond to features of the previous method (Overground Acceleration Hypothesis to xyAccAbs, Overground Speed Hypothesis to xyVelAbs, and Object Length Hypothesis to length):
Overground Acceleration Hypothesis: This feature represents the acceleration of the object over the ground. It is crucial for understanding the dynamics and movement patterns of the object.
Kalman Filter Heading Variance: This feature measures the variance in the object’s heading as estimated by the Kalman filter. It is important for assessing the stability and accuracy of the object’s directional movement.
Kalman Filter X-axis Velocity Variance: This feature captures the variance in the object’s velocity along the X-axis, providing insights into the object’s speed and movement consistency.
Kalman Filter Y-axis Velocity Variance: Similarly to the X-axis velocity variance, this feature measures the variance in the object’s velocity along the Y-axis.
Kalman Filter Yaw Rate Variance: This feature represents the variance in the object’s yaw rate, which is essential for understanding rotational movements and changes in direction.
Object Length Hypothesis: This feature estimates the length of the object, which is important for size characterization and spatial analysis.
Length of Radar Target Cloud Hypothesis: This feature measures the length of the radar target cloud, providing information on the spatial extent of the detected object.
Overground Speed Hypothesis: This feature represents the object’s speed over the ground, which is crucial for movement analysis and prediction.
Width of Radar Target Cloud Hypothesis: This feature measures the width of the radar target cloud, contributing to the spatial characterization of the object.
Range Variance of Raw Targets: This feature captures the variance in the range measurements of raw targets, providing insights into the object’s distance and positional accuracy.
Mean Signal-to-Noise Ratio of Raw Targets: This feature represents the average signal-to-noise ratio of raw targets, which is important for assessing the quality and reliability of the radar data.
Speed Variance of Raw Targets: This feature measures the variance in the speed of raw targets, contributing to the analysis of movement consistency.
Y Position Variance of Raw Targets: This feature captures the variance in the Y position of raw targets, providing spatial accuracy information.
Filtered Radar Cross Section (RCS): This feature represents the filtered radar cross section of the object, which is crucial for understanding the object’s reflectivity and radar signature.
Object Area Hypothesis: This feature estimates the area of the object, contributing to size characterization and spatial analysis.
Signal-to-Noise Ratio Variance of Raw Targets: This feature measures the variance in the signal-to-noise ratio (SNR) of raw targets. SNR is a critical metric in radar systems, as it quantifies the ratio of the signal power to the noise power. A higher SNR indicates a clearer and more distinguishable signal from the background noise. The variance in SNR provides insights into the consistency and reliability of the radar measurements, which is essential for accurate object detection and tracking.
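One way to turn a Pearson correlation matrix into a low-redundancy feature subset is a greedy filter, sketched below. The exact selection rule used in this work is not stated, so the function name, the 0.9 threshold, and the first-come ordering are assumptions chosen only to illustrate the idea.

```python
import numpy as np


def drop_redundant(X, names, threshold=0.9):
    """Keep a feature only if |Pearson r| with every kept feature is below threshold."""
    # Columns are features, rows are samples.
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    kept = []
    for j in range(len(names)):
        # Compare the candidate against the features retained so far.
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]
```

Because the filter is order dependent, a practical variant would probably visit features in order of domain relevance, so that the more interpretable member of a correlated pair is the one that survives.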
Figure 5 shows the distribution of classes within the whole dataset across all the features, utilizing Principal Component Analysis (PCA). PCA is a statistical technique that transforms the original features into a set of linearly uncorrelated components, ordered by the amount of variance they capture. This method is particularly advantageous for data visualization as it reduces dimensionality while preserving the most significant variance in the data. By visualizing the data in this reduced space, we can better understand the separation of classes, which is crucial for evaluating the performance of future neural network models. Effective class separation in the PCA-transformed space often correlates with improved model performance, as it indicates that the features are informative and discriminative.
Figure 5 also presents the PCA plot for the features utilized in the neural network. The analysis of the two PCA plots in Figure 5 reveals distinct patterns in the distribution of classes, which are expected to influence the performance of the neural network on the test dataset:
Car Class: The car class exhibits a well-defined distribution, as evidenced by the clear separation in the PCA plots. This distinct clustering suggests that the features associated with cars are highly informative and discriminative. Consequently, the neural network is likely to achieve high performance in classifying cars within the test dataset.
Cyclist Class: The cyclist class demonstrates a poorly defined distribution, with significant overlap and interleaving with the pedestrian and car classes. This lack of clear separation indicates that the features for cyclists are less distinctive, leading to potential misclassification. As a result, the neural network may struggle to accurately distinguish cyclists from other classes, negatively impacting its performance on the test dataset. However, the distribution is improved after reducing the number of features (see Figure 5b).
Pedestrian Class: The pedestrian class shows a better-defined distribution compared to the cyclist class, with slightly improved separation from the other classes. This suggests that the features for pedestrians are more consistent and discriminative, albeit not as distinct as those for cars. Therefore, the neural network is expected to perform moderately well in classifying pedestrians, with better results than for cyclists but not as high as for cars.
Figure 6 illustrates the importance of features utilized in the neural network, as determined by Principal Component Analysis (PCA) loadings. PCA loadings represent the coefficients of the original features in the principal components, indicating how much each feature contributes to the variance captured by the components. Features with larger absolute values of loadings are considered more important because they contribute more significantly to the principal components. For example, the length of the radar target cloud (RTC) is a highly significant feature in the context of PCA due to its direct influence on the variance structure of radar-derived datasets. The RTC length serves as a key descriptor because it captures the spatial extent of the radar returns corresponding to a target object. One of the primary reasons for RTC length’s importance lies in its inherent variability across different objects. Since objects vary considerably in their physical dimensions and geometric shapes, the length of their corresponding radar target clouds also varies significantly. This variation effectively reflects intrinsic differences among objects, making RTC length a powerful feature for distinguishing between target classes. From a practical standpoint, the ability to measure the length of the radar target cloud provides an intuitive and straightforward approach for categorizing objects. Length is a fundamental geometric attribute that often correlates with the physical size of the target itself. Objects with larger physical dimensions typically generate longer radar target clouds, while smaller objects produce shorter clouds.
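The projection and the loadings discussed above can be obtained from a singular value decomposition of the centered data. The sketch below is an illustrative NumPy version, not the code used to produce Figures 5 and 6; the function name and the choice to center without further standardization are assumptions.

```python
import numpy as np


def pca(X, n_components=2):
    """Return projected samples, loadings, and explained variance ratios."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # rows hold the loadings per component
    projected = Xc @ components.T           # sample coordinates for plotting
    explained = (S ** 2) / np.sum(S ** 2)   # share of variance per component
    return projected, components, explained[:n_components]
```

Plotting the two columns of `projected` colored by class gives a scatter of the kind described for Figure 5, while sorting features by the absolute value of a row of `components` gives an importance ordering of the kind described for Figure 6.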
An additional analysis was conducted to support the findings obtained from the Pearson correlation heatmap. For this purpose, SHAP (SHapley Additive exPlanations) analysis, described in ref. [24], was implemented to provide a deeper understanding of feature importance and individual feature contributions to the model’s predictions. The SHAP summary plot visualizes the impact of each feature on the output of the model across the dataset.
Figure 7 illustrates the SHAP feature importance analysis, highlighting the relative contribution of each input variable to the model’s classification performance across object classes such as Car, Pedestrian, and Cyclist. The most influential feature identified is the Object Area Hypothesis, which demonstrates a strong discriminative capability due to the inherent differences in object dimensions across classes. This feature effectively captures the spatial footprint of objects, making it a robust indicator for class separation. The second most important feature is the Object Length Hypothesis, which further enhances classification accuracy by leveraging the longitudinal dimension of objects, a characteristic that varies significantly between categories such as pedestrians and vehicles. Although the Object Width Hypothesis ranks third in importance, it is considered redundant due to its derivation from the same geometric components that define object area. Consequently, it may be excluded from the final feature set to reduce dimensionality without compromising model performance.
In the final feature selection process for radar-based object classification, it is essential to consider physical characteristics such as Signal-to-Noise Ratio (SNR) and Radar Cross Section (RCS), which significantly contribute to the generalization and robustness of machine learning models. While kinematic features, such as speed and acceleration, describe dynamic behavior that can vary widely across scenarios, SNR and RCS provide intrinsic, relatively stable information about the object’s detectability and reflective properties. SNR reflects the clarity of the radar return relative to background noise, indicating how reliably an object can be distinguished, whereas RCS quantifies the reflected radar energy, influenced by the object’s shape, size, orientation, and material composition. The inclusion of RCS in classification tasks has demonstrated considerable efficacy in automotive radar applications. For instance, Nebhwani and Dashpute [25] showed that temporal RCS patterns extracted from 77 GHz radar enabled real-time classification between ground and non-ground objects, supporting robust feature extraction. Tatarchenko and Rambach [26] further advanced this approach by employing histogram-based RCS features within a deep learning model, achieving high classification accuracy and robustness against noise and missing data. Complementing these findings, Coşkun and Bilicz [27] demonstrated that RCS histograms could be effectively used to cluster and classify 14 different vehicle types in real-world driving conditions, highlighting the discriminative power of RCS-based features in complex and variable environments.
3.4. Neural Network Architecture Based on Grid Search
This method extends the one presented in ref. [9]. We employed the same feature set and neural network architecture as in our previous work; however, we utilized an updated dataset and computed additional evaluation metrics to gain deeper insight into the model’s performance.
In the design of the neural network model, the initial step involved determining the optimal set of hyperparameters through a grid search approach. The hyperparameters considered included batch size, number of epochs, dropout rate, optimizer type, learning rate, and the number of hidden units. Based on the results of this search, the selected configuration comprised a batch size of 64, a dropout rate of 0.1, 50 training epochs, 10 neurons per hidden layer, a learning rate of 0.01, and the RMSprop optimizer. RMSprop was chosen due to its computational efficiency and its suitability for datasets with limited variation, making it an ideal choice for the current classification task.
The neural network architecture is the same as in ref. [9], consisting of a flattened input layer, two hidden dense layers, and a dense output layer. Each hidden layer contained 10 neurons and utilized the ReLU activation function to introduce non-linearity. The output layer, designed for a three-class classification problem, employed the SoftMax activation function to produce a probability distribution over the classes. The loss function used was sparse categorical crossentropy, which is appropriate for multi-class classification tasks involving integer-labeled targets. The dataset was partitioned into training, validation, and testing subsets in a 60%:20%:20% ratio, respectively.
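To make the size of this architecture concrete, the following NumPy sketch builds the same layer layout (29 inputs, two hidden layers of 10 ReLU units, 3 SoftMax outputs) and runs a forward pass. It is only an illustration of the structure: the weight initialization, the function names, and the omission of dropout are assumptions, and the actual model was trained with a deep learning framework rather than with this code.

```python
import numpy as np


def init_params(n_in, hidden=(10, 10), n_out=3, seed=0):
    """Create (weights, bias) pairs for each dense layer."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, *hidden, n_out]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, X):
    """Dense -> ReLU for hidden layers, Dense -> SoftMax for the output."""
    h = np.asarray(X, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)            # ReLU activation
    W, b = params[-1]
    logits = h @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)       # class probabilities
```

With 29 input features this layout has (29·10 + 10) + (10·10 + 10) + (10·3 + 3) = 443 trainable parameters, which illustrates how compact the network is.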
Unlike the previous approach, where training was halted upon reaching an accuracy threshold of 99%, this iteration excluded such a constraint due to the increased dataset size. Instead, early stopping was implemented to mitigate overfitting, with a patience parameter of 5 and monitoring based on validation accuracy. Additionally, the ReduceLROnPlateau callback was employed to dynamically adjust the learning rate during training. This mechanism reduced the learning rate by a factor of 0.1 when the validation accuracy plateaued for three consecutive epochs, thereby preventing the model from overshooting optimal solutions and promoting more stable convergence.
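The interaction of the two callbacks can be traced with a small simulation over a sequence of validation accuracies. This is a simplified sketch of the behavior described above, not the framework implementation: it shares one "best value" between both mechanisms and ignores options such as a minimum improvement or a cooldown period, and the function name is hypothetical.

```python
def simulate_callbacks(val_acc, lr=0.01, stop_patience=5, lr_patience=3, factor=0.1):
    """Return (epoch, learning_rate) pairs until early stopping triggers."""
    best = float("-inf")
    stop_wait = lr_wait = 0
    history = []
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            # Improvement: remember it and reset both counters.
            best, stop_wait, lr_wait = acc, 0, 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor      # plateau detected: shrink the learning rate
                lr_wait = 0
        history.append((epoch, lr))
        if stop_wait >= stop_patience:
            break                 # no improvement for stop_patience epochs
    return history
```

For example, if the validation accuracy peaks at epoch 2 and does not improve afterwards, the learning rate drops from 0.01 to 0.001 at epoch 5 and training stops at epoch 7.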
Figure 8 shows the architecture of the neural network used in the first proposed method.
In our study, we address the issue of class imbalance in a multi-class classification task. The dataset exhibits a significant imbalance among the three classes, with the “car” class being overrepresented compared to the “pedestrian” and “cyclist” classes. This imbalance can severely affect the performance of the model, as it tends to bias predictions toward the majority class, leading to poor generalization and reduced accuracy for the minority classes. To mitigate this issue, we employed the Synthetic Minority Over-sampling Technique (SMOTE) [8], a widely recognized method for generating synthetic samples of the minority classes. SMOTE works by creating new synthetic instances of the minority classes through interpolation between existing samples and their nearest neighbors in the feature space. This approach increases the representation of underrepresented classes without simply duplicating existing data, thereby enhancing the diversity and informativeness of the training set. To ensure reproducibility across different runs, a fixed random seed was set. In our context, applying SMOTE ensures that the neural network receives a more balanced and representative view of all three object categories during training. This not only improves the model’s ability to correctly classify pedestrians and cyclists but also contributes to a more robust and fair classification system overall. The use of SMOTE is thus a critical step in enhancing the performance and reliability of our neural network in real-world scenarios where class distributions are inherently uneven.
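The interpolation step at the core of SMOTE can be sketched in a few lines. The version below is a didactic approximation for a single minority class, not the implementation used in this study: the function name, the neighborhood size k = 5, and the seed value are assumptions, and the brute-force distance matrix is only practical for small inputs.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=42):
    """Generate n_new synthetic points between minority samples and their neighbors."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)       # fixed seed for reproducibility
    k = min(k, len(X_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)             # a point is not its own neighbor
    neighbours = np.argsort(d, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the connecting segment.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because every synthetic point lies on a segment between two real minority samples, the generated data stay inside the region already occupied by that class, which also explains why the spread of synthetic features tends to be slightly smaller than that of the originals.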
3.5. Neural Network Architecture Based on Genetic Algorithm
In the design process of the neural network for the second method, we employed a Genetic Algorithm to achieve rapid convergence to an optimal solution. A genetic algorithm is a heuristic search inspired by the process of natural selection, where potential solutions evolve over generations through selection, crossover, and mutation. This method is particularly effective for optimizing complex problems with large search spaces.
In constructing the neural network architecture, we considered the following hyperparameters: batch size, number of epochs, optimizer (Adam or Root Mean Square Propagation), learning rate, number of neurons per hidden layer, and number of layers.
For the genetic algorithm, we considered the following parameters: population size of 40, number of generations set to 10, crossover probability of 0.4, and mutation probability of 0.0. These parameters were selected heuristically, guided by values commonly reported in related studies and validated through preliminary experimentation, with the aim of enhancing genetic diversity, ensuring thorough exploration of the search space, and promoting the exchange of beneficial traits. Such moderate population sizes, low mutation rates, and limited generations have been shown to achieve robust convergence with manageable computational costs in feature selection and classification tasks [28,29,30,31]. These settings balance search effectiveness with efficiency, avoiding excessive runtime while maintaining solution quality. Parallelization was implemented using Python’s DEAP framework [32] in combination with the multiprocessing module, replacing DEAP’s default sequential mapping with a parallel map function to evaluate all individuals in a generation concurrently. This embarrassingly parallel approach, well-suited to independent fitness evaluations, can enable near-linear speedup as the number of available CPU cores increases.
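The structure of such a search, including the replaceable map function, can be sketched without any external library. The code below is a toy illustration and not the DEAP-based implementation: the search space, the stand-in fitness function, the tournament selection of size 3, and the uniform crossover are all assumptions; only the population size, number of generations, and crossover and mutation probabilities are taken from the text.

```python
import random

SPACE = {  # hypothetical search space, not the one used in the study
    "batch_size": [32, 64, 128],
    "learning_rate": [0.1, 0.01, 0.001],
    "neurons": [8, 16, 32, 64],
    "layers": [1, 2, 3],
}
KEYS = list(SPACE)


def toy_fitness(ind):
    """Stand-in for 'train the network and return validation accuracy'."""
    return (-abs(ind["neurons"] - 32) / 64
            - abs(ind["layers"] - 2) / 3
            - ind["learning_rate"])


def run_ga(fitness, pop_size=40, generations=10, cx_prob=0.4, mut_prob=0.0,
           map_fn=map, seed=1):
    rng = random.Random(seed)
    pop = [{k: rng.choice(SPACE[k]) for k in KEYS} for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # All individuals are evaluated as one batch; map_fn may be parallel.
        fits = list(map_fn(fitness, pop))
        for ind, f in zip(pop, fits):
            if f > best_fit:
                best, best_fit = dict(ind), f

        def pick():
            # Tournament selection: best of three random individuals.
            i = max(rng.sample(range(pop_size), 3), key=fits.__getitem__)
            return dict(pop[i])

        children = [pick() for _ in range(pop_size)]
        for a, b in zip(children[::2], children[1::2]):
            if rng.random() < cx_prob:          # uniform crossover
                for k in KEYS:
                    if rng.random() < 0.5:
                        a[k], b[k] = b[k], a[k]
        for c in children:
            if rng.random() < mut_prob:         # random-reset mutation
                k = rng.choice(KEYS)
                c[k] = rng.choice(SPACE[k])
        pop = children
    return best, best_fit
```

Passing `map_fn=pool.map` with a `multiprocessing.Pool` (created under an `if __name__ == "__main__":` guard, with a picklable top-level fitness function) evaluates each generation in parallel; this mirrors the DEAP idiom of registering `pool.map` as the toolbox's `map`.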
The parallel GA workflow distributes fitness evaluations across a fixed pool of worker processes, synchronizing results at the end of each generation before applying genetic operators. This method has demonstrated substantial runtime reductions in hyperparameter optimization and feature selection without degrading accuracy [33,34,35]. The efficiency gains stem from the independence of individual evaluations, allowing effective utilization of multi-core architectures. Upon completion, the multiprocessing pool is closed and joined to ensure proper resource management. Such parallelization strategies not only accelerate the evolutionary search process but also improve scalability, making them particularly advantageous for computationally intensive wrapper-based feature selection problems.
The first preprocessing step for the neural network is to ensure that all features share the same range. For this, we apply a MinMaxScaler, ensuring uniform feature contribution, optimizing gradient descent efficiency, and reducing computational complexity. The scaler was first fitted on the training subset of the dataset. Subsequently, the learned scaling parameters were used to transform the validation subset, ensuring consistency in feature scaling across both training and validation data.
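Fitting the scaling parameters on the training subset only, and reusing them elsewhere, can be illustrated as follows. This is a minimal NumPy sketch of the idea with hypothetical function names, not the scikit-learn class used in the study.

```python
import numpy as np


def fit_minmax(X_train):
    """Learn per-feature minimum and range from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0        # avoid division by zero for constant features
    return lo, span


def apply_minmax(X, lo, span):
    """Scale any subset with the parameters learned on the training data."""
    return (np.asarray(X, dtype=float) - lo) / span
```

Validation values that lie outside the training range map outside [0, 1]; this is expected and is the price of not letting validation data influence the preprocessing.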
We continue to use the Synthetic Minority Over-sampling Technique (SMOTE) [8] to address the imbalance in the dataset by generating synthetic data for underrepresented classes. The following two tables present a quantitative summary of the synthetic data generated. Specifically, they detail the number of artificially generated samples per class and the proportion of synthetic samples relative to the original dataset.
Table 2 presents the number of samples per class in the dataset prior to the application of SMOTE. The Car class dominates the dataset, with a total of 830,764 samples, accounting for approximately 79.2% of the overall data. In contrast, the Pedestrian and Cyclist classes are significantly underrepresented, comprising only 13.7% and 7.0% of the dataset, respectively.
Table 3 reports the number of synthetic samples generated for each minority class following the application of SMOTE. The objective was to balance the dataset by increasing the size of each minority class (Pedestrian and Cyclist) to match that of the majority class (Car), which contained 830,764 samples. As shown, SMOTE generated 686,730 synthetic samples for the Pedestrian class (47.6% of all synthetic samples) and 756,987 for the Cyclist class (52.4%). The Car class did not require any synthetic data generation and is therefore not included in this table. This oversampling process effectively mitigated the class imbalance, creating a more uniform distribution across all classes.
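The figures quoted for Tables 2 and 3 can be cross-checked with simple arithmetic. The sketch below assumes that each minority class was oversampled exactly up to the size of the Car class, so that the original minority counts follow by subtraction; those derived counts are an inference made here, not values quoted from the tables.

```python
car = 830_764
synthetic = {"pedestrian": 686_730, "cyclist": 756_987}

# Original minority counts, assuming oversampling up to the Car class size.
original = {"car": car, **{c: car - n for c, n in synthetic.items()}}
total = sum(original.values())

# Class shares before SMOTE, in percent.
share_original = {c: round(100 * n / total, 1) for c, n in original.items()}

# Share of each class among all synthetic samples, in percent.
total_syn = sum(synthetic.values())
share_synthetic = {c: round(100 * n / total_syn, 1) for c, n in synthetic.items()}

print(share_original)   # {'car': 79.2, 'pedestrian': 13.7, 'cyclist': 7.0}
print(share_synthetic)  # {'pedestrian': 47.6, 'cyclist': 52.4}
```

Under this assumption the percentages agree with those reported in the text, which indicates that the 47.6% and 52.4% figures are shares of the synthetic samples rather than shares of the balanced dataset.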
To gain a deeper understanding of the characteristics of the synthetic data generated by SMOTE, a comparative feature distribution analysis was conducted. This analysis examined and contrasted the distributions of key features across the original and synthetic datasets. By evaluating the alignment between these distributions, we aimed to assess the consistency and quality of the artificially generated samples and their ability to preserve the statistical properties of the original data.
Table 4 provides a statistical comparison between the original and synthetic datasets for two object classes: pedestrian and cyclist. The evaluation includes key statistical measures: mean, standard deviation (Std), and the Kolmogorov–Smirnov (KS) test, to assess how well the synthetic data generated by SMOTE replicates the distributional properties of the original data. For both pedestrian and cyclist classes, the mean differences across features between original and SMOTE-generated synthetic data are consistently small (0.0006–0.02), indicating that the synthetic data effectively preserves the central tendency of the original distributions. Key discriminative features, such as Filtered Radar Cross Section (RCS), Overground Speed Hypothesis, and Object Length Hypothesis, exhibit particularly low deviations (e.g., 0.0012 for pedestrian RCS, 0.0007 for cyclist object length), reflecting high fidelity in replicating representative characteristics. Variability is also well maintained, with standard deviation ratios, defined as Std(Synthetic)/Std(Original), generally ranging from 0.92 to 1.00, although slight underestimations (e.g., 0.9285 for pedestrian object length, 0.9552 for cyclist cloud width) suggest moderate smoothing from SMOTE’s interpolation process. Kolmogorov–Smirnov (KS) statistics further confirm distributional similarity, with most features showing low KS values (<0.05) and non-significant p-values (e.g., pedestrian RCS p = 0.991, cyclist RCS p = 0.987). However, features such as Hypothesized Area of Object for both classes yield extremely low p-values (e.g., 1.02 × 10^−104 for pedestrian, 1.37 × 10^−117 for cyclist), indicating distributional differences despite minimal mean deviations.
The statistical comparison demonstrates that the SMOTE-generated synthetic data effectively preserves the key distributional properties of the original data for both pedestrian and cyclist classes. The synthetic samples replicate the original data’s central tendency and spread with high fidelity in most features. Although slight deviations in distributional shape exist, as indicated by some low KS p-values, these are not widespread and do not appear to compromise the representativeness of the synthetic data. Consequently, SMOTE appears to have generated high-quality synthetic samples, suitable for mitigating class imbalance while preserving the integrity of the original feature space.
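The three per-feature quantities used in this comparison (mean difference, standard deviation ratio, and KS statistic) can be computed as sketched below. The function name is hypothetical and the KS statistic is computed directly from the empirical distribution functions; p-values are omitted, since they additionally depend on the sample sizes.

```python
import numpy as np


def compare_feature(original, synthetic):
    """Compare one feature between original and synthetic samples."""
    a = np.sort(np.asarray(original, dtype=float))
    b = np.sort(np.asarray(synthetic, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return {
        "mean_diff": abs(a.mean() - b.mean()),
        "std_ratio": b.std(ddof=1) / a.std(ddof=1),   # Std(Synthetic)/Std(Original)
        "ks_stat": float(np.max(np.abs(cdf_a - cdf_b))),
    }
```

Note that for a fixed KS statistic the p-value decreases as the number of samples grows, so with several hundred thousand samples per class even small differences in distributional shape can produce very small p-values.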
To evaluate the quality and distribution of the synthetic data generated for model training, a two-step t-distributed Stochastic Neighbor Embedding (t-SNE) analysis was conducted. The goal was to assess whether the synthetic samples accurately represent the underlying data manifold without introducing aberrant or unrealistic values.
Figure 9 presents the t-SNE visualization of the original and SMOTE-generated synthetic samples for each object class. For the pedestrian class (Figure 9a), the synthetic samples (blue crosses) closely overlap with the original samples (blue dots), indicating that SMOTE successfully replicates the underlying distribution and spatial variability of this minority class. Similarly, for the cyclist class (Figure 9b), synthetic samples (orange crosses) are well integrated within the distribution of original samples (orange dots), demonstrating effective augmentation and preservation of the class’s feature space structure. In contrast, for the car class (Figure 9c), only original samples (green dots) are shown, as SMOTE was not applied to this majority class to avoid further overrepresentation. The visual consistency between synthetic and original samples for the minority classes supports the quality of the data augmentation process and its potential to enhance classifier performance without distorting the intrinsic feature relationships.
To enhance the robustness of the model, we employed a dropout technique, specifically discarding one neuron from each hidden layer. This approach is crucial for improving the final convergence of the model. Dropout helps prevent overfitting by randomly omitting neurons during training, thereby forcing the network to learn more generalized patterns.
As the loss function for our neural network, we use sparse categorical cross-entropy. It is memory efficient, since integer class labels do not need to be one-hot encoded, and it has well-behaved gradients, which are essential for effective backpropagation in multi-class classification problems. The loss computes the cross-entropy between the predicted probability distribution and the actual distribution, which is represented by the integer class label. Only the term corresponding to the true class contributes: the loss for a sample is the negative logarithm of the probability that the model assigns to its true class. Minimizing it therefore pushes the predicted probability of the correct class toward 1, which improves the model’s accuracy.
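The behaviour described above can be written out in a few lines. The following is a minimal sketch, not the framework routine used for training. It assumes that the class identifiers 1, 3, and 5 are mapped to consecutive indices 0, 1, and 2, and the clipping constant eps is an illustrative choice for numerical safety.

```python
import math

# Assumed mapping from the dataset's class ids to consecutive indices.
LABEL_TO_INDEX = {1: 0, 3: 1, 5: 2}  # pedestrian, cyclist, car


def sparse_categorical_crossentropy(probs, labels, eps=1e-7):
    """Mean of -log(p_true) over a batch; labels are integer class indices."""
    losses = []
    for p, y in zip(probs, labels):
        # Only the probability assigned to the true class enters the loss.
        p_true = min(max(p[y], eps), 1.0 - eps)
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses)


# Toy batch of softmax outputs (illustrative values only).
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]
raw_labels = [1, 3, 5]
indices = [LABEL_TO_INDEX[c] for c in raw_labels]
print(sparse_categorical_crossentropy(predicted, indices))
```

A confident correct prediction contributes a loss close to 0, whereas assigning a low probability to the true class is penalized heavily because of the logarithm.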
This choice, combined with adaptive training strategies such as early stopping and learning rate modulation, enhances generalization by preventing overfitting and optimizing computational resources. Early stopping halts training when performance on unseen data begins to deteriorate, while learning rate modulation helps navigate the optimization landscape by reducing the learning rate when the monitored metric plateaus, thereby ensuring robust classification performance.
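The interplay of the two strategies can be sketched without a training framework by replaying a sequence of validation losses. The function name, the patience values, and the reduction factor below are hypothetical and are not the settings used in this work. The sketch only shows the control logic: the learning rate is reduced when the monitored loss plateaus, and training stops once no improvement has been seen for a longer period.

```python
def train_with_callbacks(val_losses, lr=1e-3, lr_factor=0.5,
                         lr_patience=2, stop_patience=4):
    """Replay validation losses and apply plateau-based LR decay and early stopping."""
    best = float("inf")
    epochs_since_best = 0   # drives early stopping
    epochs_since_lr = 0     # drives learning rate reduction
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_lr = 0
        else:
            epochs_since_best += 1
            epochs_since_lr += 1
            if epochs_since_lr >= lr_patience:
                lr *= lr_factor          # plateau detected: shrink the step size
                epochs_since_lr = 0
        history.append((epoch, loss, lr))
        if epochs_since_best >= stop_patience:
            break                        # no improvement for too long: stop
    return best, history


# Illustrative validation curve that improves, then degrades.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.6]
best, history = train_with_callbacks(losses)
for epoch, loss, lr in history:
    print(epoch, loss, lr)
```

In this toy run the last value of the curve is never reached, which illustrates the computational saving: training ends as soon as the stopping criterion is met.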
The neural network architecture presented in Figure 10 represents the final configuration selected for our classification task and is the result of applying a genetic algorithm to optimize the hyperparameter space. Unlike the Grid Search-based method, where a smaller neural network was used, the current approach uses a more complex model, allowing for a more rigorous evaluation of the network’s capabilities.
The genetic algorithm employed in this work is a population-based metaheuristic inspired by the principles of natural selection and evolution. It iteratively evolves a population of candidate solutions, each representing a unique combination of hyperparameters such as batch size, number of epochs, optimizer type, learning rate, number of neurons, and number of hidden layers, through operations like selection, crossover, and mutation. By evaluating the fitness of each candidate based on model performance metrics (e.g., accuracy, precision, recall), the algorithm converges toward an optimal or near-optimal configuration. This approach is particularly advantageous in high-dimensional search spaces, where exhaustive methods like grid search become computationally prohibitive. In our implementation, the genetic algorithm was executed in parallel to accelerate convergence, ultimately yielding a neural network architecture that balances complexity and performance effectively.
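The loop of selection, crossover, and mutation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the candidate values in SEARCH_SPACE, the tournament selection scheme, the elitism step, and the mutation rate are all hypothetical. The function toy_fitness is an arbitrary stand-in for training a network and returning a validation metric.

```python
import random

# Hypothetical search space: the candidate values are illustrative only.
SEARCH_SPACE = {
    "batch_size": [32, 64, 128, 256],
    "epochs": [20, 50, 100],
    "optimizer": ["adam", "sgd", "rmsprop"],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "neurons": [16, 32, 64, 128],
    "hidden_layers": [1, 2, 3, 4],
}


def random_candidate(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(parent_a, parent_b, rng):
    # Uniform crossover: each gene is inherited from one of the two parents.
    return {name: (parent_a if rng.random() < 0.5 else parent_b)[name]
            for name in SEARCH_SPACE}


def mutate(candidate, rng, rate=0.15):
    # Each gene is resampled from the search space with probability `rate`.
    return {name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
            for name, value in candidate.items()}


def tournament(population, scores, rng, size=3):
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]


def evolve(fitness, generations=15, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # Fitness evaluations are independent, so this step can run in parallel.
        scores = [fitness(c) for c in population]
        top = max(range(pop_size), key=lambda i: scores[i])
        if scores[top] > best_score:
            best, best_score = dict(population[top]), scores[top]
        next_gen = [population[top]]  # elitism: keep the current best unchanged
        while len(next_gen) < pop_size:
            child = crossover(tournament(population, scores, rng),
                              tournament(population, scores, rng), rng)
            next_gen.append(mutate(child, rng))
        population = next_gen
    return best, best_score


def toy_fitness(candidate):
    # Stand-in for "train the network and return validation accuracy".
    # The target values are arbitrary and do not reflect the reported optimum.
    target = {"optimizer": "adam", "learning_rate": 1e-3,
              "neurons": 64, "hidden_layers": 3}
    return sum(candidate[k] == v for k, v in target.items()) / len(target)


best, best_score = evolve(toy_fitness)
print(best, best_score)
```

With six hyperparameters and only a few values each, this toy space already contains 1728 combinations, whereas the loop above evaluates at most 180 candidates. The gap widens quickly as the number of hyperparameters grows, which is the motivation given above for preferring an evolutionary search over an exhaustive grid.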
We maintain a consistent data partitioning strategy to ensure robust model training and evaluation. Specifically, the dataset is divided into three subsets: 60% is allocated for training, 20% for validation, and the remaining 20% for testing. This ratio is chosen to provide a sufficient amount of data for the model to learn effectively, while also preserving distinct subsets for tuning hyperparameters and assessing generalization performance.
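A 60/20/20 partition of this kind can be sketched as below. The function name and the fixed seed are assumptions, and the sketch uses a plain random shuffle. Whether the split in this work was stratified by class is not stated, so no stratification is shown.

```python
import random


def split_60_20_20(samples, seed=42):
    """Shuffle once, then cut into train (60%), validation (20%), and test (20%)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the partition reproducible
    n_train = (6 * len(indices)) // 10
    n_val = (2 * len(indices)) // 10
    train = [samples[i] for i in indices[:n_train]]
    val = [samples[i] for i in indices[n_train:n_train + n_val]]
    test = [samples[i] for i in indices[n_train + n_val:]]
    return train, val, test


# Toy data: 100 sample ids standing in for feature vectors.
train, val, test = split_60_20_20(list(range(100)))
print(len(train), len(val), len(test))
```

Using the same seed for every experiment keeps the three subsets identical across runs, so that differences in results can be attributed to the model configuration rather than to the partition.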
To assess the risk of over-parameterization, the ratio between the number of training samples and the number of model parameters was calculated. The model has 8043 trainable parameters and was trained on approximately 1,495,375 samples (real and synthetic data combined), resulting in a data-to-parameter ratio of approximately 186:1 (1,495,375/8043 ≈ 185.9). This ratio suggests that the model has sufficient data to generalize effectively without overfitting, assuming proper regularization and training procedures are applied.