5.1. Experimental Summary
The experimental results demonstrate two major findings regarding CNN behavior and dataset-driven performance prediction. First, the systematic ablation study revealed that object-level pattern features play a decisive role in shaping CNN generalization. Models trained on datasets containing object patterns (Datasets 3 and 4) achieved substantially higher mean accuracy on the synthetic testbed than models trained on datasets lacking such patterns, and a similar gap was observed for F1 scores. These results indicate that internal structural variation, rather than background variation, provides more discriminative information for shallow CNNs.
Second, the dataset similarity prediction algorithm achieved a strong correlation between predicted and observed performance on the synthetic testbed. The algorithm correctly estimated that Dataset 4, which shares the highest feature overlap with the testbed (87%), would yield the strongest performance (∼88.8% accuracy), whereas Datasets 1–3, with lower overlaps (65–75%), would perform comparably to one another but worse. This result provides empirical support for the restricted-form CNN–Apriori equivalence proposed in this work and suggests that feature overlap can serve as a meaningful proxy for expected model performance.
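To make the overlap-based prediction concrete, it can be sketched as a Jaccard similarity between sets of annotated feature labels. The feature names and set contents below are purely illustrative assumptions, not the actual annotations used in this study; the point is only that a dataset whose feature set nearly covers the testbed's scores a high overlap.

```python
def feature_overlap(dataset_feats: set, testbed_feats: set) -> float:
    """Jaccard overlap between two sets of annotated feature labels."""
    intersection = dataset_feats & testbed_feats
    union = dataset_feats | testbed_feats
    return len(intersection) / len(union)

# Hypothetical feature annotations, for illustration only.
testbed = {"stripe", "dot", "grid", "ring", "cross", "blob", "edge", "corner"}
dataset_a = {"stripe", "dot", "grid", "ring", "cross", "blob", "edge"}  # near-complete coverage
dataset_b = {"stripe", "dot", "grid", "noise", "shadow"}                # partial coverage

print(feature_overlap(dataset_a, testbed))  # prints: 0.875
print(feature_overlap(dataset_b, testbed))  # prints: 0.3
```

Under this toy encoding, the high-overlap dataset would be predicted to transfer best to the testbed, mirroring the qualitative ranking reported above.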
Taken together, these findings introduce a novel framework in which dataset feature composition rather than model architecture forms the primary basis for predicting CNN behavior. This contributes to a shift toward dataset-centric evaluation and optimization, offering an efficient alternative to model-driven interpretability methods.
5.2. Limitations and Future Work
Although promising, the proposed framework is subject to several important limitations. First, all experiments were conducted using controlled synthetic datasets in which feature placement, frequency, and class relationships are deterministic. As a result, the behavior of the dataset similarity algorithm and the CNN–Apriori correspondence has not yet been validated on real-world images, where uncontrolled correlations, noise, and entangled features may reduce the stability of these relationships.
Second, the CNN–Apriori equivalence is derived under restricted assumptions, including a single convolutional layer, linear or piecewise-linear activations, no regularization, and a stationary feature distribution. These conditions do not extend to deeper architectures such as ResNet or VGG, where nonlinear interactions and skip connections break the simplifying assumptions required for the equivalence. In such cases, the proposed similarity framework should be viewed as a heuristic for dataset comparison rather than a general theoretical identity.
The computational advantages of the method come with tradeoffs. The dataset similarity estimator scales only with the number of feature groups, since no perturbation sampling or model retraining is required. By contrast, SHAP and LIME typically require a large number of model evaluations per sample. However, the simplicity of the estimator means that it may not capture higher-order feature interactions present in real-world datasets.
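A back-of-envelope comparison illustrates the gap. All three parameters below are assumed values chosen for illustration, not measurements from this study; perturbation-based explainers commonly use on the order of hundreds to thousands of evaluations per explained sample.

```python
# Hypothetical cost comparison: overlap estimator vs. perturbation-based explainers.
feature_groups = 40                 # assumed number of annotated feature groups
samples = 1_000                     # assumed size of the evaluation set
perturbations_per_sample = 1_000    # illustrative order of magnitude for LIME/KernelSHAP

overlap_ops = feature_groups                              # one set comparison per group
perturbation_evals = samples * perturbations_per_sample   # model forward passes

print(overlap_ops, perturbation_evals)  # prints: 40 1000000
```

Even under these rough assumptions, the estimator's cost is several orders of magnitude below that of perturbation-based attribution, which is the tradeoff the text describes.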
While the Overlap metric is stable under synthetic control, its reliability on real datasets remains to be examined. Real-world image features are often nonlinear, unlabeled, and spatially diffuse, making exact feature overlap difficult to quantify. Future studies will explore adaptations of this metric to noisy, high-dimensional visual domains.
Finally, the correlation result was computed using only the four dataset conditions designed for this study. A Fisher z-transform was used to obtain the corresponding confidence interval; however, the small sample size limits the statistical generality of this estimate. Expanding the number and complexity of dataset configurations will be an important direction for future work.
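The small-sample caveat can be made explicit with the Fisher z-transform itself. The sketch below uses an illustrative correlation value (0.95 is an assumption, not the reported statistic) with n = 4 conditions; because the standard error is 1/√(n − 3) = 1 at n = 4, the resulting 95% interval is extremely wide.

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

# Illustrative r with the study's four dataset conditions (n = 4).
lo, hi = pearson_ci(0.95, 4)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # interval spans nearly the full range
```

Even a very high point estimate yields an interval reaching below zero at n = 4, which is exactly why expanding the set of dataset configurations is flagged as future work.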
Overall, this work establishes a controlled foundation for understanding dataset-driven performance in CNNs, but additional empirical validation on large-scale and real-world datasets will be necessary to assess the generality of these findings.