5.1. Experimental Summary
The experimental results demonstrate two major findings regarding CNN behavior and dataset-driven performance prediction. First, the systematic ablation study revealed that object-level pattern features play a decisive role in shaping CNN generalization. Models trained on datasets containing object patterns (Datasets 3 and 4) achieved substantially higher mean accuracy on the synthetic testbed than models trained on datasets lacking such patterns, and a similar gap was observed for F1 scores. These results indicate that internal structural variation, rather than background variation, provides more discriminative information for shallow CNNs.
Second, the dataset similarity prediction algorithm achieved a strong correlation between predicted and observed performance on the synthetic testbed. The algorithm correctly estimated that Dataset 4, which shares the highest feature overlap with the testbed (87%), would yield the strongest performance (∼88.8% accuracy), whereas Datasets 1–3, with lower overlaps (65–75%), would perform comparably to one another but worse. This result provides empirical support for the restricted-form CNN–Apriori equivalence proposed in this work and suggests that feature overlap can serve as a meaningful proxy for expected model performance.
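To make the overlap-based prediction concrete, it can be sketched as a Jaccard similarity between sets of annotated feature labels. The feature names and set contents below are purely illustrative assumptions, not the actual annotations used in this study; the point is only that a dataset whose feature set nearly covers the testbed's scores a high overlap.

```python
def feature_overlap(dataset_feats: set, testbed_feats: set) -> float:
    """Jaccard overlap between two sets of annotated feature labels."""
    intersection = dataset_feats & testbed_feats
    union = dataset_feats | testbed_feats
    return len(intersection) / len(union)

# Hypothetical feature annotations, for illustration only.
testbed = {"stripe", "dot", "grid", "ring", "cross", "blob", "edge", "corner"}
dataset_a = {"stripe", "dot", "grid", "ring", "cross", "blob", "edge"}  # near-complete coverage
dataset_b = {"stripe", "dot", "grid", "noise", "shadow"}                # partial coverage

print(feature_overlap(dataset_a, testbed))  # prints: 0.875
print(feature_overlap(dataset_b, testbed))  # prints: 0.3
```

Under this toy encoding, the high-overlap dataset would be predicted to transfer best to the testbed, mirroring the qualitative ranking reported above.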
Taken together, these findings introduce a novel framework in which dataset feature composition rather than model architecture forms the primary basis for predicting CNN behavior. This contributes to a shift toward dataset-centric evaluation and optimization, offering an efficient alternative to model-driven interpretability methods.
5.2. Limitations and Future Work
Although promising, the proposed framework is subject to several important limitations. First, all experiments were conducted using controlled synthetic datasets in which feature placement, frequency, and class relationships are deterministic. As a result, the behavior of the dataset similarity algorithm and the CNN–Apriori correspondence has not yet been validated on real-world images, where uncontrolled correlations, noise, and entangled features may reduce the stability of these relationships.
Second, the CNN–Apriori equivalence is derived under restricted assumptions, including a single convolutional layer, linear or piecewise-linear activations, no regularization, and a stationary feature distribution. These conditions do not extend to deeper architectures such as ResNet or VGG, where nonlinear interactions and skip connections break the simplifying assumptions required for the equivalence. In such cases, the proposed similarity framework should be viewed as a heuristic for dataset comparison rather than a general theoretical identity.
The computational advantages of the method come with tradeoffs. The dataset similarity estimator scales only with the number of feature groups, since no perturbation sampling or model retraining is required. By contrast, SHAP and LIME typically require a large number of model evaluations per sample. However, the simplicity of the estimator means that it may not capture higher-order feature interactions present in real-world datasets.
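A back-of-envelope comparison illustrates the gap. All three parameters below are assumed values chosen for illustration, not measurements from this study; perturbation-based explainers commonly use on the order of hundreds to thousands of evaluations per explained sample.

```python
# Hypothetical cost comparison: overlap estimator vs. perturbation-based explainers.
feature_groups = 40                 # assumed number of annotated feature groups
samples = 1_000                     # assumed size of the evaluation set
perturbations_per_sample = 1_000    # illustrative order of magnitude for LIME/KernelSHAP

overlap_ops = feature_groups                              # one set comparison per group
perturbation_evals = samples * perturbations_per_sample   # model forward passes

print(overlap_ops, perturbation_evals)  # prints: 40 1000000
```

Even under these rough assumptions, the estimator's cost is several orders of magnitude below that of perturbation-based attribution, which is the tradeoff the text describes.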
While the Overlap metric is stable under synthetic control, its reliability on real datasets remains to be examined. Real-world image features are often nonlinear, unlabeled, and spatially diffuse, making exact feature overlap difficult to quantify. Future studies will explore adaptations of this metric to noisy, high-dimensional visual domains.
Finally, the correlation result was computed using only the four dataset conditions designed for this study. A Fisher z-transform was used to obtain the corresponding confidence interval; however, the small sample size limits the statistical generality of this estimate. Expanding the number and complexity of dataset configurations will be an important direction for future work.
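The small-sample caveat can be made explicit with the Fisher z-transform itself. The sketch below uses an illustrative correlation value (0.95 is an assumption, not the reported statistic) with n = 4 conditions; because the standard error is 1/√(n − 3) = 1 at n = 4, the resulting 95% interval is extremely wide.

```python
import math

def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

# Illustrative r with the study's four dataset conditions (n = 4).
lo, hi = pearson_ci(0.95, 4)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # interval spans nearly the full range
```

Even a very high point estimate yields an interval reaching below zero at n = 4, which is exactly why expanding the set of dataset configurations is flagged as future work.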
Overall, this work establishes a controlled foundation for understanding dataset-driven performance in CNNs, but additional empirical validation on large-scale and real-world datasets will be necessary to assess the generality of these findings.