1. Introduction
The concept of hypernetworks [1,2,3,4,5,6,7] is used to describe a higher-level model that is capable of producing a separate neural network by generating model weights. It is a type of meta-learning architecture in which the hypernetwork produces the weights of a new network, or target network.
Implementing a hypernetwork therefore requires neural networks that can perform tasks they were not necessarily trained on. Hypernetworks can generate entirely new models that require no training of their own and are not informed by transfer learning or one- or few-shot learning. This shifts neural networks away from their initial design and calls for new training and inference techniques that can satisfy the challenging requirements of hypernetworks.
Although several approaches have been proposed to address hypernetworks, this area of advanced computing is still generally considered underexplored. Approaching hypernetworks from a generative framework requires that the training data be given special consideration, given the complexity of the task. While hypernetwork research has explored a variety of approaches, hypernetwork training data is commonly based on conditioning input such as task embeddings, feature distributions, or latent variables [1]. The training data presents common challenges such as high dimensionality and overall generalization.
Despite these efforts, the development of hypernetworks remains a challenging task, and research efforts are still continuing. Here we prepared the first dataset of neural networks designed for hypernetwork research. The ultimate purpose of the dataset is to enable models that can generate neural networks rather than train them.
Datasets of neural networks have been studied in the past [8]. A dataset of neural networks can be used to train a classifier to identify the machine learning problem that a network solves. For instance, a classifier can distinguish between neural networks trained on MNIST and neural networks trained on CIFAR [8]. Other studies aimed at predicting the performance of a neural network classifier [9]. Some architectures have also been proposed for analyzing the weight spaces of neural networks [10,11,12].
While these are based on datasets of neural networks, they were not designed for the purpose of hypernetworks. For instance, the ability to distinguish between a classifier trained on MNIST data and a classifier trained on CIFAR data does not necessarily provide tools that can be used to generate a classifier in the context of hypernetworks. Therefore, the dataset of neural networks described here is based on a single image dataset, Imagenette. Each class in the dataset contains neural networks trained to identify a certain Imagenette class. That is achieved by conceptualizing the problem as a binary classification problem, such that one class contains images from the class of interest, while the other class is a collection of random images from all other classes. Such a dataset can be used to support Generative Adversarial Networks [13,14] that, instead of generating text or images, can ultimately generate neural networks.
A unique trait of hypernetworks is the efficiency they can introduce into the training process compared to traditional feedforward and backpropagation cycles. Primary networks that are lighter and contain a smaller number of parameters can produce larger networks containing a higher number of parameters.
The ability to generate neural networks can ideally lead to solutions to AI tasks without the need to train a neural network for each specific task. Since the training of a neural network is often computationally demanding, generating neural networks can provide a faster and more energy-efficient alternative to training them. It can also lead to a more general AI system that does not require the collection of large training sets for each specific task.
The codebase and dataset are available publicly at https://github.com/davidkurtenb/Hypernetworks_NNweights_TrainingDataset (accessed on 1 July 2025) and https://huggingface.co/datasets/dk4120/neural_network_parameter_dataset_lenet5_binary/tree/main (accessed on 1 July 2025), respectively. Historically, machine learning research has been driven by the availability of benchmark datasets such as ImageNet [15], among many others [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32], that enabled the advancement of the field.
These benchmark datasets served as substantial factors in the rapid progression of machine learning and artificial intelligence. They provide researchers with convenient access to data, allowing them to focus on the development of their algorithms. As benchmarks, they also allow researchers to compare the performance of algorithms developed by different research teams on the same datasets. For instance, the sub-field of automatic face recognition was powered by the availability of face datasets such as ORL [16] or FERET [17]. Similarly, the task of automatic object recognition benefited substantially from benchmark datasets such as ImageNet [15], among many others. Since benchmark datasets of neural networks are not yet available, the availability of this open dataset can assist in the advancement of hypernetwork research.
2. Background
While there are multiple research efforts around the study of hypernetworks and their applications, the subfield is somewhat nascent, with ample areas to be further explored. The core idea of leveraging a higher-order neural network, sometimes containing fewer parameters than the target model, to generate the weights of a separate neural network is a concept that shifts from the “typical” manner in which neural networks are created. The concept gained its initial traction with the work of [1], which sought frameworks to expand the existing methods of training a neural network.
For instance, hyper-representations with layer-wise loss normalization were used to aggregate knowledge from model zoos [6], allowing new models to be generated based on that knowledge.
Bayesian hypernetworks [2] extend Bayesian deep learning by transforming a noise distribution into a distribution over the parameters of a different neural network. They have been demonstrated to be more resistant to adversarial data [2].
Applications of hypernetworks have seen a number of use cases with a variety of applicability. Their potential has spread across multiple domains such as meta-learning, continual learning, neural architecture search, and reinforcement learning [33,34,35,36]. In particular, they have the ability to train neural networks in cases of limited training data with few-shot learning.
For instance, hypernetworks have been used to improve continual learning. By using the concept of task-conditioned hypernetworks, it has been shown that it is possible to overcome the problem of catastrophic forgetting in “standard” artificial neural networks trained on several different tasks [34].
The task of continual learning using hypernetworks was also studied by [36], using task-conditioned hypernetworks to make learning sufficiently fast. The use of these hypernetworks makes on-the-fly learning practical, thereby allowing one to avoid the relatively long response times typical of stationary learning models.
The concept of Graph HyperNetworks was used to identify the most effective neural network architecture for a certain machine learning problem without the computationally challenging need to train and test all of these architectures [35].
Hypernetworks have also been found effective in representations of conditional sentences [7], which involve embedding pre-computed conditions into the corresponding layers, allowing a sentence to be handled differently based on the condition.
Hypernetworks have demonstrated theoretical value in their application to advance continual learning by resolving catastrophic forgetting. In traditional neural networks, model weights are adjusted during the training process and then remain static until the model is retrained. Hypernetworks redefine that paradigm by proposing the notion of dynamic weights. The application of a dynamic-weight scheme serves to improve network adaptability and performance [1].
While the study of hypernetworks presents promising potential, they are not without their challenges. Hypernetworks have faced stability and scalability concerns as models grow increasingly complex [37]. These challenges are amplified by computational requirements, which have also been difficult to overcome. The relationship between a hypernetwork and its target network must be carefully designed in their architecture. Another significant challenge of hypernetworks is having access to a robust and relevant training dataset. The work covered in this paper aims to begin resolving this challenge and to create a path toward novel applications of hypernetworks in conjunction with generative approaches.
3. A Dataset of Neural Networks
The dataset contains 10,000 instances of neural networks, divided into a total of 10 classes. Each neural network is a two-way, one-versus-all image classifier, and each class contains 1000 neural networks that can identify the images of that class. The different classes are taken from the Imagenette dataset [15], specifically the Imagenette 320 px V2 dataset with classes 0: Tench, 1: English Springer, 2: Cassette Player, 3: Chain Saw, 4: Church, 5: French Horn, 6: Garbage Truck, 7: Gas Pump, 8: Golf Ball, and 9: Parachute.
Imagenette is a well-studied benchmark dataset in a mature stage of its life cycle. That allows one to minimize risks such as missing data, imbalanced classes, or label accuracy, which can be a problem with new datasets [38].
The code repository also includes model performance metrics, aggregated by class, and performance plots for each of the 10,000 models. Additionally, to further drive accessibility, the model parameters of the 10,000 LeNet-5 binary classifiers have been compiled into two files. One file condenses the weights and biases by model, referred to as modelwise: the individual parameters of each model are provided as a single flattened tensor. The other file captures parameters across classes by layer, referred to as layerwise: the parameters of each of the 10 classes are saved by layer. For example, the “church”/“conv2d” dictionary contains the parameters (combined weights and biases) of the first convolutional layer for all 1000 LeNet-5 models trained for binary, one-vs.-all classification of a church.
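To make the modelwise layout concrete, the sketch below flattens hypothetical per-layer weight and bias arrays into a single parameter vector with NumPy. The layer names and array shapes here are illustrative assumptions, not the dataset's exact contents:

```python
import numpy as np

# Hypothetical per-layer parameter arrays for one trained classifier
# (shapes are illustrative, not the exact shapes used in the dataset).
rng = np.random.default_rng(0)
layer_params = {
    "conv2d":   (rng.normal(size=(5, 5, 3, 6)),  np.zeros(6)),
    "conv2d_1": (rng.normal(size=(5, 5, 6, 16)), np.zeros(16)),
    "dense":    (rng.normal(size=(400, 120)),    np.zeros(120)),
    "dense_1":  (rng.normal(size=(120, 2)),      np.zeros(2)),
}

def flatten_modelwise(params):
    """Concatenate alternating weight/bias arrays into one flat vector,
    mirroring the 'modelwise' file layout described in the text."""
    pieces = []
    for name, (w, b) in params.items():
        pieces.append(w.ravel())   # weights first...
        pieces.append(b.ravel())   # ...then the matching biases
    return np.concatenate(pieces)

flat = flatten_modelwise(layer_params)
print(flat.shape)  # one flattened parameter tensor per model
```

The layerwise file reverses this organization: instead of concatenating arrays within one model, the same per-layer arrays would be grouped by layer name across all models of a class.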
To generate the dataset of neural networks, each neural network is trained as a two-way classifier. All images of the first class are taken from one class of Imagenette, while the images of the other class are taken randomly from all other Imagenette classes. Each model's training dataset contains 9–10% images of the target class.
That leads to 10 classes such that each class contains 1000 neural networks that can identify images of one class against all other classes; the dataset is therefore balanced [29]. All models were trained for 25 epochs and achieved an average accuracy of 91.5%. Because the images of the other class are selected randomly, each neural network in the dataset was trained with different images, and it is therefore different from the other neural networks in its class.
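The sampling scheme described above can be sketched as follows; `make_one_vs_all` is a hypothetical helper, and the fixed 1:9 positive-to-negative ratio is an assumed reading of the 9–10% target-class fraction:

```python
import numpy as np

def make_one_vs_all(labels, target_class, rng):
    """Build a one-vs-all binary task: all images of the target class
    are positives, and negatives are drawn at random from the
    remaining classes (as described in the text)."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == target_class)
    neg_pool = np.flatnonzero(labels != target_class)
    # Roughly a 1:9 positive/negative ratio, so positives make up
    # about 9-10% of each model's training set.
    neg = rng.choice(neg_pool, size=9 * len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    rng.shuffle(idx)
    return idx, (labels[idx] == target_class).astype(int)

# Toy example with 10 balanced classes of 100 samples each.
rng = np.random.default_rng(42)
toy_labels = np.repeat(np.arange(10), 100)
idx, y = make_one_vs_all(toy_labels, target_class=3, rng=rng)
print(y.mean())  # fraction of positives, ~0.1
```

Because the negative pool is re-sampled for every model, repeating this call with different seeds yields 1000 distinct training sets per class, which is what makes the networks within a class differ.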
The architecture used for this dataset was LeNet-5 [18]. The motivation for selecting a relatively simple architecture was to ensure that the generation of the dataset was computationally practical. Another reason was to avoid the curse of dimensionality by using an architecture with fewer weights than other common architectures such as ResNet or VGG. A deeper architecture would have a higher number of parameters, making it more challenging to use for the purpose of generating new neural networks due to the higher dimensionality.
Training a very large number of neural networks is a computationally intensive task. The training required over twenty-seven hours on a powerful computing cluster with more than 10,000 cores. The cluster was made up of 1296 cores of Xeon E5-2690, 1296 cores of Xeon E5-2680, 2048 cores of Xeon E5-2683, 2400 cores of Xeon E5-2630, 1823 cores of Xeon Gold 6130, 2176 cores of AMD EPYC 7452, and 96 Nvidia GeForce RTX 2080 Ti GPUs, making up a large cluster with a total of 11,039 cores. Xeon processors were manufactured by Intel, Santa Clara, CA, USA. EPYC processors were manufactured by AMD, Santa Clara, CA, USA. GeForce GPUs were manufactured by Nvidia, Santa Clara, CA, USA.
Using a deeper architecture with more parameters would have led to a dataset that would be impractical to generate even with a powerful cluster. Additionally, a relatively simple architecture simplifies the analysis and use of the dataset. Such an analysis can include training a neural network that classifies neural networks, or generating neural networks automatically.
Table 1 shows the classification accuracy, precision, recall, and F1 score of the neural networks of the different classes. Since each class contains 1000 neural networks, and each neural network is trained separately using different data, the performance of the neural networks contained in each class is not expected to be identical.
LeNet-5 Model Training Specifications
The proposed dataset of neural networks contains simple neural networks trained as one-versus-all binary classification models. As mentioned in Section 3, these neural networks follow the LeNet-5 architecture. The total number of trainable parameters in each model is 91,481. For comparison, the number of parameters in the common ResNet-50 architecture is over 25 million.
Table 2 summarizes the LeNet-5 architecture and the number of parameters.
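To make the parameter accounting concrete, the sketch below sums per-layer counts for the classic LeNet-5 configuration on 32 × 32 RGB input. These layer sizes are illustrative assumptions only: this textbook configuration does not reproduce the reported 91,481 total, which corresponds to the exact configuration given in Table 2:

```python
def conv2d_params(in_ch, out_ch, k):
    """Weights plus biases for a 2-D convolution with k x k kernels."""
    return in_ch * out_ch * k * k + out_ch

def dense_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out

# Classic LeNet-5 layer sizes (illustrative; the dataset's exact
# configuration is the one summarized in Table 2).
total = (conv2d_params(3, 6, 5)       # first convolutional layer
         + conv2d_params(6, 16, 5)    # second convolutional layer
         + conv2d_params(16, 120, 5)  # third convolutional layer
         + dense_params(120, 84)      # first dense layer
         + dense_params(84, 2))       # second (output) dense layer
print(total)  # prints 61326 for this illustrative configuration
```

The same two helper functions applied to the layer sizes of Table 2 yield the 91,481 total reported above.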
Each model produced a total of 10 arrays containing alternating model weight and bias information, saved as an HDF5 file. The length of each array varies, ranging from 1 to 48,000 parameters. Using the model weights as a source of training data presents a unique approach to the training of hypernetworks. Because each neural network is trained with different images, the distribution of the weights within each class of model is distinct. Most of the individual model weights were near-zero numbers.
Figure 1 displays the distribution of all weights across all classes, and Figure 2 displays the weight distributions separated by class. The plots are scaled to highlight the near-zero distributions of each model due to the large concentration of values within this range. The values of the weights are not identical, which is expected given that each neural network is trained with a different set of images. The distinct curves for a given class provide evidence of the distinct patterns and features across weight values for the object-classification LeNet-5 models. The distributions by layer can be found in Figure A3, Figure A4, Figure A5, Figure A6 and Figure A7 in Appendix A.
Table 3 shows the distribution of common weights in the trained neural networks among the different classes.
4. Parameter Distribution
Understanding the distributions and the distinctions in patterns between layers, separated by class, is critical to learning the characteristics of the dataset. It is not just the overall distribution per model that is important; one should also examine the distributional differences across the LeNet-5 model layers. Analysis was performed to better understand the distributions as well as to compare divergence between classes.
As parameters traverse the LeNet-5 architecture, there is an expected reshaping of their distribution. The convolutional layers reduce the total range of the distribution, which then undergoes significant transformations as information passes through the dense layers. Each class has its own unique pattern but follows a similar profile. The Jensen–Shannon (JS) divergence was used to assess the level of similarity between class parameter distributions by layer. The JS divergence by layer is displayed in Figure A1 and Figure A2 in Appendix A. As expected, the parameter distributions at the third and final convolutional layer were the most similar, with nearly overlapping characteristics when comparing two different classes. This is expected, as convolutional layers reduce complexity within the distributions and limit the feature space, an aspect of convolutional layers’ ability to focus on spatial relationships and employ weight sharing. At the opposite extreme, the second dense layer was the most diverse. This again is expected, as the dense layer connects all neurons passed by the first dense layer, which opens up the range of the parameter distribution.
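The comparison described above can be sketched as follows, computing the JS divergence between shared-bin histograms of layer weights; the weight samples here are synthetic stand-ins rather than the actual dataset:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (log base 2, so the result lies in
    [0, 1]) between two discrete distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Compare the (synthetic) weight distributions of one layer across two
# classes by histogramming the weights on a shared set of bins.
rng = np.random.default_rng(0)
w_class_a = rng.normal(0.0, 0.05, 10_000)  # near-zero weights
w_class_b = rng.normal(0.0, 0.08, 10_000)
bins = np.linspace(-0.5, 0.5, 101)
p, _ = np.histogram(w_class_a, bins=bins)
q, _ = np.histogram(w_class_b, bins=bins)
d = js_divergence(p, q)
print(round(d, 4))  # 0 = identical distributions, 1 = fully disjoint
```

Applying such a computation layer by layer, for every pair of classes, yields per-layer similarity matrices of the kind shown in Figure A1 and Figure A2.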
5. Automatic Classification of Neural Networks
To further explore the potential of the dataset in developing hypernetworks, the model weights were used in classification tasks. By demonstrating that the training set can be effectively classified using traditional machine learning and deep learning approaches, one can reason that the model weights contain ample features. This is a primary requirement for developing hypernetworks from a robust training dataset.
As mentioned in Section 3, the dataset is fully balanced and contains no missing values. Therefore, classification accuracy higher than mere chance reflects the ability of the classifier to distinguish between the neural networks. The effectiveness of the classifier was measured by the classification accuracy [29], as well as the specificity, sensitivity, and F1 score.
5.1. Classification Methods
Traditional methods of classification were applied to establish a baseline of model performance. Given the high dimensionality of the data, a deep learning model was also applied. Classification used the layer weights and biases, with a total of 91,481 parameters per model. Following standard practices [39], the experiments were performed such that 70% of the samples were allocated for training, and the rest of the data was used for testing/validation.
The deep neural network that was used is a fully connected multi-layer perceptron with three hidden layers of sizes 256, 128, and 64, with batch normalization. The activation functions are ReLU, and the dropout rate was set to 0.6.
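A minimal PyTorch sketch of this classifier is shown below. Only the input size, layer widths, batch normalization, ReLU activations, and 0.6 dropout come from the text; the class name and everything else (including the absence of an explicit training loop and its hyperparameters) are assumptions:

```python
import torch
import torch.nn as nn

class WeightSpaceMLP(nn.Module):
    """Sketch of the classifier described above: a fully connected
    network over flattened model weights (91,481 inputs, 10 classes)
    with batch normalization, ReLU, and dropout of 0.6."""
    def __init__(self, n_in=91_481, n_classes=10, p_drop=0.6):
        super().__init__()
        layers = []
        for width in (256, 128, 64):  # hidden layer sizes from the text
            layers += [nn.Linear(n_in, width), nn.BatchNorm1d(width),
                       nn.ReLU(), nn.Dropout(p_drop)]
            n_in = width
        layers.append(nn.Linear(n_in, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = WeightSpaceMLP()
logits = model(torch.randn(8, 91_481))  # a batch of 8 weight vectors
print(logits.shape)  # torch.Size([8, 10])
```

Such a network would be trained with a standard cross-entropy loss over the 10 class labels; the regularization (batch normalization and heavy dropout) reflects the overfitting concerns discussed below.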
5.2. Classification Results
The results for the entire model are summarized in Table 4. As the table shows, the classification accuracy is far higher than the expected 10% chance level, showing that the neural networks can be differentiated from each other by their weights. Naive Bayes achieves the highest classification accuracy of 72%.
The results observed using deep learning classification capture some of the challenges within the subfield of hypernetworks. The high dimensionality of model weights is challenging to work with and prone to overfitting. Even within this example, practices such as batch normalization, dropout, regularization, random search parameter tuning, and experimentation with model architecture were used with minimal success in terms of improving accuracy.
Figure 3 shows the loss and accuracy of the deep learning model when using all weights. The training/validation loss by layer is shown in Figure A8 and Figure A9 in Appendix A.
6. Discussion
The dataset of 10,000 neural networks introduced here was designed specifically for hypernetwork research. It is therefore important that the neural networks be distinguishable through an automatic process, as that shows that the weights of the different neural networks exhibit different patterns that are identifiable by machine learning algorithms.
An attempt to use a classifier that can predict the class that a neural network identifies showed that the classifier can identify the class through the weights of the neural network at an accuracy far higher than mere chance. That provides an indication that the dataset can be used for studies that involve machine learning.
For the purpose of automatic classification of neural networks, the deep neural network did not perform well compared to other algorithms, while Naive Bayes showed the best performance. Naive Bayes assumes that each parameter is independent and therefore performs well when the input variables are independent of each other [40]. Weights in a neural network are largely independent values; for instance, a weight normally cannot be predicted from other weights, unlike other types of data such as the values of pixels in an image. It can therefore be expected that Naive Bayes provides the best classification accuracy for this specific task.
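This reasoning can be illustrated with a hedged scikit-learn sketch; the synthetic generator below merely mimics per-class differences in weight distributions and is not the published dataset, and its 200-dimensional vectors stand in for the real 91,481-dimensional ones:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: flattened "weight vectors" whose
# per-class means differ slightly, mimicking the distinct per-class
# weight distributions described in the text.
rng = np.random.default_rng(0)
n_per_class, n_dims, n_classes = 100, 200, 10
X = np.concatenate([rng.normal(loc=0.01 * c, scale=0.05,
                               size=(n_per_class, n_dims))
                    for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# 70/30 split, mirroring the protocol described in Section 5.1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

# GaussianNB treats every input dimension as independent, matching the
# independence argument made above for neural network weights.
clf = GaussianNB().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc > 0.1)  # well above the 10% chance level
```

Because the per-dimension likelihoods factorize, Naive Bayes scales gracefully to very high-dimensional inputs, which is consistent with its strong performance on the 91,481-dimensional weight vectors.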
The fact that the neural networks can be separated using machine learning indicates the existence of patterns in the weights. The presence of such patterns is also an indication that such distributions can be produced by generative AI for the purpose of hypernetworks. Generative AI is often used to generate images, audio, video, text, and code [41]. Tools such as AlphaEvolve [42] show that it can also be used to generate new algorithms. Here we provide research resources for exploring the contention that generative AI can also be used to generate artificial neural networks.
For the direct purpose of generative AI, the classifier of neural networks shows that a GAN discriminator is feasible. The results can also be used as a baseline for future algorithms that classify between neural networks. Improving the classification accuracy can lead to better discriminators.
7. Conclusions
Here we introduced an open dataset for the study of hypernetworks. The generation of the dataset involved substantial computing resources, resulting in neural networks separated into 10 classes based on Imagenette data. The purpose of the dataset is to enable the research of hypernetworks. The dataset is open and available to the public. Using a known dataset such as Imagenette to generate the neural networks will allow one to better understand the nature of the content of the dataset, but it can also allow one to expand the dataset in the future by training new image classes against the Imagenette images.
While datasets of neural networks exist, the dataset described here is designed specifically for the purpose of hypernetwork research. For instance, it is based on a single dataset, rather than an attempt to distinguish between neural networks trained on two completely different datasets [8]. It also uses a single neural network architecture, as it does not aim at identifying the ideal architecture for a given classification problem [9].
The dataset of neural networks, separated into 10 classes, is far smaller than the number of classes and images in a dataset such as ImageNet. Another limitation is that it covers only one CNN architecture. Naturally, large datasets of neural networks require substantial computing resources to generate each sample and are far more demanding than adding an image sample to a “traditional” dataset. A more complex CNN architecture would have a higher number of parameters and would require far more powerful computing resources to train. Yet, the dataset can provide research infrastructure for the development of the concept of hypernetworks and can be used for a variety of purposes, including supervised machine learning, unsupervised machine learning, and generative AI.
The dataset is based on the relatively simple LeNet-5 architecture, which can be trained within a reasonable time using a powerful computing cluster. Future benchmarks will include other common architectures such as ResNet, although using more complex architectures with a higher number of parameters will require substantially stronger computing resources. A higher number of parameters will also require more complex hypernetworks trained on these neural networks. That will require stronger computing and longer training, not merely to generate the dataset but also to train the hypernetworks.
Future work will also include the development of GANs that can generate neural networks. While GANs are often used to generate images or text, they can also be used to generate neural networks. That, however, requires a suitable dataset of neural networks that can allow the training of a GAN that generates neural networks. Such GANs will require modification to the commonly used GAN architectures. The availability of datasets of neural networks as described here can enable the development and testing of such GANs.