Genetic-Based Lottery Ticket Pruning for Transformers in Sentiment Classification: Realized Through Lottery Sample Selection

Ryan Bluteau; Robin Gras; Gabriel Peralta

doi:10.3390/a18120798

,

and

School of Computer Science, University of Windsor, Windsor, ON N9B 3P4, Canada

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Algorithms2025, 18(12), 798;https://doi.org/10.3390/a18120798

This article belongs to the Special Issue Advances in Algorithms Through Heuristics: Theory, Applications, and Innovations

Version Notes

Order Reprints

Review Reports

Abstract

In the growing field of Natural Language Processing (NLP), transformers have become excessively large, pushing the boundaries of both training and inference compute. Given the size and widespread use of these models, there is now a strong emphasis on improving both training and inference efficiency. We propose an approach to reduce the computational requirements of transformers. We specifically tested this approach using BERT for sentiment classification. In particular, we reduced the number of attention heads in the model using the lottery ticket hypothesis and an adapted search strategy from a genetic-based lottery ticket pruning algorithm. This search process removes any need for full-sized model training and additionally reduces the training data by up to 95% through lottery sample selection. We achieve leading results in lossless head pruning with a 70% reduction in heads, and up to a 90% reduction with only a 1% F1 loss allocated. The search process was efficiently performed using 5% of training samples under random selection and was further shown to work with just 0.5% of samples by selecting a diverse set of sample embeddings. Inference time was also improved by up to 47.2%. We plan to generalize this work to Large Language Models (LLMs) and language generation tasks to improve both their training and inference requirements.

Keywords:

lottery ticket hypothesis; pruning; lottery sample selection; transformer; sentiment

1. Introduction

The recent development of the transformer [1] has led to state-of-the-art results across most language tasks in Natural Language Processing (NLP). To achieve this, most Large Language Models (LLMs) require a tremendous amount of computation in both training and inference stages (particularly with chain-of-thought models [2]). As models have become valuable as chat-based products and as core components of resource development, this has resulted in a major shift from open-source to closed-source models. However, a few models like the recently released DeepSeek R1 [3] remain fully open-sourced while still being industry leading.

More importantly, DeepSeek-AI et al. [3] provide several distilled variations in the model, allowing for local and private access. Despite these open-source options, one issue remains in the industry: the barrier to creating LLMs locally (from scratch). Many pre-trained models, even those open-sourced, are trained with limitations on allowable content, whether appropriate or not. Unfortunately, these models are far too large to recreate (for example, R1 has 671B parameters), making it impractical to remove such content restrictions through retraining, and pruned or distilled versions may never eliminate them.

In this work, we aim to improve the efficiency of transformers and create a path toward generalizing this approach to LLMs, ultimately providing researchers with a means to independently develop their own models. We make use of the lottery ticket hypothesis, lottery sample selection, and a genetic algorithm to effectively prune an existing pre-trained transformer. In particular, we prune the encoder-based model BERT [4] for sentiment classification, which advances our prior work on pruning tabular models [5]. Naturally, future work will focus on increasing model size (trained from scratch) and generalizing to language generation.

We also make use of a popular sentiment analysis dataset, Sentiment140 [6], which contains 1.6 M samples labeled for positive or negative sentiment. Sentiment classification is widely used in many areas such as ratings (product or media reviews), social network analysis (comment and user sentiment toward topics, media, products, etc.), recommendation systems, and more.

We have also conducted preliminary work on testing BERT and its hyperparameters for sentiment classification tasks [7]. This work introduced a 0-shot labeling strategy for classification tasks. We adopt both the hyperparameter tuning and 0-shot labeling strategy from that work to augment our sentiment data with auxiliary labels and to mine new datasets for additional training data. Given an auxiliary label and a sample text, the 0-shot model ingests the data in the following format: “[CLS] premise [SEP] hypothesis [SEP]”, where the premise is the sample text and the hypothesis is “This example is auxiliary label”. For example, if the sample text were “Today is a great day” and we wanted to predict for the label “unhappy”, then we provide the model with the following: “[CLS] Today is a great day. [SEP] This example is unhappy. [SEP]”. This then uses the model’s next-sentence prediction mechanism to determine whether the hypothesis follows the premise, which directly translates into whether the auxiliary label applies to the sample text.

Putting all our work together, we create an approach to efficiently search head configurations in BERT to optimize its size for sentiment classification. Our method uses a genetic algorithm to search the head configurations using only a small fraction of training samples (5%). We then augment the final pruned model’s performance using data augmentation through 0-shot labeling.

Contributions:

In total, 70% lossless pruning for F1 sentiment classification.
In total, 90% pruning with ∼0.684% F1 loss.
Up to 47.2% inference improvement without increasing batch size.
Genetic pruning searches using 0.5% of training data (0.35% of the total dataset).
No training required on a full-sized model with the full dataset (immediately test with small models on partial datasets).
Zero-shot labeling for full datasets in classification tasks (emotion and sentiment tested) without the cost of an LLM.
Introduced a new topic-based compression strategy using the LDA algorithm.

This paper is organized as follows: We introduce previous works on head pruning and lottery ticket pruning in Section 2, we explain the methodology of our approach in detail in Section 3, we present our experimentation and discussion of results in Section 4, and finally, we conclude in Section 5.

2. Previous Works

The paper introducing the lottery ticket hypothesis [8] describes a process to identify neural network weights that are considered to be lottery weights. A lottery weight is simply a weight that is part of a subnetwork of the original network where the subnetwork performs as accurately as the original when both are trained independently. What differentiates lottery ticket pruning from a typical pruning approach is how the weights are trained. Instead of simply removing poor-quality weights from a trained network to establish a subnetwork, the weights undergo a second phase in which they are reset to their original untrained values and the subnetwork is retrained as a whole. This impacts accuracy significantly, as simply removing weights would degrade accuracy due to lost information (from the training data), while retraining the subnetwork from scratch can retain that information.

There have been many works demonstrating that the lottery ticket hypothesis holds for a variety of networks by showing subnetworks that are over 80% pruned yet remain as accurate, if not more accurate, than the original network. For example, it has been shown that vision models like convolutional neural networks (CNNs) can be pruned to at least 80–90% [8,9,10]. The hypothesis has also been tested on pre-trained networks typical for NLP, such as BERT [4,11,12,13]. When pruned using pre-trained weights, these models are typically evaluated on downstream tasks rather than the original language-learning objective.

However, many approaches used to prune a pre-trained network typically rely on a mask [11,13,14,15] or a learned mask [16,17]. The mask acts as a second layer to the network, increasing its size, and outlines which weights are allowed to retain information. Any area with a zero value in the mask is effectively equivalent to pruning that weight physically from the network, with the caveat that the weight (plus the mask) is still processed. This is a common and convenient approach since it is easy to implement and cleanly demonstrates the validity of the hypothesis.

Our goal is to show that the lottery ticket hypothesis can effectively reduce computational time (particularly at inference). This requires a structural manipulation of the network that physically removes pruned weights. In our previous work [5], we removed weights of a tabular neural network by initializing a new subnetwork and copying selected lottery weights into the new structure. It was shown that this creates smaller networks (>90% pruned) that were sometimes more accurate than the original, with the added benefit of improved inference speed.

However, this approach is unrealistic for larger models like the transformer. Instead, several works remove larger components of the model, such as attention heads, rather than individual weights or nodes. This idea was introduced in the paper “Are Sixteen Heads Really Better than One?” [18], which shows that transformer heads in BERT are largely redundant. This is a costly part of the transformer, since attention is an

O (n^{2})

operation where n is the number of input tokens, and thus directly impacts the context window of the model. Michel et al. [18] “prune up to […] 40% of heads from […] BERT (respectively), without incurring any noticeable negative impact” (p. 6).

Furthermore, Parnami et al. [19] pruned “40% of the attention heads in the BERT transformer model with no loss in accuracy” (p. 1) for sentiment classification using A* search to find the best head configuration. Their search heuristic was the accuracy impact of the model when removing the current head from the current state, guiding the search by the observed accuracy drop. Behnke and Heafield [20] demonstrated that the approach works for translation, reporting that they pruned “(50–72%) in a large transformer with no significant damage to translation quality” (p. 2672), where 50% represents no significant damage. For the GLUE benchmark, Zhang et al. [21] used KL divergence to achieve a 50% head-pruning rate. Li et al. [22] also used the GLUE benchmark and achieved a 58% prune rate using a differentiable mask, followed by distillation to compress the network.

Like in our prior work [5], we use a genetic algorithm (as do several other works [23,24,25,26]) to find the lottery ticket weights. In addition, we perform lottery sample selection to guide pruning. This time, we prune BERT to show that our approach works for pre-trained language models using the head-pruning strategy. This work validates the lottery sample selection approach, which optimizes the search process for both large models and large datasets, and provides a path toward generalizing to LLMs.

3. Methodology

In this section, we introduce the methodology of our head-pruning approach to prune BERT for sentiment classification. First, we discuss the datasets involved, then our data augmentation strategy, and finally the head-pruning approach based on a genetic algorithm, the lottery ticket hypothesis, and lottery sample selection.

3.1. Datasets

We primarily use two large datasets, Sentiment140 [6] and Toxicity [27]. Sentiment140 is our main dataset for training and evaluating the accuracy of our approach. The dataset has 1.6 M samples evenly labeled for positive or negative sentiments. Note that although the dataset description mentions a third neutral label, none appear in the training data. We intentionally selected a binary classification task with an even class distribution. While our goal is to generalize this work to any classification dataset, future work will focus on evaluating this approach on additional tasks, particularly language generation.

The toxicity dataset is used primarily for our data augmentation strategy. It contains roughly 1.8 M samples labeled for toxicity, of which fewer than 100,000 are toxic. The samples come from a social network aimed at reducing toxicity, and the platform provided the ratings. The dataset also includes many auxiliary labels related to the authors. Despite this, we treat the dataset simply as a source of additional text, discarding all labels. Like Sentiment140, the toxicity dataset is not centered around any single category or topic, making it a natural augmentation source.

We also tested augmentation strategies such as translations and text replacement. For example, we chained 5–20 translations that loop back to English and/or replaced the end of a sample (10–90%) with a GPT-2-generated sequence. We found that applying more transformations can change the sentiment of the sample by 30–40% (using 0-shot predicted sentiment [7] as a reference). Because of the computational cost and the degree of sentiment drift introduced, we opted to use naturally occurring text such as the toxicity dataset instead. In short, human-generated data appears to be the best training source for transformers.

Data Augmentation

Since the toxicity dataset does not include sentiment labels, we generate them using our 0-shot algorithm [7], which can create auxiliary classification labels such as emotions using next-sentence prediction. The method formulates a synthetic “next” sentence correlated with the desired auxiliary label. Although direct prediction of sentiment using “positive” and “negative” is possible, the approach is significantly stronger when using many similar labels (e.g., emotions). This acts similarly to an ensemble, improving accuracy as the number of predictors increases. Accuracy gains were shown to plateau around 40 emotions, so we selected 20 emotions for each class (positive and negative). See Table 1 for the list of emotions used.

Table 1. On the left are the positive sentiment auxiliary emotion labels. On the right are the negative labels.

To produce a sentiment label, we compare several approaches: direct sentiment prediction; best individual emotion prediction (using the “unhappy” emotion); SoftMax of “positive” and “negative” sentiment labels; compression of labels using an autoencoder (AE), variational autoencoder (VAE), and principal component analysis (PCA); and finally, our custom topic-based compression using latent Dirichlet allocation (LDA). Compression strategies reduce the emotion predictions to one dimension, which is then interpreted as positive or negative using a 0.5 sigmoid threshold.

Our custom LDA compression acts as an unsupervised sentiment classifier and can, in principle, work for any text classification task given a list of classes and associated auxiliary labels. Each sample is assigned emotions if the prediction is >0.5, producing a mix of positive and negative emotions.

Our goal is to use topic clustering on the input documents and their emotion assignments to generate sentiment labels. See Figure 1 for an overview. Although topics do not inherently correlate with sentiment, we can create sentiment-aware topics by prepending the detected emotions to each sample. Since LDA does not rely on word order, the presence of these emotion labels contributes to strong topic formation, causing topics to include emotion words. Figure 2 shows an example classification.

Figure 1. Overview of our custom LDA compression strategy to create sentiment labels from topic clustering and zero-shot emotions.

Figure 2. Example of making a prediction using our LDA classifier. Highlighted for clarity, in red are examples of negative emotions and blue examples of positive emotions.

The topics formed are collections of words with assigned weights. If a document contains many of a topic’s words, they contribute to a final score indicating the probability that the document belongs to that topic. After forming topics, we assign each sample to the topic with the highest probability. We then calculate class probabilities within the topic by filtering out all non-emotion components, leaving only emotion words and their associated topic weights. We sum the positive weights and negative weights separately; whichever sum is higher determines whether the sample is considered positive or negative.

We apply all approaches listed above to generate sentiment labels for the Sentiment140 dataset, then compare them against the actual labels. Table 2 shows the results. Most approaches outperform direct sentiment prediction, with relatively small differences across compression strategies. We keep all compressed labels and generate sentiment labels for the toxicity dataset as well. The best approach is selected later during pruning. We find that the LDA approach significantly outperforms the other strategies when the goal is model compression.

Table 2. Comparison of approaches on sentiment compression. We used the same train/valid/test split as the lottery search to tune the algorithms for AE, VAE, and LDA and listed hyperparameters. The remaining approaches are 0-shot scores.

3.2. Genetic Lottery Transformer Head Pruning

We now focus on the pruning aspect of our approach. At this stage in our methodology, we have two datasets: one labeled with actual sentiment (Sentiment140), and both labeled with 0-shot generated sentiment (Sentiment140 and toxicity). See Figure 3 for an overview of our data preparation.

Figure 3. Flowchart of the sentiment and toxicity dataset preparation. This process creates artificial labels for both datasets using 0-shot generated emotion labels followed by compression to sentiment labels. Note that no label leakage occurs since the labels are all 0-shot emotion predictions from the same model (BERT) without knowledge of the original dataset labels. This is why we can run the same process on the toxicity dataset which has no sentiment labels. Blue states are datasets, purple states are processes.

Our pruning algorithm is based on the genetic algorithm design developed in our previous work on tabular models [5]. Here, we adapt the approach to prune BERT’s transformer heads instead of model nodes. While attention heads are not the largest component of the network in terms of parameter count, they are responsible for attention computations, which are

O (n^{2})

operations. The genetic algorithm is largely unchanged except for the removal of node-measurement components (such as norms). See Figure 4 for an overview.

Figure 4. Flowchart of our genetic algorithm implementation. The approach consumes just a few samples for training and evaluates on the validation dataset to determine fitness. The test set is only used once a top-performing model is found.

We run the algorithm until validation F1 plateaus, typically at 40–80 epochs (more for higher prune rates/smaller models) with a population of 40–70 (again, more for higher prune rates). We use a 70:15:15 train/valid/test split, with 95% of training held back during the genetic search and 99.5% held back for subsequent evaluation. Thus, the genetic search uses the following splits: 3.5:15:15 (66.5% held back) and 0.35:15:15 (69.65% held back) in our major tests, making the validation split the main cost of the search. The same random selection of samples was used across all runs.

We drop the bottom 10% of the population, then apply a counter-crossover algorithm with a 50% crossover rate. We also add 10% global elites. All individuals, including elite members, undergo one mutation.

We illustrate our evaluation approach in Figure 5. For each generation, we consider the top-performing individual. We do not perform a full refinement step (training on all data) for every individual, only for top performers that demonstrate meaningful improvement on the 5% data used during pruning. Once improvement plateaus (typically around 40 generations), we train the top model on all augmented data and 0-shot compressed labels (Sentiment140 + full toxicity), then fine-tune using Sentiment140 with actual labels. During the genetic search, each individual’s training is restricted to 5% of the dataset, so training takes roughly one minute. See Figure 6 for an illustration of the efficiency gained. Individuals are independent, so the algorithm can be parallelized per population, enabling evaluation of an entire generation in just a few minutes on suitable hardware.

Figure 5. Flowchart of our evaluation of pruned models and how our datasets are used. The genetic algorithm extracts top-performing individuals, which are then evaluated and trained on additional labels from our 0-shot generation process. Blue states are datasets, purple states are processes.

Figure 6. We compare through illustration the efficiency of various pruning styles. Red highlights our illustration of compute required as a percentage of the square, vertical change being model efficiency and horizontal being data efficiency. On the top is masked pruning, which does not remove any parameters (in fact, it requires more computation) and uses the entire dataset. In the middle is the typical head-pruning strategy, which removes parameters but still uses the entire dataset. To enable genetic search (bottom), we use lottery sample selection, which reduces both parameters and training data during search.

Finally, we designed an approach to improve on random lottery sample selection. In a secondary test, we select fewer than 5% of the training samples to train and evaluate individuals. The selection is not random; instead, it is based on embeddings produced by BERT. We cluster the training samples using K-means, with K equal to the desired number of lottery samples, and select the sample at the center of each cluster. This selects a diverse set of representative samples, reducing redundancy and enabling smaller training sets. This serves as a proof of concept for lottery sample selection; future work may explore other clustering methods or alternative selection criteria.

Genetic Algorithm

Each individual is a binary vector of length H (number of heads). A value of 1 means the head is kept; 0 means pruned. All individuals maintain a fixed prune rate. We include the genetic search process outlined in our methodology as Algorithm 1. Table 3 outlines a list of hyperparameters used in our algorithms.

Table 3. Genetic algorithm hyperparameters.

We maintain a global head-score vector

G \in R^{H}

. Let

E

be the set of elite individuals in the current generation, each with binary mask

m \in {0, 1}^{H}

and fitness

f (m)

(validation F1). We define normalized elite weights

w (m) = \frac{f (m)}{\sum_{m^{'} \in E} f (m^{'})},

and update G by

G \leftarrow (1 - α) G + α \sum_{m \in E} w (m) m,

with smoothing parameter

α = 0.1

.

Algorithm 1 Genetic Algorithm for Transformer Head Pruning

1:: Input: prune rate p, population size N, crossover rate c, mutation operator $Mutate$ , lottery samples $D_{5 %}$ , validation set $D_{v a l}$
2:: Initialize population $P$ with N binary masks satisfying prune rate p
3:: Initialize global head-score vector $G \leftarrow 0$
4:: Initialize empty global set of elite individuals as $E$
5:: repeat
6:: for each individual $m \in P$ do
7:: Prune model using mask m
8:: Train pruned model on $D_{5 %}$
9:: $f (m) \leftarrow$ F1 on $D_{v a l}$
10:: end for
11:: Sort $P$ by $f (m)$
12:: Remove bottom 10% P
13:: Insert top 10% of $E$ into P
14:: Record unique top 10% from P into $E$
15:: $G \leftarrow (1 - α) G + α \sum_{m \in E} (\frac{f (m)}{\sum_{m^{'} \in E} f (m^{'})}) m$
16:: Compute selection probabilities $P (m) = f (m) / \sum_{m^{'} \in P} f (m^{'})$
17:: for each parent pair $(m_{i}, m_{j})$ drawn according to $P (m)$ do
18:: Perform counter-crossover, weighted by G
19:: Apply $Mutate$ (swap 1 kept head with 1 pruned head)
20:: Repair mask to enforce exact prune rate
21:: if $C h i l d_{1}$ and/or $C h i l d_{2}$ do not exist in historical populations then
22:: insert into next generation $P_{n e w}$
23:: end if
24:: end for
25:: Record the best individual in $P$ in set $P e a k$ if it reaches a new peak validation score
26:: until validation F1 plateaus for 5 generations or max generation count exceeded
27:: return best individuals recorded in $P e a k$

In addition, we include our LDA-based topic compression for unsupervised sentiment label generation as Algorithm 2.

Algorithm 2 LDA Topic-Based Compression for Unsupervised Sentiment Labeling

1:

Input: corpus D, emotion set E, 0-shot emotion probabilities

P (e | d)

2:

for each document

d \in D

do

3:

Append emotion words

{e \in E : P (e | d) > 0.5}

to d

4:

end for

5:

Train LDA to obtain

Document-topic matrix $T_{d o c}$
Topic-word matrix $T_{w o r d}$

6:

for each document d do

7:

z_{d} \leftarrow arg max T_{d o c} [d]

[most probable topic]

8:

Extract emotion words and weights from topic

z_{d}

9:

S_{p o s} \leftarrow

sum of weights for positive emotions

10:

S_{n e g} \leftarrow

sum of weights for negative emotions

11:

if

S_{p o s} > S_{n e g}

then

12:

label (d) \leftarrow positive

13:

else

14:

label (d) \leftarrow negative

15:

end if

16:

end for

17:

return sentiment labels for all documents

4. Results and Discussion

We set up our experiments by testing every 10% prune rate (plus a test for 95%). This means the entire genetic algorithm was run multiple times, once per prune rate, with the prune rate held fixed during each search. See Figure 7 for our results. The results show that we exceed random pruning in all cases. We also observe that the genetic algorithm alone can retain F1 with only minor loss up to 50–60% pruning rates.

Figure 7. F1 test set evaluation of our pruning algorithm (GA) at various prune rate intervals. These tests come in 3 phases: the red line is GA without 0-shot data augmentation, the gray line uses augmentation with “unhappy” as the 0-shot label, and the green line uses our custom LDA compression for data augmentation. The dashed line is the original BERT model’s score before any pruning. We also include a random pruning test (pink).

For results with augmentation, see Table 4, which indicates that LDA is among the top compression strategies at the 70% prune rate. With the additional 0-shot augmentation styles, we find that our custom LDA approach can surpass the original model’s F1 up to a 70% prune rate. In addition, with a 1% loss allocation, we can achieve up to a 90% prune rate.

Table 4. Comparison of compression approaches when evaluated on GA at a 70% prune rate (PR).

We compare these results with other head-pruning strategies outlined in the previous works section [18,19,20,21,22]. Table 5 summarizes the lossless prune rate and the prune rate achieved with a 1% loss allocation. The comparison shows that we achieve the best lossless prune rate at 70%, and we are second for a 1% loss allocation at 90% (compared to 92%). Note that we tested prune rates in 10% intervals (plus 95%). We had a loss of 0.684% F1 at 90% pruned, so there is room to slightly increase the prune rate to make full use of the 1% loss allocation, meaning our prune rate is not as finely tuned as the A* search result (92%).

Table 5. Comparison of different approaches on lossless prune rate (and 1% loss). Bolded values represent the best prune rate achieved in each category. Our approach is listed below the horizontal line.

In terms of efficiency, we also recorded the training and inference speed of our pruned models (see Figure 8) using the same 5% training data. A regular full-sized BERT model trained on the full dataset can exceed an hour of training per epoch. We used four V100 NVIDIA GPUs, with times displayed in minutes for both training and inference. This setup allowed us to increase search speed with larger batch sizes, although the pruned models themselves do not require four GPUs. We show that at 70% and 90% pruning, inference time is reduced by 16.5% and 47.2%, respectively (without increasing batch size). Training time is reduced to nearly one minute at higher prune rates due to lottery sample selection. We also include the final prune configurations (70%, 80%, and 90%) in Figure 9, which highlight which heads were removed or kept. Note that we remove an entire encoder layer at 90% pruned.

Figure 8. Inference and training time in minutes on the y-axis, and prune rate on the x-axis.

Figure 9. Final prune configurations discovered using the genetic algorithm. Each row represents an encoder layer, and each cell represents an attention head. Green cells indicate heads that were kept. If no heads are kept in an encoder, the encoder is dropped entirely.

All our results are based on random lottery sample selection, meaning we have no strategy for determining which samples are chosen. Random selection is generally sufficient for selecting a diverse set of samples. However, when the selection rate decreases or when datasets are small, a more robust approach is needed to ensure diversity and avoid selecting low-quality samples. Thus, we evaluate a more robust approach based on embedding selection. We cluster sample embeddings (generated by BERT) using K-means and select the samples closest to the cluster centers, ensuring an even selection across clusters. This selects a diverse set of samples based on BERT’s representation of the data.

Table 6 compares search accuracy using random selection versus embedding-based selection. We show that we can reduce the number of training samples by an order of magnitude, down to 0.5% of the training data, with only a minor loss in F1. Embedding selection also improves the quality of selected samples at smaller selection rates such as 1% and 0.5%.

Table 6. Comparison of sample selection approaches using random and embedding selection. Results shown are GA scores at 70% pruned and do not use data augmentation.

5. Conclusions and Future Work

In conclusion, we presented a general approach to transformer head pruning for classification tasks. We use a genetic algorithm to prune BERT for sentiment classification and take advantage of lottery sample selection, which greatly improves the efficiency of the search process (training a pruned model in about one minute) and enables a genetic-style search. We show that we can find prune variants in BERT up to 70% smaller with lossless F1, and up to 90% smaller with a 1% loss allocation. When compared to other approaches, we demonstrate state-of-the-art results among head-pruning strategies, with a particularly strong lead in lossless head pruning (70% vs. 58%), even compared to approaches that perform distillation after pruning. We also test the limits of lottery sample selection and show that we can improve the search process by an order of magnitude, down to 0.5% of the training data. This is in addition to never training the original full-sized network, making our approach suitable for extremely large networks (such as LLMs) since we operate solely on pruned variants.

Future work opens several directions. One option is to distill the full-sized model onto our prune configurations to further improve lossless accuracy. Another direction is to expand the generalization of our approach to text generation tasks, which will require additional augmentation work. We can also attempt to prune beyond the transformer heads to improve inference speed. Our approach was originally tested on neural network nodes, so it can be adapted to transformer nodes to prune the feed-forward layers, which are among the largest components of the transformer. We can also deepen our investigation into lottery sample selection by improving how training data is selected. Lottery sample selection can be viewed as a lottery pruning approach applied to data instead of model parameters, suggesting that some pruning techniques could be adapted to evaluate samples (for instance, mapping our GA’s DNA to samples instead of heads or nodes). Finally, we can test the approach on LLMs (once generalized to text generation tasks), since the method does not rely on the full-sized network. In that case, improving the search algorithm becomes important, as the search complexity grows even though the original model size does not affect computational requirements.

Author Contributions

Conceptualization: R.B. and R.G.; Methodology: R.B. and R.G.; Formal Analysis and Investigation: R.B. and G.P. (comparison of label compression approaches); Writing—Original Draft Preparation: R.B.; Writing—Review and Editing: R.B. and R.G.; Funding Acquisition: R.G.; Resources: R.G.; Supervision: R.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data is publicly available and linked or referenced within the manuscript.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Bluteau, R.; Gras, R. Lottery Ticket Search on Untrained Models with Applied Lottery Sample Selection. Mach. Learn. Knowl. Extr. 2023, 5, 400–417. [Google Scholar] [CrossRef]
Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Proj. Rep. Stanf. 2009, 1, 2009. [Google Scholar]
Bluteau, R.; Gras, R. Improving Sentiment Classification Using 0-Shot Generated Labels for Custom Transformer Embeddings. Eur. J. Artif. Intell. 2025, 1–13. [Google Scholar] [CrossRef]
Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks (2018). arXiv 2019, arXiv:1803.03635. [Google Scholar]
Morcos, A.S.; Yu, H.; Paganini, M.; Tian, Y. One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. arXiv 2019, arXiv:1906.02773. [Google Scholar] [CrossRef]
Girish, S.; Maiya, S.R.; Gupta, K.; Chen, H.; Davis, L.S.; Shrivastava, A. The lottery ticket hypothesis for object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 762–771. [Google Scholar]
Chen, T.; Frankle, J.; Chang, S.; Liu, S.; Zhang, Y.; Wang, Z.; Carbin, M. The Lottery Ticket Hypothesis for Pre-trained BERT Networks. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 15834–15846. [Google Scholar]
McCarley, J.S.; Chakravarti, R.; Sil, A. Structured Pruning of a BERT-based Question Answering Model. arXiv 2019, arXiv:1910.06360. [Google Scholar] [CrossRef]
Prasanna, S.; Rogers, A.; Rumshisky, A. When BERT Plays the Lottery, All Tickets Are Winning. arXiv 2020, arXiv:2005.00561. [Google Scholar] [CrossRef]
Chen, X.; Cheng, Y.; Wang, S.; Gan, Z.; Wang, Z.; Liu, J. EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets. arXiv 2021, arXiv:2101.00063. [Google Scholar]
Kim, D.; Kim, M.S.; Shim, H.; Lee, J. Your lottery ticket is damaged: Towards all-alive pruning for extremely sparse networks. Inf. Sci. 2023, 634, 608–620. [Google Scholar] [CrossRef]
Liu, Y.; Meng, F.; Lin, Z.; Fu, P.; Cao, Y.; Wang, W.; Zhou, J. Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training. arXiv 2022, arXiv:2204.11218. [Google Scholar] [CrossRef]
Gao, Y.; Colombo, N.; Wang, W. Adapting by Pruning: A Case Study on BERT. arXiv 2021, arXiv:2105.03343. [Google Scholar] [CrossRef]
Michel, P.; Levy, O.; Neubig, G. Are Sixteen Heads Really Better than One? arXiv 2019, arXiv:1905.10650. [Google Scholar] [CrossRef]
Parnami, A.; Singh, R.; Joshi, T. Pruning Attention Heads of Transformer Models Using A* Search: A Novel Approach to Compress Big NLP Architectures. arXiv 2021, arXiv:2110.15225. [Google Scholar] [CrossRef]
Behnke, M.; Heafield, K. Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 2664–2674. [Google Scholar] [CrossRef]
Zhang, Z.; Qi, F.; Liu, Z.; Liu, Q.; Sun, M. Know What You Don’t Need: Single-Shot Meta-Pruning for Attention Heads. arXiv 2020, arXiv:2011.03770. [Google Scholar] [CrossRef]
Li, B.; Wang, Z.; Huang, S.; Bragin, M.A.; Li, J.; Ding, C. Towards Lossless Head Pruning through Automatic Peer Distillation for Language Models. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Macao, China, 19–25 August 2023; pp. 5113–5121. [Google Scholar]
Wang, Z.; Li, F.; Shi, G.; Xie, X.; Wang, F. Network pruning using sparse learning and genetic algorithm. Neurocomputing 2020, 404, 247–256. [Google Scholar] [CrossRef]
Mantzaris, D.; Anastassopoulos, G.; Adamopoulos, A. Genetic algorithm pruning of probabilistic neural networks in medical disease estimation. Neural Netw. 2011, 24, 831–835. [Google Scholar] [CrossRef] [PubMed]
Hancock, P.J. Pruning neural nets by genetic algorithm. In Artificial Neural Networks; Elsevier: Amsterdam, The Netherlands, 1992; pp. 991–994. [Google Scholar]
Yang, C.; An, Z.; Li, C.; Diao, B.; Xu, Y. Multi-objective pruning for cnns using genetic algorithm. In Proceedings of the International Conference on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 299–305. [Google Scholar]
Adams, C.J.; Borkan, D.; Sorensen, J.; Dixon, L.; Vasserman, L.; Thain, N. Jigsaw Unintended Bias in Toxicity Classification. Kaggle. 2019. Available online: https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification (accessed on 1 December 2025).

Figure 1. Overview of our custom LDA compression strategy to create sentiment labels from topic clustering and zero-shot emotions.

Figure 2. Example of making a prediction using our LDA classifier. Highlighted for clarity, in red are examples of negative emotions and blue examples of positive emotions.

Figure 3. Flowchart of the sentiment and toxicity dataset preparation. This process creates artificial labels for both datasets using 0-shot generated emotion labels followed by compression to sentiment labels. Note that no label leakage occurs since the labels are all 0-shot emotion predictions from the same model (BERT) without knowledge of the original dataset labels. This is why we can run the same process on the toxicity dataset which has no sentiment labels. Blue states are datasets, purple states are processes.

Figure 4. Flowchart of our genetic algorithm implementation. The approach consumes just a few samples for training and evaluates on the validation dataset to determine fitness. The test set is only used once a top-performing model is found.

Figure 5. Flowchart of our evaluation of pruned models and how our datasets are used. The genetic algorithm extracts top-performing individuals, which are then evaluated and trained on additional labels from our 0-shot generation process. Blue states are datasets, purple states are processes.

Figure 6. We compare through illustration the efficiency of various pruning styles. Red highlights our illustration of compute required as a percentage of the square, vertical change being model efficiency and horizontal being data efficiency. On the top is masked pruning, which does not remove any parameters (in fact, it requires more computation) and uses the entire dataset. In the middle is the typical head-pruning strategy, which removes parameters but still uses the entire dataset. To enable genetic search (bottom), we use lottery sample selection, which reduces both parameters and training data during search.

Figure 7. F1 test set evaluation of our pruning algorithm (GA) at various prune rate intervals. These tests come in 3 phases: the red line is GA without 0-shot data augmentation, the gray line uses augmentation with “unhappy” as the 0-shot label, and the green line uses our custom LDA compression for data augmentation. The dashed line is the original BERT model’s score before any pruning. We also include a random pruning test (pink).

Figure 8. Inference and training time in minutes on the y-axis, and prune rate on the x-axis.

Figure 9. Final prune configurations discovered using the genetic algorithm. Each row represents an encoder layer, and each cell represents an attention head. Green cells indicate heads that were kept. If no heads are kept in an encoder, the encoder is dropped entirely.

Table 1. On the left are the positive sentiment auxiliary emotion labels. On the right are the negative labels.

Positive		Negative
happy	buoyant	unhappy	heart-sick
joyous	cheerful	infelicitous	mournful
light-hearted	gratified	unfortunate	lugubrious
gay	joyful	miserable	deplorable
blissful	beatific	wretched	calamitous
glad	rapturous	sad	distressing
delighted	felicitous	sorrowful	grievous
elated	blithe	disconsolate	inauspicious
merry	halcyon	woeful	unpropitious
ecstatic	charmed	distressed	malign

Table 2. Comparison of approaches on sentiment compression. We used the same train/valid/test split as the lottery search to tune the algorithms for AE, VAE, and LDA and listed hyperparameters. The remaining approaches are 0-shot scores.

	Test Score	Epochs Passes	BS Topics	LR Update	Optimizer	Hidden Dimensions
Direct	76.69%
SoftMax	77.15%
Unhappy	78.13%
AE	78.74%	30	32	0.002	rmsprop	[256, 128, 64]
VAE	78.75%	20	32	0.001	adam	[16, 4]
PCA	78.69%
LDA-100	78.45%	2	256	2
LDA-5	78.07%	5	128	1

Table 3. Genetic algorithm hyperparameters.

Hyperparameter	Value/Description
Population size	40–70
Encoding	144-bit mask for heads (BERT-base: 12 layers x 12 heads)
Elite fraction	10%
Discard fraction	10%
Training per individual	Tested 5% and 0.5% of training set
Fitness	Validation F1 (15%)
Crossover rate	50%
Crossover type	Counter-crossover, prune-rate preserving
Mutation	Swap one kept head with one pruned head
Global state update	EMA with $α = 0.1$
Termination	5-generation F1 plateau or hard-set limit
Generations	Typically 40–80 (up to 100 at high prune rates)

Table 4. Comparison of compression approaches when evaluated on GA at a 70% prune rate (PR).

	Orig.	GA No Aug.	Unhappy	PCA	AE	VAE	LDA
PR	0%	70%	70%	70%	70%	70%	70%
F1 Test	0.86963	0.86703	0.86910	0.86919	0.86964	0.86888	0.87026
% Diff	0	−0.299	−0.061	−0.05	0.001	−0.086	0.073

Table 5. Comparison of different approaches on lossless prune rate (and 1% loss). Bolded values represent the best prune rate achieved in each category. Our approach is listed below the horizontal line.

Approach	Lossless PR	Loss PR (1%)	Task
Change in Loss [18]	40%	60%	Entailment
Iterative [20]	50%	72%	Translation
A* Search [19]	41%	92%	Sentiment/Reviews
KL Divergence [21]	N/A	50%	GLUE
Loss + Distil [22]	58%	N/A	GLUE
GA + 0-shot Augmentation	70%	90%	Sentiment

Table 6. Comparison of sample selection approaches using random and embedding selection. Results shown are GA scores at 70% pruned and do not use data augmentation.

Selection	Percentage Selected	# of Samples Selected	70% Pruned F1	% Diff to Rand
Random	5%	56,000	0.86703	N/A
Embedding	5%	56,000	0.86695	−0.009
Random	1%	11,200	0.86576	N/A
Embedding	1%	11,200	0.86662	+0.099
Random	0.5%	5600	0.86508	N/A
Embedding	0.5%	5600	0.86594	+0.099

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Genetic-Based Lottery Ticket Pruning for Transformers in Sentiment Classification: Realized Through Lottery Sample Selection

Abstract

1. Introduction

2. Previous Works

3. Methodology

3.1. Datasets

Data Augmentation

3.2. Genetic Lottery Transformer Head Pruning

Genetic Algorithm

4. Results and Discussion

5. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics