# Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data


## Abstract


## 1. Introduction

- RQ-1: Does a large pretrained language model, in this case, RoBERTa [13], achieve good long-tail class prediction performance (Section 5.1)?
- RQ-2: Can we extend language models such that a small language model can retain accurate long-tail information, with overall training that is computationally cheaper than fine-tuning RoBERTa?
- RQ-3: What are the long-tail prediction performance benefits of small CLMs that unify self-supervised and supervised contrastive learning?

#### Contributions

## 2. Related Work

#### 2.1. Long-Tail Compression

#### 2.2. Contrastive Learning Benefits

#### 2.3. Long-Tail Learning

#### 2.4. Negative and Positive Generation

Most prior work focuses on generating **negative samples**, since most contrastive learning objectives use only a single positive sample and b (bad) negative samples; Musgrave et al. [20] give an excellent overview. However, if too many negative samples are generated, they can collide with positive samples, which degrades learning performance [22]. More recent computer vision works such as Khosla et al. [23] and Ostendorff et al. [24] propose generating multiple **positive** samples to boost supervised contrastive learning performance, while Wang and Isola [25] show that, when generating positive samples, the representations of positives should be close (related) to each other. Our method builds on these insights and extends them to self-supervised contrastive learning and to the language model domain via a straightforward extension of NCE. Instead of using only one positive example, as in standard NCE by Mnih and Teh [26], our method uses g good (positive) samples (see Section 3). To ensure that positive samples are representationally close (related) during self-supervised contrastive pretraining, we use words from the current input text as positive 'pseudo-labels', i.e., we draw self-supervision pseudo-labels from a related context. Negative pseudo-labels (words) are drawn from other in-batch text inputs, where negative sample words are not allowed to appear in the current text, to avoid the above-mentioned collision of positive and negative samples.
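The sampling scheme described above can be sketched in a few lines. The helper below is illustrative (its name and the whitespace-tokenized batch format are our assumptions, not the paper's code): positives are drawn from the current text itself, negatives from other in-batch texts, with current-text words filtered out to prevent positive/negative collisions.

```python
import random

def sample_pseudo_labels(batch_texts, i, g=3, b=5, seed=0):
    """Draw g positive and b negative pseudo-label words for text i.

    Positives come from the current text; negatives are words from other
    in-batch texts that do NOT occur in the current text, which avoids
    positive/negative collisions. Hypothetical sketch, not the paper's code.
    """
    rng = random.Random(seed)
    current = batch_texts[i].split()
    positives = rng.sample(current, min(g, len(current)))
    # candidate negatives: words of other in-batch texts, minus current words
    others = {w for j, t in enumerate(batch_texts) if j != i for w in t.split()}
    candidates = sorted(others - set(current))
    negatives = rng.sample(candidates, min(b, len(candidates)))
    return positives, negatives

batch = ["measuring variable interaction effects",
         "median test statistics for robust estimation"]
pos, neg = sample_pseudo_labels(batch, i=0, g=2, b=3)
assert all(w in batch[0].split() for w in pos)
assert all(w not in batch[0].split() for w in neg)
```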

#### 2.5. Data and Parameter Efficiency

**Our method is designed to increase the self-supervision signal, i.e., by sampling more positives and negatives, to compensate for a lack of large pretraining data (signal), since rare and long-tailed data is by definition limited.** Our goal is to skip compression and still train small models capable of long-tail prediction. Notably, CLESS pretraining does not require special learning rate schedules, residuals, normalization, warm-ups, or a modified optimizer, as many BERT variants do [13,31,32].

#### 2.6. Label Denoising

## 3. CLESS: Unified Contrastive Self-supervised to Supervised Training and Inference

## 4. Data: Resource Constrained, Long-Tail, Multi-Label, Tag Prediction

- Long-tail evaluation metrics and challenges:

## 5. Results

#### 5.1. (RQ-1+2): Long-Tail Capture of RoBERTa vs. CLESS

#### 5.1.1. RoBERTa: A Large Pretrained Model Does Not Guarantee Long-Tail Capture

#### 5.1.2. CLESS: Contrastive Pretraining Removes the Need for Model Compression

#### 5.1.3. Practical Computational Efficiency of Contrastive Language Modeling

#### 5.2. (RQ-3.1-2): Contrastive Zero-Shot Long-Tail Learning

#### 5.2.1. (RQ-3.1): More Self-supervision and Model Size Improve Zero-Shot Long-Tail Capture

#### 5.2.2. (RQ-3.2): Contrastive Pretraining Leads to Data-Efficient Zero-Shot Long-Tail Learning

#### 5.3. (RQ-3.3): Few-Shot Long-Tail Learning

## 6. Conclusion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Text preprocessing details:** We decompose tags such as 'p-value' into 'p' and 'value' and split LaTeX equations into command words, as they would otherwise create many long, unique tokens. In the future, character encodings may be better suited to this specific dataset, but that is out of our current research scope. Word embeddings are pretrained via fastText on the training corpus text. Ten tag words are not in the input vocabulary, so we randomly initialize their embeddings. Though we never explicitly used this information, we parsed the text and title and annotated them with HTML-like title, paragraph, and sentence delimiters, i.e., </title>, </p>, and </s>.
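A minimal sketch of such preprocessing. The exact rules we used differ in detail; the regexes below are illustrative assumptions, showing only the hyphen decomposition and the splitting of LaTeX into command words:

```python
import re

def preprocess(text):
    """Illustrative sketch: decompose hyphenated tags like 'p-value' and
    break LaTeX markup into its command words."""
    text = text.replace("-", " ")                    # 'p-value' -> 'p value'
    text = re.sub(r"\\([a-zA-Z]+)", r" \1 ", text)   # '\frac' -> ' frac '
    text = re.sub(r"[{}$^_]", " ", text)             # drop LaTeX punctuation
    return text.split()

assert preprocess(r"p-value of $\frac{a}{b}$") == \
    ["p", "value", "of", "frac", "a", "b"]
```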

## Appendix B

**Table A1.** Time complexity O(Layer), data-efficiency, the number of trainable parameters, and the number of all parameters. The data-efficiency of convolutions (*) is reported in various works to be superior to that of self-attention models [28,30,52,53,54,55,56]. Here, d is the input embedding size, whose increase slows down convolutions, while n is the input sequence length, which slows down self-attention the most [57]. Optimizations exist for both problems.

| Layer Type | $O(\mathrm{Layer})$ | Literature-Reported Data Requirements | Trainable Parameters |
|---|---|---|---|
| Convolution | $O(n \cdot d^{2})$ | small (*) | 8M–10M (CLESS) |
| Self-Attention | $O(n^{2} \cdot d)$ | large to web-scale (*) | 125M (RoBERTa) |

**Time complexity:** Our text encoder uses a single 1D CNN encoder layer, which has a complexity of $O(n \cdot k \cdot d \cdot f)$ vs. $O(n^{2} \cdot d)$ for vanilla self-attention, as outlined in Vaswani et al. [57]. Here, n is the input sequence length, k is the convolution filter size, d is the input embedding dimension ($d=512$ in [57] vs. $d=100$ for us), and f is the number of convolution filters (at most $f=3 \cdot 100$ for our (3.XL) pretraining model). Since we use kernel sizes $\{1,2,3\}$, the largest configuration (3.XL) gives $O(n \cdot k \cdot d \cdot f)$ with $k=6$ (summed kernel sizes) and $f=3d$, i.e., roughly $O(n \cdot 3d^{2})$, vs. $O(n^{2} \cdot 5d)$ in a vanilla (2017) self-attention setup where $d=512$. Furthermore, Transformer self-attention runs an n-way softmax computation at every layer (e.g., 16 layers), while we run $g \cdot b$ single-class predictions at the final output layer using a noise contrastive objective (NCE). We use NCE to undersample both true negative learning labels (label = 0) and positive and negative pseudo-labels (input words). If the goal is to learn a specific supervised end-task, more informed sampling of positive and negative pseudo-labels can be devised. However, we did not want to overfit the supervised task by adding such hand-crafted human biases. Instead, we use random sampling to pretrain a model for arbitrary downstream tasks (generalization), which follows a logic similar to random masking in masked language modeling.
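As a rough numerical illustration of these cost terms (a back-of-envelope sketch, not a FLOP-accurate benchmark), one can plug the stated settings into the two asymptotic expressions:

```python
def conv_cost(n, d, kernel_sizes, filters_per_size):
    # O(n * k * d * f), summed over the kernel sizes of a 1D convolution layer
    return sum(n * k * d * f for k, f in zip(kernel_sizes, filters_per_size))

def attention_cost(n, d):
    # O(n^2 * d): pairwise token interactions in vanilla self-attention
    return n * n * d

# CLESS-like setting: d = 100, kernels {1,2,3} with 100 filters each
cless = conv_cost(n=512, d=100, kernel_sizes=[1, 2, 3], filters_per_size=[100] * 3)
# vanilla (2017) Transformer setting: d = 512
vanilla = attention_cost(n=512, d=512)
assert cless < vanilla  # the conv layer is cheaper at this configuration
```

The sequence length n = 512 here is an illustrative assumption; the gap widens as n grows, since only the attention term is quadratic in n.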

**Transfer complexity:** Traditional transfer NLP approaches like RoBERTa [13] need to initialize a new classification head per task, which requires either training a new model per task or a joint multi-task learning setup. CLESS, however, can train on multiple tasks, even if they arrive sequentially over time, while reusing the same classifier head from prior pretraining or fine-tuning. Thus, there is no need to retrain a separate model each time, as in current Transformer transfer models. Once pretrained, a CLESS model can zero-shot transfer to any new task, since the match classifier is reused.

## Appendix C

**Data, sample, and memory efficiency:** We analyzed input data and label efficiency in the zero- and few-shot learning sections of the main document. Regarding data-efficiency and model design choices, we were guided by existing research and optimized for data-efficient learning with inherent self-supervised zero-shot capabilities, in order to facilitate and study supervision-free generalization to unforeseen tasks. We explain the origins of these design choices in more detail below. As mentioned in the related work section, Transformers rely on large to Web-scale pretraining data collections ('end-task external pretraining data') [52,53], which results in extensive pretraining hardware requirements [58,59], concerns about environmental costs [56,60], and unintended contra-minority biases [56,61,62]. CNNs have been found in several works to be more data-efficient than Transformers, i.e., to train to better performance with less data. For example, in OpenAI's CLIP model (see Figure 2 in [30]), the authors find that replacing a Transformer language model backbone with a CNN backbone increased zero-shot data-efficiency 3-fold, which they further increased by adding a supervised contrastive learning objective. [38] showed that adding a CNN component to a vision Transformer model helps with data and computational efficiency (see Figure 5 and the accompanying text in [38]). When comparing the small-scale pretraining capabilities of [54] (CNN, LSTM) with recent Transformer models such as Wang et al. [55], one can see that Transformer encoders struggle to learn from small pretraining collections. They also struggle when fine-tuning on smaller supervised collections [12,32,59]. For CLESS, tuning the embedding layer made little difference to end-task performance when training starts from pretrained fastText word embeddings. Thus, tuning of the embedding layer can be turned off to reduce gradient computation and memory. For example, when not tuning embeddings, the CLESS 10M model has only 3.2M trainable parameters.
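The arithmetic behind this can be made explicit. The vocabulary size below is a hypothetical value chosen only so that the totals line up with the reported 10M/3.2M figures:

```python
# Back-of-envelope parameter count: freezing embeddings shrinks the
# trainable set. The vocab size is an illustrative assumption, not reported.
vocab, d = 68_000, 100
emb_params = vocab * d                    # 6.8M (frozen) embedding parameters
encoder_params = 3_200_000                # ~3.2M encoder + matcher parameters (reported)
total_params = emb_params + encoder_params  # ~10M total ("CLESS 10M")
trainable_when_frozen = encoder_params      # embeddings excluded from gradients
```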

**Table A2.**

**Explored parameters.** We conducted a random grid search over the following hyperparameters, optimizing important parameters first to limit the number of trials. We also pre-fit the filter size, lr, and filter counts on a 5k-sample training subset to further reduce trials. Then, to reduce the number of trials further still, we tuned in the following order: learning rate $lr$, filter sizes f, max-k pooling, tuning embeddings, batch size $bs$, and finally the depth of the matching-classifier MLP. This gave us a baseline model, (2) CLESS 8M, which does not use pretraining, to save trials and compute costs, but could be built up into the self-supervised pretraining models (3) and (3.XL) by increasing self-supervision and model size. Fortunately, RoBERTa has established default parameters reported both in its code documentation (https://github.com/pytorch/fairseq/tree/master/examples/roberta) (accessed on 30 September 2021) and in the https://simpletransformers.ai (accessed on 30 September 2021) version; we varied batch size, warmup, and learning rate around the defaults from these sources. Below we give the search parameters for CLESS. For CLESS 8M (2,3) the best params are in italics, and for CLESS 10M (3.XL) the best params are in **bold**.

| Hyperparameter | Searched Values |
|---|---|
| Filter size: num filters | {1: 57, 2: 29, 3: 14}, {1: 100, 2: 100, 3: 100}, {1: 285, 2: 145, 3: 70}, {1: 10, 2: 10, 3: 10}, {1: 15, 2: 10, 3: 5}, {1: 10}, {1: 100}, {10: 100} |
| lr | 0.01, 0.0075, 0.005, 0.001, 0.0005, 0.0001 |
| bs (match size) | 1024, 1536, 4096 |
| max-k | 1, 3, 7, 10 |
| match-classifier | two_layer_classifier, 'conf': [{'do': None\|.2, 'out_dim': 2048\|4196\|1024}, {'do': None\|0.2}]; one_layer_classifier, 'conf': [{'do': .2}] |
| label encoder | one_layer_label_enc, 'conf': [{'do': None\|.2, 'out_dim': 100}]; one_layer_label_enc, 'conf': [{'do': .2, 'out_dim': 300}] |
| seq encoder | one_layer_label_enc, 'conf': [{'do': None\|.2, 'out_dim': 100}]; one_layer_label_enc, 'conf': [{'do': .2, 'out_dim': 300}] |
| tune embedding | True, False |
| #real label samples | 20, 150, 500 (g positives (as annotated in the dataset), b random negative labels; 20 works well too) |
| #pseudo label samples | 20, 150, 500 (g positive input words, b negative input words); used for self-supervised pretraining |
| optimizer | ADAM, default params except lr |
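The random grid search described above amounts to repeatedly sampling one value per hyperparameter. A minimal sketch over a subset of the grid (values taken from Table A2; the function names are illustrative):

```python
import random

# A subset of the search grid from Table A2
grid = {
    "lr": [0.01, 0.0075, 0.005, 0.001, 0.0005, 0.0001],
    "bs": [1024, 1536, 4096],
    "max_k": [1, 3, 7, 10],
    "tune_embedding": [True, False],
}

def sample_config(rng):
    # draw one value per hyperparameter, independently and uniformly
    return {name: rng.choice(values) for name, values in grid.items()}

rng = random.Random(42)
trials = [sample_config(rng) for _ in range(5)]
assert all(t["lr"] in grid["lr"] for t in trials)
```

Each sampled configuration would then be trained and scored on the dev set; fixing already-tuned parameters, as described above, shrinks the effective grid between rounds.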

**Parameter tuning + optima (2)–(3.XL):** We provide detailed parameter configurations as Python dictionaries for reproducibility in the `/confs` folder of the code repository. Table A2 shows the hyperparameters explored for CLESS; the optimal CLESS 3.XL parameters are marked in bold. The baseline CLESS configuration (2) hyperparameters were found as explained in the table, using the non-pretraining CLESS 8M (2) model; its best parameters are in italics. We found these models by exploring hyperparameters that have been demonstrated to increase generalization and performance in [63,64]. To find optimal hyperparameter configurations for the baseline model (2), we ran a random grid search over the hyperparameter values in Table A2. For the baseline CLESS 8M model (2), without pretraining, we found the optimal hyperparameters to be: $lr=0.001$ ($lr=0.0005$ works too), filter sizes and numbers $\{1: 100, 2: 100, 3: 100\}$, match-classifier = two_layer_classifier with 'conf': [{'do': None|.2, 'out_dim': 2048|4196|1024}], max-k pooling = 7, $bs=1536$, etc.; see Table A2. Increasing the filter size, the classifier size or depth, or using a larger k in k-max pooling decreased dev set performance of the non-pretrained model (i.e., CLESS 8M) due to increased overfitting. The largest pretrained CLESS 10M (3.XL) model was able to use more: max-k = 10 and larger label and text sequence encoders (one_layer_label_enc, 'conf': [{'do': .2, 'out_dim': 300}]), while the batch size shrinks to 1024 due to the increased memory requirements of label matching. Note that the label and text encoders have the same output dimension in all settings, so text and label embeddings remain in the same representation space ${\mathbb{R}}^{300}$. The label encoder averages word embeddings (average pooling), while the text encoder uses a CNN with filters as in Table A2. The model receives text word IDs and label-word IDs, which are fed to the text encoder and label encoder, respectively. These encoders are sub-networks configured via dictionaries to have fully connected layers and dropout, with optimal configurations as in the table. As the match-classifier, which learns to contrast the (text embedding, label embedding) pairs, we use a two-layer MLP that learns a similarity (match) function over text-embedding-to-label-embedding combinations (concatenations).
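A toy sketch of this matcher (illustrative shapes and random weights, not the trained model): the label encoder average-pools label-word embeddings, each label embedding is concatenated with the text embedding, and a small MLP emits one binary match logit per pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy encoder output size (300 in the paper)

def encode_label(label_word_vecs):
    # label encoder: average pooling over the label's word embeddings
    return label_word_vecs.mean(axis=0)

def match_logits(text_emb, label_embs, W1, W2):
    # matcher: a small MLP scoring each concatenated <text, label> pair
    logits = []
    for lab in label_embs:
        pair = np.concatenate([text_emb, lab])  # <text emb, label emb> pair
        hidden = np.maximum(0.0, W1 @ pair)     # fully connected + ReLU
        logits.append(float(W2 @ hidden))       # one binary match logit
    return logits

text_emb = rng.normal(size=d)  # stand-in for the CNN text encoding
label_embs = [encode_label(rng.normal(size=(3, d))) for _ in range(4)]
W1 = rng.normal(size=(16, 2 * d))
W2 = rng.normal(size=16)
logits = match_logits(text_emb, label_embs, W1, W2)  # one logit per label
```

In training, each logit would be pushed towards 1 for positive (pseudo) labels and 0 for negatives under the binary NCE loss.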

## References

1. Hooker, S.; Courville, A.; Clark, G.; Dauphin, Y.; Frome, A. What Do Compressed Deep Neural Networks Forget? arXiv 2020, arXiv:1911.05248.
2. Hooker, S. Moving beyond "algorithmic bias is a data problem". Patterns 2021, 2, 100241.
3. Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; Yu, S.X. Large-Scale Long-Tailed Recognition in an Open World. In Proceedings of the IEEE CVPR, Long Beach, CA, USA, 16–20 June 2019; pp. 2537–2546.
4. D'souza, D.; Nussbaum, Z.; Agarwal, C.; Hooker, S. A Tale of Two Long Tails. arXiv 2021, arXiv:2107.13098.
5. Hu, N.T.; Hu, X.; Liu, R.; Hooker, S.; Yosinski, J. When does loss-based prioritization fail? arXiv 2021, arXiv:2107.07741.
6. Hooker, S.; Moorosi, N.; Clark, G.; Bengio, S.; Denton, E. Characterizing and Mitigating Bias in Compact Models. In Proceedings of the 5th ICML Workshop on Human Interpretability in Machine Learning (WHI), Virtual Conference, 17 July 2020.
7. Zhuang, D.; Zhang, X.; Song, S.L.; Hooker, S. Randomness In Neural Network Training: Characterizing The Impact of Tooling. arXiv 2021, arXiv:2106.11872.
8. Chang, W.C.; Yu, H.F.; Zhong, K.; Yang, Y.; Dhillon, I. X-BERT: eXtreme Multi-label Text Classification using Bidirectional Encoder Representations from Transformers. arXiv 2019, arXiv:1905.02331.
9. Joseph, V.; Siddiqui, S.A.; Bhaskara, A.; Gopalakrishnan, G.; Muralidharan, S.; Garland, M.; Ahmed, S.; Dengel, A. Reliable model compression via label-preservation-aware loss functions. arXiv 2020, arXiv:2012.01604.
10. Blakeney, C.; Huish, N.; Yan, Y.; Zong, Z. Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation. arXiv 2021, arXiv:2106.07849.
11. Jiang, Z.; Chen, T.; Mortazavi, B.J.; Wang, Z. Self-Damaging Contrastive Learning. Proc. Mach. Learn. Res. PMLR 2021, 139, 4927–4939.
12. Rogers, A.; Kovaleva, O.; Rumshisky, A. A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput. Linguist. 2020, 8, 842–866.
13. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
14. Liu, J.; Chang, W.; Wu, Y.; Yang, Y. Deep Learning for Extreme Multi-label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 7–11 August 2017; pp. 115–124.
15. Pappas, N.; Henderson, J. GILE: A Generalized Input-Label Embedding for Text Classification. Trans. Assoc. Comput. Linguist. 2019, 7, 139–155.
16. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
17. Graf, F.; Hofer, C.; Niethammer, M.; Kwitt, R. Dissecting Supervised Contrastive Learning. Proc. Mach. Learn. Res. PMLR 2021, 139, 3821–3830.
18. Zimmermann, R.S.; Sharma, Y.; Schneider, S.; Bethge, M.; Brendel, W. Contrastive Learning Inverts the Data Generating Process. Proc. Mach. Learn. Res. PMLR 2021, 139, 12979–12990.
19. Zhang, H.; Xiao, L.; Chen, W.; Wang, Y.; Jin, Y. Multi-Task Label Embedding for Text Classification. In Proceedings of EMNLP 2018, Brussels, Belgium, 2–4 November 2018; pp. 4545–4553.
20. Musgrave, K.; Belongie, S.J.; Lim, S. A Metric Learning Reality Check. In Proceedings of Computer Vision (ECCV 2020), 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 681–699.
21. Rethmeier, N.; Augenstein, I. A Primer on Contrastive Pretraining in Language Processing: Methods, Lessons Learned and Perspectives. arXiv 2021, arXiv:2102.12982.
22. Saunshi, N.; Plevrakis, O.; Arora, S.; Khodak, M.; Khandeparkar, H. A Theoretical Analysis of Contrastive Unsupervised Representation Learning. Proc. Mach. Learn. Res. PMLR 2019, 97, 5628–5637.
23. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 18661–18673.
24. Ostendorff, M.; Rethmeier, N.; Augenstein, I.; Gipp, B.; Rehm, G. Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings. arXiv 2022, arXiv:2202.06671.
25. Wang, T.; Isola, P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. Proc. Mach. Learn. Res. PMLR 2020, 119, 9929–9939.
26. Mnih, A.; Teh, Y.W. A Fast and Simple Algorithm for Training Neural Probabilistic Language Models. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), Edinburgh, UK, 26 June–1 July 2012; Omnipress: Madison, WI, USA; pp. 419–426.
27. Şerbetci, O.N.; Möller, S.; Roller, R.; Rethmeier, N. EffiCare: Better Prognostic Models via Resource-Efficient Health Embeddings. AMIA Annu. Symp. Proc. 2020, 2020, 1060–1069.
28. Kim, K.M.; Hyeon, B.; Kim, Y.; Park, J.H.; Lee, S. Multi-pretraining for Large-scale Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2041–2050.
29. Tay, Y.; Dehghani, M.; Gupta, J.; Bahri, D.; Aribandi, V.; Qin, Z.; Metzler, D. Are Pre-trained Convolutions Better than Pre-trained Transformers? arXiv 2021, arXiv:2105.03322.
30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. Proc. Mach. Learn. Res. PMLR 2021, 139, 8748–8763.
31. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
32. Liu, L.; Liu, X.; Gao, J.; Chen, W.; Han, J. Understanding the Difficulty of Training Transformers. arXiv 2020, arXiv:2004.08249.
33. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
34. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv 2021, arXiv:2104.08821.
35. Ma, Z.; Collins, M. Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency. In Proceedings of EMNLP 2018, Brussels, Belgium, 31 October–4 November 2018; pp. 3698–3707.
36. Jiang, H.; Wang, R.; Shan, S.; Chen, X. Transferable Contrastive Network for Generalized Zero-Shot Learning. In Proceedings of ICCV 2019, Seoul, Korea, 27 October–2 November 2019; pp. 9764–9773.
37. Simoncelli, E.P.; Olshausen, B.A. Natural image statistics and neural representation. Annu. Rev. Neurosci. 2001, 24, 1193–1216.
38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
39. Lukasik, M.; Bhojanapalli, S.; Menon, A.; Kumar, S. Does label smoothing mitigate label noise? Proc. Mach. Learn. Res. PMLR 2020, 119, 6448–6458.
40. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of ICML 2006, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240.
41. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Berlin/Heidelberg, Germany, 2018.
42. Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of ACL 2020, Virtual Conference, 6–8 July 2020; pp. 8342–8360.
43. Poerner, N.; Waltinger, U.; Schütze, H. Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA. In Findings of the Association for Computational Linguistics: EMNLP 2020; pp. 1482–1490.
44. Tai, W.; Kung, H.T.; Dong, X.; Comiter, M.; Kuo, C.F. exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources. In Findings of the Association for Computational Linguistics: EMNLP 2020; pp. 1433–1439.
45. Rethmeier, N.; Plank, B. MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP@ACL 2019), Florence, Italy, 2 August 2019; pp. 49–54.
46. Augenstein, I.; Ruder, S.; Søgaard, A. Multi-Task Learning of Pairwise Sequence Classification Tasks over Disparate Label Spaces. In Proceedings of NAACL-HLT 2018, New Orleans, LA, USA, 1–6 June 2018; pp. 1896–1906.
47. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
48. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In Proceedings of ICLR 2019, New Orleans, LA, USA, 6–9 May 2019.
49. Rethmeier, N.; Saxena, V.K.; Augenstein, I. TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP. In Proceedings of UAI 2020, Toronto, ON, Canada, 3–6 August 2020.
50. Augenstein, I.; Lioma, C.; Wang, D.; Chaves Lima, L.; Hansen, C.; Hansen, C.; Simonsen, J.G. MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims. In Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 4685–4697.
51. Ma, Z.; Sun, A.; Yuan, Q.; Cong, G. Tagging Your Tweets: A Probabilistic Modeling of Hashtag Annotation in Twitter. In Proceedings of CIKM '14, Shanghai, China, 3–7 November 2014; pp. 999–1008.
52. Liu, Q.; Kusner, M.J.; Blunsom, P. A Survey on Contextual Embeddings. arXiv 2020, arXiv:2003.07278.
53. Yogatama, D.; de Masson d'Autume, C.; Connor, J.; Kociský, T.; Chrzanowski, M.; Kong, L.; Lazaridou, A.; Ling, W.; Yu, L.; Dyer, C.; et al. Learning and Evaluating General Linguistic Intelligence. arXiv 2019, arXiv:1901.11373.
54. Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. arXiv 2017, arXiv:1609.07843.
55. Wang, C.; Ye, Z.; Zhang, A.; Zhang, Z.; Smola, A.J. Transformer on a Diet. arXiv 2020, arXiv:2002.06170.
56. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of FAccT '21, Virtual Conference, 3–10 March 2021; pp. 610–623.
57. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates Inc.: Red Hook, NY, USA, 2017.
58. Hooker, S. The Hardware Lottery. Commun. ACM 2020, 64, 58–65.
59. Dodge, J.; Ilharco, G.; Schwartz, R.; Farhadi, A.; Hajishirzi, H.; Smith, N.A. Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. arXiv 2020, arXiv:2002.06305.
60. Strubell, E.; Ganesh, A.; McCallum, A. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of ACL 2019, Florence, Italy, 28 July–2 August 2019; pp. 3645–3650.
61. Mitchell, M.; Baker, D.; Moorosi, N.; Denton, E.; Hutchinson, B.; Hanna, A.; Gebru, T.; Morgenstern, J. Diversity and Inclusion Metrics in Subset Selection. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–9 February 2020; pp. 117–123.
62. Waseem, Z.; Lulz, S.; Bingel, J.; Augenstein, I. Disembodied Machine Learning: On the Illusion of Objectivity in NLP. arXiv 2021, arXiv:2101.11974.
63. Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; Bengio, S. Fantastic Generalization Measures and Where to Find Them. arXiv 2019, arXiv:1912.02178.
64. He, F.; Liu, T.; Tao, D. Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence. In NeurIPS 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019.

**Figure 1.** Contrastive <text, pseudo/real label> embedding pair matcher model: A word embedding layer E➀ embeds text and real/pseudo labels, where labels are word IDs. CLESS embeds a text ('measuring variable interaction'), real positive (R) or negative (p-value) labels, and positive (variable) or negative (median) pseudo labels. A sequence encoder T➁ embeds a single text, while a label encoder L➂ embeds c labels. Each text has multiple (pseudo) labels, so the text encoding ${\mathbf{t}}_{i}$ is repeated for, and concatenated with, each label encoding ${\mathbf{l}}_{i,l}^{\circ}$. The resulting batch of <text embedding, label embedding> pairs $[[{\mathbf{t}}_{i},{\mathbf{l}}_{i,1}^{\circ}],\cdots ,[{\mathbf{t}}_{i},{\mathbf{l}}_{i,c}^{\circ}]]$ ➃ is fed into a 'matcher' ➄ that is trained in ➅ with a binary noise contrastive estimation loss ${L}_{B}$ [35] over multiple label (mis-)matches $\{0,1\}$ per text instance ${\mathbf{t}}_{i}$. Unlike older works, we add contrastive self-supervision over pseudo labels as a pretraining mechanism. Here, the word 'variable' is a positive self-supervision (pseudo) label for a text instance ${\mathbf{t}}_{i}$, while words from other in-batch texts, e.g., 'median', provide negative pseudo labels.

**Figure 2.** **Head to long-tail as 5 balanced class bins:** We bin classes by label frequency. Each bin contains equally many active label occurrences. Classes within a bin are imbalanced and become few-shot or zero-shot towards the tail, especially after train/dev/test splitting. Class frequencies are given in log scale; task data details are in Section 4.
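The binning described above can be sketched as follows (a sketch assuming greedy frequency-ordered assignment; the paper's exact procedure may differ): classes are sorted by frequency and grouped so each bin carries roughly the same total number of label occurrences, which naturally puts many rare classes into the tail bin.

```python
def bin_classes(label_counts, num_bins=5):
    """Split classes into bins of (roughly) equal total label occurrences,
    ordered from head (frequent) to tail (rare)."""
    ordered = sorted(label_counts.items(), key=lambda kv: -kv[1])
    target = sum(label_counts.values()) / num_bins
    bins, current, mass = [], [], 0
    for cls, count in ordered:
        current.append(cls)
        mass += count
        if mass >= target and len(bins) < num_bins - 1:
            bins.append(current)      # bin reached its occurrence budget
            current, mass = [], 0
    bins.append(current)              # remaining rare classes form the tail
    return bins

# toy counts: one frequent head class, many rare tail classes
counts = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 3, "f": 2}
assert bin_classes(counts, num_bins=3) == [["a"], ["b", "c"], ["d", "e", "f"]]
```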

**Figure 3.** Long-tail performance (RQ-1, RQ-2) over all five head-to-tail class bins; see Figure 2. The tail class bin contains 80.7%, or 1062/1315, of classes. The non-pretrained CLESS (2) underperforms, while RoBERTa performs worst on the $80.7\%$ of tail classes. The largest pretrained CLESS model (3.XL) outperforms RoBERTa in tail and mid class prediction, while performing nearly on par on the $7/1315 \approx 0.5\%$ (most common) head classes.

**Figure 4.** Zero-shot pretraining data-efficiency by model size, pseudo-label amount, and pretraining text amount. Left: The zero-shot (data-efficiency) performance of the self-supervised pretraining base model (3) increases when adding more self-supervision pseudo labels (3.PL+) and when increasing model parameters (3.XL). Right: When using only a proportion of the pretraining input texts to pretrain model (3), its zero-shot learning slows down proportionally, but still converges towards the 100% result for all but the most extreme pretraining data reductions.

**Figure 5.** (RQ-3.3) Few-shot label-efficiency: (1) RoBERTa. (2) CLESS without pretraining. (3) CLESS with pretraining. (3.XL) CLESS pretrained with more pseudo labels and model parameters, as described in Section 5.2. $AP_{micro\_test}$ scores for few-shot portions of 100%, 50%, and 10% of training samples with real labels. CLESS 10M outperforms RoBERTa and retains 93.5% of its long-tail performance using only 10% of the fine-tuning label texts.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Rethmeier, N.; Augenstein, I.
Long-Tail Zero and Few-Shot Learning via Contrastive Pretraining on and for Small Data. *Comput. Sci. Math. Forum* **2022**, *3*, 10.
https://doi.org/10.3390/cmsf2022003010
