Efficient Latent Space Compression for Lightning-Fast Fine-Tuning and Inference of Transformer-Based Models †
Abstract
1. Introduction
- The algorithm and architecture that provide the most effective embedding-space compression were identified.
- An extensive set of experiments was conducted to find the best balance between the autoencoder’s compression rate and the model’s capability.
- The proposed architecture was tested on several datasets and evaluated on summarization, translation, and classification tasks.
- The autoencoder’s capacity to generalize to unseen datasets and diverse tasks was explored.
2. Background
3. Proposed Method
3.1. Transformer Encoder
3.2. Autoencoder
3.3. Transformer Decoder
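As a rough illustration of how the three components in this section fit together (a pre-trained Transformer encoder, a frozen autoencoder bottleneck that maps the 768-dimensional hidden states to a compressed size C, and a lightweight 3-layer Transformer decoder operating at dimension C), the following PyTorch-style sketch shows one possible wiring. The class and method names (e.g., `CompressedSeq2Seq`, `autoencoder.encode`) and all hyperparameters other than the 768 → C bottleneck and the 3-layer decoder are assumptions for illustration, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class CompressedSeq2Seq(nn.Module):
    """Hypothetical wiring: pre-trained encoder -> frozen autoencoder bottleneck
    (768 -> C) -> lightweight 3-layer decoder that operates at dimension C."""

    def __init__(self, encoder, autoencoder, vocab_size, c_dim=384, n_heads=8):
        super().__init__()
        self.encoder = encoder            # e.g., a pre-trained BERT/BART encoder (Section 3.1)
        self.autoencoder = autoencoder    # pre-trained separately, then frozen (Section 3.2)
        for p in self.autoencoder.parameters():
            p.requires_grad = False
        layer = nn.TransformerDecoderLayer(d_model=c_dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)   # Section 3.3
        self.tgt_embed = nn.Embedding(vocab_size, c_dim)
        self.lm_head = nn.Linear(c_dim, vocab_size)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encoder hidden states at the default width (768), assuming a HuggingFace-style encoder.
        hidden = self.encoder(src_ids, attention_mask=src_mask).last_hidden_state
        memory = self.autoencoder.encode(hidden)          # compress each token vector: 768 -> C
        tgt = self.tgt_embed(tgt_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=causal)  # cross-attention over the compressed memory
        return self.lm_head(out)                          # vocabulary logits
```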
4. Experiments
- AE (presented architecture): A pre-trained and frozen autoencoder that connects the encoder to a 3-layer decoder.
- AE-S: The same architecture without pre-training the autoencoder; it is trained jointly with the decoder from scratch.
- LL: A small, single-layer learnable linear model that lowers the encoder’s output dimensionality from the default size (768) to the compressed size C.
- PCA: The classical dimensionality-reduction algorithm, incremental PCA [44], fitted to project the encoder’s outputs onto the first 458 principal components, which preserve more than 90% of the variance and are used as the decoder input. (A rough sketch of these baselines follows this list.)
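For concreteness, the snippet below sketches how the LL and PCA baselines could be instantiated; the AE and AE-S baselines reuse the autoencoder described in Appendix A (pre-trained and frozen for AE, trained jointly with the decoder for AE-S). Everything except the 768 input size, the compressed size C, and the 458 PCA components is an assumption (variable names, the dummy data, the batch size), not the authors’ exact code.

```python
import torch
import torch.nn as nn
from sklearn.decomposition import IncrementalPCA

D_MODEL, C = 768, 384   # encoder output size / compressed size (C varies by experiment)

# AE / AE-S use the autoencoder from Appendix A: "AE" loads pre-trained weights and is
# frozen, while "AE-S" keeps the same architecture but is trained with the decoder
# from scratch (see the autoencoder sketch after the Appendix A tables).

# LL: a single learnable linear layer mapping the encoder output from 768 to C.
ll = nn.Linear(D_MODEL, C)

# PCA: incremental PCA fitted on encoder hidden states; the first 458 principal
# components preserve more than 90% of the variance and feed the decoder.
pca = IncrementalPCA(n_components=458)
hidden_states = torch.randn(1024, D_MODEL)       # stand-in for real encoder outputs
pca.partial_fit(hidden_states.numpy())           # repeated over batches in practice
compressed = torch.from_numpy(pca.transform(hidden_states.numpy())).float()
print(compressed.shape)                          # torch.Size([1024, 458])
```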
5. Results
5.1. Validation
5.2. Summarization
5.2.1. Computational Time
5.2.2. GPU Memory Usage
5.3. Translation
5.3.1. Computational Time
5.3.2. GPU Memory Usage
5.4. Classification
5.4.1. Computational Time
5.4.2. GPU Memory Usage
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Autoencoder Architecture
Type | Compression Rate | Number of Layers | MSE Loss
---|---|---|---
Linear | 32 | 6 | 0.0930
Linear | 64 | 4 | 0.0752
Linear | 64 | 6 | 0.0766
Linear | 64 | 8 | 0.0775
Linear | 128 | 6 | 0.0637
Linear | 256 | 6 | 0.0453
Linear | 384 | 6 | 0.0278
Linear | 448 | 6 | 0.0239
Linear | 512 | 6 | 0.0200
LSTM | 32 | 6 | 0.0905
LSTM | 64 | 2 | 0.1176
LSTM | 64 | 4 | 0.0863
LSTM | 64 | 6 | 0.0810
LSTM | 64 | 8 | 0.0849
LSTM | 64 | 10 | 0.1043
LSTM | 128 | 6 | 0.0670
LSTM | 256 | 6 | 0.0543
LSTM | 384 | 6 | 0.0462
LSTM | 448 | 6 | 0.0427
LSTM | 512 | 6 | 0.0400
CNN | 64 | 4 | 0.2666
CNN | 64 | 6 | 0.2659
CNN | 64 | 8 | 0.2750
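The MSE losses in the table come from pre-training each autoencoder to reconstruct the encoder’s hidden states. A minimal sketch of such a reconstruction pre-training loop is shown below; the function name, the Adam optimizer, the batch format, and all hyperparameters are assumptions rather than the paper’s exact setup.

```python
import torch
import torch.nn as nn

def pretrain_autoencoder(autoencoder, encoder, loader, epochs=5, lr=1e-3, device="cuda"):
    """Minimal sketch: fit the autoencoder to reconstruct frozen-encoder hidden states with MSE."""
    autoencoder.to(device).train()
    encoder.to(device).eval()
    opt = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for input_ids, attention_mask in loader:        # hypothetical batch format
            with torch.no_grad():                        # the encoder is never updated here
                hidden = encoder(input_ids.to(device),
                                 attention_mask=attention_mask.to(device)).last_hidden_state
            recon = autoencoder(hidden)                  # encode to C, decode back to 768
            loss = mse(recon, hidden)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return autoencoder
```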
First Projection (P1) | Second Projection (P2) | Third Projection/ Compressed Latent Space Size (C) |
---|---|---|
640 | 576 | 512 |
608 | 528 | 448 |
576 | 480 | 384 |
640 | 320 | 256 |
512 | 256 | 128 |
512 | 256 | 64 |
512 | 256 | 32 |
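Read row-wise, the table gives the two intermediate projection sizes for each compressed latent size C; for example, the C = 512 configuration projects 768 → 640 → 576 → 512 on the encoder side and mirrors those sizes on the decoder side, six linear layers in total, matching the 6-layer configurations above. The sketch below builds such a linear autoencoder from the table; the Tanh activation and anything else not listed in the table are assumptions.

```python
import torch.nn as nn

# Intermediate projection sizes (P1, P2) from the table above, keyed by compressed size C.
PROJECTIONS = {512: (640, 576), 448: (608, 528), 384: (576, 480),
               256: (640, 320), 128: (512, 256), 64: (512, 256), 32: (512, 256)}

def build_linear_autoencoder(c, d_model=768, act=nn.Tanh):
    """6-layer linear AE: 768 -> P1 -> P2 -> C, then the mirrored decoder back to 768.
    The activation between layers is an assumption, not taken from the paper."""
    p1, p2 = PROJECTIONS[c]
    encoder = nn.Sequential(nn.Linear(d_model, p1), act(),
                            nn.Linear(p1, p2), act(),
                            nn.Linear(p2, c))
    decoder = nn.Sequential(nn.Linear(c, p2), act(),
                            nn.Linear(p2, p1), act(),
                            nn.Linear(p1, d_model))
    return nn.Sequential(encoder, decoder)

ae_384 = build_linear_autoencoder(384)   # 768 -> 576 -> 480 -> 384 -> 480 -> 576 -> 768
```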
Appendix B. Full Results of the Validation Experiments
Models (Greedy) | R-1 | R-2 | R-L
---|---|---|---
Transformer | 0.346 | 0.143 | 0.312 |
+AE (C = 512) | 0.368 (106%) | 0.157 (109%) | 0.325 (104%) |
+AE (C = 384) | 0.363 (104%) | 0.154 (107%) | 0.322 (103%) |
+AE (C = 128) | 0.308 (89%) | 0.114 (79%) | 0.286 (91%) |
+AE (C = 32) | 0.156 (45%) | 0.019 (13%) | 0.174 (55%) |
BARTenc | 0.355 | 0.142 | 0.310 |
+AE (C = 512) | 0.341 (96%) | 0.128 (90%) | 0.298 (96%) |
+AE (C = 384) | 0.332 (93%) | 0.121 (85%) | 0.291 (93%) |
+AE (C = 128) | 0.257 (72%) | 0.063 (44%) | 0.239 (77%) |
+AE (C = 32) | 0.145 (40%) | 0.014 (9%) | 0.168 (54%) |
BERT | 0.349 | 0.133 | 0.306 |
+AE (C = 512) | 0.339 (97%) | 0.123 (92%) | 0.298 (97%) |
+AE (C = 384) | 0.332 (95%) | 0.119 (89%) | 0.294 (96%) |
+AE (C = 128) | 0.278 (79%) | 0.074 (55%) | 0.256 (83%) |
+AE (C = 32) | 0.168 (48%) | 0.021 (15%) | 0.187 (61%) |
DistilBERT | 0.317 | 0.124 | 0.283 |
+AE (C = 512) | 0.333 (105%) | 0.123 (99%) | 0.298 (105%) |
+AE (C = 384) | 0.334 (105%) | 0.122 (98%) | 0.297 (104%) |
+AE (C = 128) | 0.287 (90%) | 0.083 (66%) | 0.265 (93%) |
+AE (C = 32) | 0.161 (50%) | 0.020 (16%) | 0.180 (63%) |
Models (Random Sampling) | R-1 | R-2 | R-L
---|---|---|---
Transformer | 0.344 | 0.136 | 0.304 |
+AE (C = 512) | 0.363 (105%) | 0.147 (108%) | 0.315 (103%) |
+AE (C = 384) | 0.360 (104%) | 0.146 (107%) | 0.314 (103%) |
+AE (C = 128) | 0.315 (91%) | 0.110 (80%) | 0.280 (92%) |
+AE (C = 32) | 0.184 (53%) | 0.024 (17%) | 0.185 (60%) |
BARTenc | 0.349 | 0.134 | 0.301 |
+AE (C = 512) | 0.337 (96%) | 0.120 (89%) | 0.289 (96%) |
+AE (C = 384) | 0.327 (93%) | 0.112 (83%) | 0.282 (93%) |
+AE (C = 128) | 0.260 (74%) | 0.058 (43%) | 0.232 (77%) |
+AE (C = 32) | 0.174 (49%) | 0.019 (14%) | 0.179 (59%) |
BERT | 0.347 | 0.124 | 0.297 |
+AE (C = 512) | 0.339 (97%) | 0.116 (93%) | 0.289 (97%) |
+AE (C = 384) | 0.333 (95%) | 0.112 (90%) | 0.286 (96%) |
+AE (C = 128) | 0.288 (82%) | 0.072 (58%) | 0.252 (84%) |
+AE (C = 32) | 0.197 (56%) | 0.026 (20%) | 0.194 (65%) |
DistilBERT | 0.316 | 0.117 | 0.275 |
+AE (C = 512) | 0.332 (105%) | 0.116 (99%) | 0.290 (105%) |
+AE (C = 384) | 0.334 (105%) | 0.115 (98%) | 0.288 (104%) |
+AE (C = 128) | 0.297 (93%) | 0.081 (69%) | 0.259 (94%) |
+AE (C = 32) | 0.189 (59%) | 0.024 (20%) | 0.188 (68%) |
Models (Beam Search) | R-1 | R-2 | R-L
---|---|---|---
Transformer | 0.259 | 0.116 | 0.261 |
+AE (C = 512) | 0.288 (111%) | 0.127 (109%) | 0.280 (107%) |
+AE (C = 384) | 0.278 (107%) | 0.123 (106%) | 0.274 (104%) |
+AE (C = 128) | 0.280 (108%) | 0.116 (100%) | 0.271 (103%) |
+AE (C = 32) | 0.132 (50%) | 0.021 (18%) | 0.147 (56%) |
BARTenc | 0.304 | 0.128 | 0.283 |
+AE (C = 512) | 0.312 (102%) | 0.126 (98%) | 0.285 (101%) |
+AE (C = 384) | 0.278 (91%) | 0.123 (96%) | 0.274 (96%) |
+AE (C = 128) | 0.245 (80%) | 0.071 (55%) | 0.234 (82%) |
+AE (C = 32) | 0.128 (42%) | 0.015 (11%) | 0.146 (51%) |
BERT | 0.283 | 0.117 | 0.270 |
+AE (C = 512) | 0.291 (102%) | 0.117 (100%) | 0.275 (101%) |
+AE (C = 384) | 0.272 (96%) | 0.107 (91%) | 0.263 (97%) |
+AE (C = 128) | 0.242 (85%) | 0.076 (64%) | 0.237 (87%) |
+AE (C = 32) | 0.140 (49%) | 0.020 (17%) | 0.153 (56%) |
DistilBERT | 0.302 | 0.123 | 0.280 |
+AE (C = 512) | 0.287 (95%) | 0.116 (94%) | 0.275 (98%) |
+AE (C = 384) | 0.282 (93%) | 0.114 (92%) | 0.270 (96%) |
+AE (C = 128) | 0.240 (79%) | 0.082 (66%) | 0.238 (85%) |
+AE (C = 32) | 0.134 (44%) | 0.019 (15%) | 0.147 (52%) |
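The percentages in parentheses throughout Appendix B appear to express each compressed model’s score relative to its vanilla counterpart under the same decoding strategy, i.e. relative score = (score of +AE model) / (score of vanilla model) × 100%; for example, 0.368 / 0.346 ≈ 106% for Transformer + AE (C = 512) with greedy decoding. This reading is inferred from the reported numbers rather than stated here explicitly.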
Appendix C. BERTScore Results (Validation Experiment)
Model | Inference Method | Vanilla | +AE (C = 512) | +AE (C = 384) | +AE (C = 128) | +AE (C = 32)
---|---|---|---|---|---|---
Transformer | Greedy | 0.858 | 0.861 | 0.860 | 0.846 | 0.801
Transformer | Random | 0.857 | 0.860 | 0.859 | 0.847 | 0.809
Transformer | Beam | 0.852 | 0.858 | 0.857 | 0.853 | 0.805
BART | Greedy | 0.867 | 0.865 | 0.863 | 0.845 | 0.814
BART | Random | 0.869 | 0.864 | 0.862 | 0.845 | 0.819
BART | Beam | 0.866 | 0.866 | 0.865 | 0.851 | 0.821
BERT | Greedy | 0.858 | 0.857 | 0.854 | 0.854 | 0.809
BERT | Random | 0.857 | 0.856 | 0.854 | 0.843 | 0.815
BERT | Beam | 0.841 | 0.854 | 0.854 | 0.846 | 0.814
DistilBERT | Greedy | 0.798 | 0.855 | 0.855 | 0.842 | 0.809
DistilBERT | Random | 0.836 | 0.855 | 0.854 | 0.844 | 0.815
DistilBERT | Beam | 0.802 | 0.856 | 0.855 | 0.846 | 0.814
Appendix D. Extra Examples of Generated Summaries (Validation Experiment)
Model | Vanilla Model-Generated Summary (Greedy) | +AE Model-Generated Summary (Greedy) |
---|---|---|
Transformer | masked men armed with handguns have robbed three banks in pittsburgh area. they are believed to have had military training and are being described as ‘armed and extremely dangerous’. the men are believed to have threatened to kidnapping those at their targets and shoot police. however, the way that the men handle | two men armed with handguns robbed three banks in pittsburgh area so far this year. the unknown men, who are seen on surveillance footage pointing their guns at bank employees’ heads, have threatened to kidnapping those at their targets and shoot police. however, the way the men handle their weapons has led |
BARTenc | two men are believed to have had military training and are being described by the fbi as ‘armed and extremely dangerous’. the men are seen holding their finger stretched along the barrel of his gun, just off of the trigger, a safety method used by law enforcement. the men, who wear dark | two pennsylvania bank robber are believed to have had military training. they are believed to have been armed and extremely dangerous. the men are believed to have been armed with a pair of masked men armed with handgun. the men are believed to have been from pittsburgh. |
BERT | two pennsylvania bank robbers have robbed three banks in the pittsburgh area so far this year. the unknown men, who are seen on surveillance footage, have threatened to kidnapping those at their targets and shoot police. the two men, both 5′ 5′–9′ and april 10, are described as | two pennsylvania bank robbers armed with handguns have been robbed in the pittsburgh area so far this year. they have been seen jumping over the counter as they take their guns at targets and shoot them at police. the two men, who are seen on surveillance footage, have threatened to kidnapping those at their |
DistilBERT | two robbers have been seen in a series of recent heisting robberies. the men are believed to have had military training and are being described as ‘armed and extremely dangerous’. the men are believed to have been armed and armed. the men are believed to have been armed and extremely dangerous. | two masked men armed with handguns have robbed three banks in the pittsburgh area so far this year. they are believed to have had military training and are being described by fbi as ‘armed and extremely dangerous’. the men are believed to have had military training and are being described by the fbi as |
Model | Vanilla Model-Generated Summary (Weighted Random Sampling) | +AE Model-Generated Summary (Weighted Random Sampling) |
---|---|---|
Transformer | masked men armed with handguns have robbed three banks in pittsburgh area so far this year. they are believed to be armed and extremely dangerous. they are thought to have been armed with handguns and are thought to be from pittsburgh. the suspects are described as white, 5′ 8′ to 5 | the men, who are seen on surveillance footage pointing guns at bank employees’ heads, have threatened to kidnapping those at their targets and shoot police. however, the way that the two men handle their weapons has led the fbi to suspect that the thieves are actually former police officers themselves. they are also |
BARTenc | two men have been robbed by the fbi since april 10, according to surveillance footage. they have been seen holding his finger stretched along the barrel of his gun. they have been seen jumping over the counter as they begin their heists. the two robbers have a gun worn during the robberies | two pennsylvania bank robbery suspects have been seen in a string of recent heists. the suspects are believed to have been from pittsburgh. the suspects are believed to be from pittsburgh because of their attitudes. |
BERT | the unknown men, who are seen on surveillance footage, have threatened to kidnap those at their targets and shoot police. the two men, both 5′ 5′–9′ and april 10, have also been taken to the bank in pittsburgh, pennsylvania. the fbi believes the two suspects may have | two pennsylvania bank robbers armed as they do a series of recent robberies. they have been described as ‘armed and extremely dangerous’. they have been seen on surveillance footage showing the two men. they have been described as ‘armed and extremely dangerous’ and dangerous. |
DistilBERT | the men, who wear dark sweatpants, are believed to be armed and extremely violent. the two men are thought to have been armed and armed. they are believed to be from pittsburgh, pennsylvania, who have been robbed three banks. the men are thought to have been wearing the gun and a gun. | two masked men are thought to have robbed three banks in pittsburgh this year. they are believed to have been armed and extremely dangerous. they have been described as armed and extremely dangerous. |
Model | Vanilla Model-Generated Summary (Beam Search) | +AE Model-Generated Summary (Beam Search) |
---|---|---|
Transformer | masked men armed with handguns have robbed three banks in pittsburgh area so far this year, most recently on april 10 | two men armed with handguns robbed three banks in pittsburgh area so far this year, most recently on april 10 |
BARTenc | the men, who are seen on surveillance footage pointing their guns at bank employees’ heads, have threatened to kidnapping those at their targets and shoot police. the two men are actually former police officers themselves. | two pennsylvania bank thieves are believed to have had military training and are being described by the fbi as ‘armed and extremely dangerous’. |
BERT | two pennsylvania bank robbers are believed to have had military training and are being described by | two pennsylvania bank robbers armed with handguns have been robbed in the pittsburgh area so far this year, most recently on april 10 |
DistilBERT | a pair of masked men armed with handguns have robbed three banks in the pittsburgh area so far this year, most recently on april 10. the unknown men, who are seen on surveillance footage pointing their guns at bank employees’ heads, have threatened to kidnapping and shoot police. | two masked men armed with handguns have robbed three banks in the pittsburgh area so far this year, most recently on april 10 |
Appendix E. Extra Results for BERT Model + AE (Validation Experiment)
Model | Inference Method | R-1 | R-2 | R-L
---|---|---|---|---
BERT + AE (C = 448) | Greedy | 0.337 | 0.123 | 0.297
BERT + AE (C = 448) | Random | 0.337 | 0.115 | 0.289
BERT + AE (C = 448) | Beam | 0.283 | 0.113 | 0.270
BERT + AE (C = 256) | Greedy | 0.323 | 0.109 | 0.285
BERT + AE (C = 256) | Random | 0.323 | 0.101 | 0.277
BERT + AE (C = 256) | Beam | 0.272 | 0.103 | 0.262
BERT + AE (C = 64) | Greedy | 0.250 | 0.048 | 0.226
BERT + AE (C = 64) | Random | 0.234 | 0.047 | 0.227
BERT + AE (C = 64) | Beam | 0.195 | 0.047 | 0.199
Appendix F. Samples of Generated Summaries (BART-Base Experiment)
Vanilla BART-Base | +AE (C = 384) |
---|---|
Rifaat al-Assad, 77, was kicked out of Syria ‘with nothing’ 30 years ago. He went into exile after staging failed coup against brother Hafez al Assad. Activists say his fortune was stolen during his time at heart of Syrian regime. Mr Al-Assad has vehemently denied acquiring assets in France through illegal means. Lawyer says his client’s property holdings dated back to 1984–1986. | Rifaat al-Assad, 77, went into exile in Europe after staging a failed coup. He has spent more than 30 years living a life of luxury moving between homes in Paris, London and the southern Spanish city of Marbella. His family’s assets, outlined by French customs in May 2014, are valued at around £64 million—much of it held through a web of businesses based in Luxembourg. Al-Assad has vehemently denied acquiring assets in France through illegal means. |
Rand Paul, a libertarian-leaning Kentucky senator, launched his presidential bid Tuesday in Louisville, Kentucky. He sparred with TODAY host Savannah Guthrie about his past foreign policy positions. ‘Why don’t we let me explain instead of talking over me, OK?’ he griped. Guthrie obliged, asking him if he had changed his views, but he charged ahead. | Rand Paul, a libertarian-leaning Kentucky senator, sparred with Today host Savannah Guthrie about his past foreign policy positions. Paul, who launched his presidential bid on Tuesday in Louisville, Kentucky, was joined by his wife Kelley Ashby on stage Tuesday as he declared that he would campaign to ‘take our country back’ ‘If they’re immediately saying that the agreement doesn’t mean what President Obama says, that is a big problem,’ Paul said Wednesday. |
The search area for missing Malaysia Airlines Flight 370 looks set to double in size. The search will stretch into a new equally vast area, officials from Malaysia, Australia and China say. Families of passengers and crew members still have no answers about what happened to their loved ones. | The search area for missing Malaysia Airlines Flight 370 looks set to double in size. So far, they’ve covered 60% of the priority search zone without reporting any trace of the airliner. The search of the 60,000-square- kilometer area is expected to be completed in May. |
Japanese Prime Minister Shinzo Abe is scheduled to speak Wednesday to a joint meeting of Congress. Julian Zelizer: Abe arrives in Washington at an opportune time to help along the economic centerpiece of the “pivot” Zelizer: The immediate battle in Congress is not over the TPP directly, but something called trade promotion authority. | David Rothkopf: Japanese Prime Minister Shinzo Abe to speak Wednesday to Congress. He says U.S.-Japan relations have been strained by trade promotion authority, but it’s not over. RothkopF: Obama administration needs to sell “pivot” or “rebalance” to Americans. |
Inverness Caley Thistle defender Josh Meekings has been banned for one match. Meekings was charged over the handball that thwarted a Leigh Griffiths effort. Celtic wrote to the SFA seeking ‘an understanding’ of why no penalty and red card followed. FIFA vice-president Jim Boyce says the suspension is wrong. | Inverness defender Josh Meekings has been suspended for one match. The defender was charged over the handball that thwarted a Leigh Griffiths effort in their semi-final victory. FIFA vice-president Jim Boyce says the ban should be made if the Scottish FA feel the officials in charge of this game acted improperly and made the wrong decision. |
Appendix G. Samples of Generated Translation (BART-Base Experiment)
Vanilla BART-Base | +AE (C = 384) |
---|---|
s-a ajuns, de fapt, dintr-o reuniune unică de 12 persoane convocată pentru a dezvolta un model de analiză a deciziilor cu mai multe criterii pentru a-şi sintet aliza opiniile cu privire la efectele asociate cu diferite produse cu conţinut de ngl; rezultatele reuniunii au fost rezumate într-un document de cercetare. | rea vine de la o singură reuniune a 12 persoane convocată pentru a dezvolta un model multi-critic de luare a deciziilor (mala da) pentru a-şi sintet aliza opiniile cu privire la riscurile asociate cu diferite produse care conţin fumul de tutun; rezultatele reuniunii au fost rezumate într-o lucrare de cercetare. |
ball y a apărat abordarea părţii sale şi a declarat că s-au axat doar pe “ contactul “ lor atunci când au intrat în conflict. | y a apărat abordarea echipei sale şi a declarat că s-a concentrat doar asupra “ contactului “ lor atunci când se luptă. |
în urmă cu câteva zile, fostul director al oficiului, 41 iza nedelcheva, şi alţi foşti sau actuali angajaţi ai cp ci au fost urmăriţi penal în acest caz, fiind acuzaţi că au plătit pentru serviciile percepute de companiile menţionate, deşi lucrările nu au fost niciodată realizate. | la câteva zile în urmă, fostul şef al biroului, na riza ne zele cu, şi alţi foşti sau actuali angajaţi ai co ci au fost judecaţi în acest caz, fiind acuzaţi de plata serviciilor percepute de companiile menţionate, deşi lucrările nu au fost niciodată efectuate. |
unul din doi fumători de-a lungul vieţii moare din cauza dependenţei. | unul din doi fumători de-a lungul vieţii moare din cauza dependenţei. |
vom discuta şi vom vedea. | o să vorbim şi să vedem. |
References
- Falaki, A.A.; Gras, R. A Robust Approach to Fine-Tune Pre-Trained Transformer-Based models for Text Summarization through Latent Space Compression. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas, 12–14 December 2022; pp. 160–167. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, Available online: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 1 April 2023).
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. Available online: https://arxiv.org/abs/2303.08774 (accessed on 1 April 2023).
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. arXiv 2019, arXiv:1906.02243. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2021, arXiv:2101.03961. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
- Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2020; pp. 11328–11339. [Google Scholar]
- Liou, C.-Y.; Cheng, W.-C.; Liou, J.-W.; Liou, D.-R. Autoencoder for words. Neurocomputing 2014, 139, 84–96. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, P.; Radev, D.; Neubig, G. BRIO: Bringing Order to Abstractive Summarization. arXiv 2022, arXiv:2203.16804. [Google Scholar]
- Goyal, T.; Rajani, N.F.; Liu, W.; Kryściński, W. Hydrasum: Disentangling stylistic features in text summarization using multi-decoder models. arXiv 2021, arXiv:2110.04400. [Google Scholar]
- Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
- Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding back-translation at scale. arXiv 2018, arXiv:1808.09381. [Google Scholar]
- Takase, S.; Kiyono, S. Lessons on parameter sharing across layers in transformers. arXiv 2021, arXiv:2104.06022. [Google Scholar]
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763. [Google Scholar]
- Mao, Y.; Wang, Y.; Wu, C.; Zhang, C.; Wang, Y.; Zhang, Q.; Yang, Y.; Tong, Y.; Bai, J. LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression. In Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 8–13 December 2020; pp. 3225–3234. [Google Scholar]
- Ding, S.; Shang, J.; Wang, S.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-Doc: A Retrospective Long-Document Modeling Transformer. arXiv 2020, arXiv:2012.15688. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision. In International Conference on Machine Learning; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2015; pp. 1737–1746. [Google Scholar]
- Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed precision training. arXiv 2018, arXiv:1710.03740. [Google Scholar]
- Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [Google Scholar]
- Chatterjee, D. Making neural machine reading comprehension faster. arXiv 2019, arXiv:1904.00796. [Google Scholar]
- Turc, I.; Chang, M.-W.; Lee, K.; Toutanova, K. Well-read students learn better: On the importance of pre-training compact models. arXiv 2019, arXiv:1908.08962. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Shleifer, S.; Rush, A.M. Pre-trained summarization distillation. arXiv 2020, arXiv:2010.13002. [Google Scholar]
- LeCun, Y.; Denker, J.S.; Solla, S.A. Optimal brain damage. Adv. Neural Inf. Process. Syst. 1990, 2, 598–605. [Google Scholar]
- Gordon, M.A.; Duh, K.; Andrews, N. Compressing bert: Studying the effects of weight pruning on transfer learning. arXiv 2020, arXiv:2002.08307. [Google Scholar]
- Michel, P.; Levy, O.; Neubig, G. Are sixteen heads really better than one? arXiv 2019, arXiv:1905.10650. [Google Scholar]
- Sajjad, H.; Dalvi, F.; Durrani, N.; Nakov, P. Poor Man’s BERT: Smaller and Faster Transformer Models. arXiv 2020, arXiv:2004.03844. [Google Scholar]
- Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635. [Google Scholar]
- Prasanna, S.; Rogers, A.; Rumshisky, A. When bert plays the lottery, all tickets are winning. arXiv 2020, arXiv:2005.00561. [Google Scholar]
- Lagunas, F.; Charlaix, E.; Sanh, V.; Rush, A.M. Block pruning for faster transformers. arXiv 2021, arXiv:2109.04838. [Google Scholar]
- Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for Longer Sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
- Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-attention with linear complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar]
- Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
- Hou, L.; Huang, Z.; Shang, L.; Jiang, X.; Chen, X.; Liu, Q. Dynabert: Dynamic bert with adaptive width and depth. arXiv 2020, arXiv:2004.04037. [Google Scholar]
- Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. Mobilebert: A compact task-agnostic bert for resource-limited devices. arXiv 2020, arXiv:2004.02984. [Google Scholar]
- Tambe, T.; Hooper, C.; Pentecost, L.; Jia, T.; Yang, E.-Y.; Donato, M.; Sanh, V.; Whatmough, P.; Rush, A.M.; Brooks, D.; et al. EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference. arXiv 2020, arXiv:2011.14203. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Ross, D.A.; Lim, J.; Lin, R.-S.; Yang, M.-H. Incremental learning for robust visual tracking. Int. J. Comput. Vis. 2008, 77, 125–141. [Google Scholar] [CrossRef]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Association for Computational Linguistics. Barcelona, Spain, 10–17 July 2004; pp. 74–81. Available online: https://aclanthology.org/W04-1013 (accessed on 1 April 2023).
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics. Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef] [Green Version]
- Smith, L.N.; Topin, N. Super-Convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications; SPIE: Washington, DC, USA, 2019; Volume 11006, p. 1100612. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł.; Hinton, G. Regularizing neural networks by penalizing confident output distributions. arXiv 2017, arXiv:1701.06548. [Google Scholar]
- Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and comprehend. Adv. Neural Inf. Process. Syst. 2015, 28, 1693–1701. [Google Scholar]
- Grusky, M.; Naaman, M.; Artzi, Y. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv 2018, arXiv:1804.11283. [Google Scholar]
- Bojar, O.; Chatterjee, R.; Federmann, C.; Graham, Y.; Haddow, B.; Huck, M.; Jimeno Yepes, A.; Koehn, P.; Logacheva, V.; Monz, C.; et al. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 131–198. [Google Scholar]
- Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. CS224N Proj. Rep. Stanf. 2009, 1, 2009. [Google Scholar]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
Type | 4 Layers | 6 Layers | 8 Layers
---|---|---|---
Linear | 0.0813 | 0.0766 | 0.0776 |
LSTM | 0.0863 | 0.0810 | 0.0849 |
CNN | 0.2666 | 0.2659 | 0.2750 |
Decoder Input Dimension | AE Parameters Count | Decoder Parameters Count: (Distil)BERT/Transformer | Reduction (%) | Decoder Parameters Count: BARTenc | Reduction (%)
---|---|---|---|---|---
768 (Default) | - | 32 M | - | 48 M | - |
C = 512 | 2.3 M | 21 M | 26.26 | 32 M | 28.48 |
C = 384 | 1.8 M | 16 M | 44.44 | 24 M | 46.18 |
C = 128 | 1.1 M | 5 M | 79.84 | 8 M | 80.91 |
C = 32 | 1 M | 1 M | 92.47 | 2 M | 93.5 |
Decoder Input Dimension | AE Parameters Count | Total Parameters: Transformer | Total Parameters: BART | Total Parameters: BERTenc | Total Parameters: DistilBERT
---|---|---|---|---|---
768 (Default) | - | 70 M | 188 M | 142 M | 98 M |
C = 512 | 2.3 M | 61 M | 174 M | 134 M | 90 M |
C = 384 | 1.8 M | 55 M | 165 M | 128 M | 84 M |
C = 128 | 1.1 M | 44 M | 149 M | 116 M | 75 M
C = 32 | 1 M | 40 M | 143 M | 112 M | 68 M |
Models | R-1 | R-2 | R-L
---|---|---|---
Transformer | 0.346 | 0.143 | 0.312
+AE (C = 512) | 0.368 (106%) | 0.157 (109%) | 0.325 (104%)
+AE (C = 384) | 0.363 (104%) | 0.154 (107%) | 0.322 (103%)
+AE (C = 128) | 0.308 (89%) | 0.114 (79%) | 0.286 (91%)
+AE (C = 32) | 0.156 (45%) | 0.019 (13%) | 0.174 (55%)
BARTenc | 0.355 | 0.142 | 0.310
+AE (C = 512) | 0.341 (96%) | 0.128 (90%) | 0.298 (96%)
+AE (C = 384) | 0.332 (93%) | 0.121 (85%) | 0.291 (93%)
+AE (C = 128) | 0.257 (72%) | 0.063 (44%) | 0.239 (77%)
+AE (C = 32) | 0.145 (40%) | 0.014 (9%) | 0.168 (54%)
BERT | 0.349 | 0.133 | 0.306
+AE (C = 512) | 0.339 (97%) | 0.123 (92%) | 0.298 (97%)
+AE (C = 384) | 0.332 (95%) | 0.119 (89%) | 0.294 (96%)
+AE (C = 128) | 0.278 (79%) | 0.074 (55%) | 0.256 (83%)
+AE (C = 32) | 0.168 (48%) | 0.021 (15%) | 0.187 (61%)
DistilBERT | 0.317 | 0.124 | 0.283
+AE (C = 512) | 0.333 (105%) | 0.123 (99%) | 0.298 (105%)
+AE (C = 384) | 0.334 (105%) | 0.122 (98%) | 0.297 (104%)
+AE (C = 128) | 0.287 (90%) | 0.083 (66%) | 0.265 (93%)
+AE (C = 32) | 0.161 (50%) | 0.020 (16%) | 0.180 (63%)
Models | R-1 | R-2 | R-L
---|---|---|---
BERT | 0.349 | 0.133 | 0.306
+AE (C = 512) | 0.339 (97%) | 0.123 (92%) | 0.298 (97%)
+AE-S | 0.289 (82%) | 0.079 (59%) | 0.262 (85%)
+LL | 0.277 (79%) | 0.083 (62%) | 0.260 (84%)
+PCA | 0.143 (40%) | 0.016 (12%) | 0.156 (50%)
Models | R-1 | R-2 | R-L | # Dec Params
---|---|---|---|---
BART | 0.419 | 0.198 | 0.393 | 96 M
BARTfe + AE (C = 504) | 0.401 | 0.182 | 0.375 | 56 M
BARTfe + AE (C = 384) | 0.400 | 0.181 | 0.374 | 40 M
BARTfe + AE (C = 120) | 0.351 | 0.133 | 0.331 | 21 M
BARTfe + AE (C = 24) | 0.182 | 0.026 | 0.170 | 11 M
Model | Fine-Tuning Time (min) | Inference Time (min)
---|---|---
BART | 145 | 309
BART + AE (C = 384) | 79 | 140
Model | Fine-Tuning GPU Memory (MB) | Inference GPU Memory (MB)
---|---|---
BART | 28,384 | 30,624
BART + AE (C = 384) | 10,240 | 23,680
Models | BLEU | # Dec Params
---|---|---
BART | 21.05 | 96 M
BARTfe + AE (C = 504) | 18.93 | 56 M
BARTfe + AE (C = 384) | 18.63 | 40 M
BARTfe + AE (C = 120) | 13.95 | 21 M
BARTfe + AE (C = 24) | 1.27 | 11 M
Model | Fine-Tuning Time (min) | Inference Time (min)
---|---|---
BART | 247 | 23
BART + AE (C = 384) | 110 | 21
Model | Fine-Tuning GPU Memory (MB) | Inference GPU Memory (MB)
---|---|---
BART | 25,024 | 31,680
BART + AE (C = 384) | 13,120 | 27,200
Models | Accuracy (%) | # Classifier Head Params
---|---|---
BART | 86.73 | 592 K
BARTfe | 80.19 | 592 K
BARTfe + AE (C = 504) | 78.12 | 263 K
BARTfe + AE (C = 384) | 75.94 | 148 K
BARTfe + AE (C = 120) | 70.04 | 16 K
BARTfe + AE (C = 24) | 59.43 | 1 K
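The head sizes above are roughly what a small two-layer classification head on the compressed representation would give, e.g. a dense C × C layer followed by a 2-way output for binary sentiment. The paper’s exact head is not reproduced here, so the sketch below (layer structure, dropout, Tanh, and the two-label output) is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Assumed two-layer head on top of the compressed sentence representation (dim C)."""
    def __init__(self, c_dim=384, num_labels=2, dropout=0.1):
        super().__init__()
        self.dense = nn.Linear(c_dim, c_dim)
        self.dropout = nn.Dropout(dropout)
        self.out_proj = nn.Linear(c_dim, num_labels)

    def forward(self, pooled):                  # pooled: (batch, C)
        x = self.dropout(torch.tanh(self.dense(pooled)))
        return self.out_proj(x)

head = ClassificationHead(384)
print(sum(p.numel() for p in head.parameters()))   # 148610, roughly the 148 K of the C = 384 row
```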
Model | Fine-Tuning Time (min) | Inference Time (min)
---|---|---
BART | 1200 | 83
BARTfe | 450 | 83
BART + AE (C = 384) | 480 | 85
Model | Fine-Tuning GPU Memory (MB) | Inference GPU Memory (MB)
---|---|---
BART | 27,264 | 22,831
BARTfe | 3244 | 22,831
BART + AE (C = 384) | 3264 | 22,966
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).