Systematic Review

State of the Art and Future Directions of Small Language Models: A Systematic Review

by Flavio Corradini 1, Matteo Leonesi 1 and Marco Piangerelli 1,2,*
1 Computer Science Division, School of Science and Technology, University of Camerino, Via Madonna delle Carceri 7, 62032 Camerino, Italy
2 Vici & C. S.p.A., Via Gutenberg 5, 47822 Santarcangelo di Romagna, Italy
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(7), 189; https://doi.org/10.3390/bdcc9070189
Submission received: 31 May 2025 / Revised: 15 July 2025 / Accepted: 17 July 2025 / Published: 21 July 2025

Abstract

Small Language Models (SLMs) have emerged as a critical area of study within natural language processing, attracting growing attention from both academia and industry. This systematic literature review provides a comprehensive and reproducible analysis of recent developments and advancements in SLMs post-2023. Drawing on 70 English-language studies published between January 2023 and January 2025, identified through Scopus, IEEE Xplore, Web of Science, and ACM Digital Library, and focusing primarily on SLMs (including those with up to 7 billion parameters), this review offers a structured overview of the current state of the art and potential future directions. Designed as a resource for researchers seeking an in-depth global synthesis, the review examines key dimensions such as publication trends, visual data representations, contributing institutions, and the availability of public datasets. It highlights prevailing research challenges and outlines proposed solutions, with a particular focus on widely adopted model architectures, as well as common compression and optimization techniques. This study also evaluates the criteria used to assess the effectiveness of SLMs and discusses emerging de facto standards for industry. The curated data and insights aim to support and inform ongoing and future research in this rapidly evolving field.

1. Introduction

Large Language Models (LLMs) such as GPT-4 and Llama show impressive abilities to understand and generate natural language. The term “large language model” was first introduced in 2007 [1] and became widely adopted following the development of Transformer architectures [2] and GPT-style decoder models [3]. However, the computational costs and memory requirements of these models limit their accessibility and applicability in scenarios with limited hardware resources. Other common problems are hallucination, inaccuracy, bias, and difficulty adapting to specific domains [4,5,6,7].
In direct response to these limitations, the field has seen the emergence of Small Language Models (SLMs). The concept of “small language models” was introduced in 2021 by Schick and Schütze, who demonstrated their effectiveness as few-shot learners [8]. Unlike their larger counterparts, SLMs are typically models with 7 billion parameters (7B) or fewer, such as Meta’s Llama 2 [9] and Microsoft’s Phi-3 family [10]. These models are not simply scaled-down versions of LLMs; they are purposefully designed to minimize runtime costs, such as inference latency, memory footprint, and power consumption.
However, efficiency is not their only strength. SLMs can run locally on edge systems, low-power servers, and in latency-critical contexts. Because inference stays on the device, without relying on external servers, these models also enhance privacy and security. See Table 1 for a detailed comparison. They are also adaptable and customizable, which makes them suitable for domain-specific use and easy to combine with LLMs [11,12,13].
Recent literature shows increasing interest in Small Language Models as a substitute for LLMs; various surveys have appeared to summarize this trend. For example, Nguyen et al. present a general survey of SLM techniques, focusing on model architectures and efficiency optimizations [13]. Similarly, Wang et al. offer a general review that covers SLM definitions, construction methods, applications, and trustworthiness considerations [12]. Other works have combined survey and experimentation: Lu et al. provide a general survey with empirical benchmarks for open-source SLMs (up to 5B parameters) to assess their capabilities and on-device performance [11]. Meanwhile, domain-specific efforts exist: Garg et al. conduct an SLR of SLMs in healthcare, illustrating their potential in clinical applications [14]. However, previous surveys are largely narrative in approach or solely confined to one specific area. By contrast, the present work is a systematic literature review (SLR) that assembles and rigorously screens the full January 2023–January 2025 corpus on SLMs.
By integrating bibliometric analysis, synthesizing prevailing architectures, compression techniques, and evaluation benchmarks, we provide a structured assessment of research gaps and emerging challenges not fully captured by previous surveys.
To structure this metric-driven synthesis, our review is organized around four key questions designed to map the research landscape: providing a panoramic overview of the types of papers, applications, and public availability; examining core technology by synthesizing how SLMs are built and made efficient; assessing how SLMs are evaluated by analyzing common benchmarks; and, finally, identifying the primary challenges and research gaps the field is currently facing. This approach provides a solid foundation for understanding the current state of SLMs and an evidence base to guide future studies and accelerate progress. To answer our research questions, we extracted data from the included records and then processed and analyzed them to derive the insights and answers reported below. This paper is organized as follows:
  • Section 2—Methodology for the SLR illustrates the study design, the criteria adopted for the selection of publications, and the metrics used to ensure the reproducibility of the research.
  • Section 3—Small Language Model background overview delves into the theoretical and practical bases of such models, highlighting the differences with respect to Large Language Models and the attributes that characterize them.
  • Section 4, Section 5, Section 6 and Section 7—Discussion of the research questions present the answers to the four research questions posed, critically analyzing the results obtained. In these sections, the most common concepts present in the state of the art are described and explained for each topic covered.
  • Section 8—Future directions and new solutions for SLM challenges provides an overview of future SLM development by examining emerging strategies designed to address the challenges identified in Research Question 4.
  • Section 9—Limitations of the study evaluates the scope and potential biases of our review (parameter-count thresholds, English-language restriction, and database coverage) and suggests concrete ways future reviews can expand coverage.
  • Section 10—Conclusions closes the paper by synthesizing the main findings and providing clear, actionable insights for researchers.

2. Methodology for the Systematic Literature Review

We conducted our SLR in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [15]. The reader can find the complete checklist in the Supplementary Materials. Four primary databases were consulted: Scopus, IEEE Xplore, Web of Science, and ACM Digital Library. These databases were chosen for their comprehensive coverage of the academic and scientific literature, ensuring a broad and inclusive search in multiple disciplines. Search queries were limited to the title, abstract, and keyword fields of each database to maximize relevance and reduce noise from incidental mentions in full-text content.

2.1. Inclusion and Exclusion Criteria

Before screening, we used database filters to retrieve only English articles published from January 2023 to January 2025. This temporal window was chosen because the start date aligns with the emergence of SLMs as a distinct research category, particularly following foundational releases like Llama 2-7B [9] and Mistral-7B [16] that established a de facto benchmark for small models. Limiting the analysis to a two-year time-window ensures a focus on current methodologies without compromising manageability. Next, we applied a single inclusion rule: papers whose main focus is on SLMs with up to 7 billion parameters were retained.
To achieve a more comprehensive overview, we also used the Snowball technique while reviewing the articles. This involved examining the citations within each paper: if a cited paper presented a model developed between January 2023 and January 2025 with 7 billion parameters or fewer, we included it in the review as well. Moreover, throughout the SLR we refer to other relevant papers that we encountered, although they are not included in the analyses.

2.2. Selection Process

An advanced search query submitted to the databases was designed to capture the breadth of the topic while maintaining specificity. The query used is shown in Listing 1.
Listing 1. Database search query used in the SLR for Scopus, IEEE Xplore, Web of Science, and ACM Digital Library.
"small language model*" OR
"tiny language model*" OR
"compact language model*" OR
"lightweight language model*" OR
"efficient language model*" OR
"mobile language model*" OR
"low-resource language model*" OR
"reduced-size language model*" OR
"edge language model*" OR
"compressed language model*"
The selection process for the articles followed the multistage approach described below.
  • Database search: An initial automated filtering step was performed to manage the large volume of records and prepare the corpus for manual review. Using the filtering tools available in each database, we automatically removed non-English papers and those not published between January 2023 and January 2025. The remaining papers from each of the four databases were then imported into a spreadsheet file. A total of 706 records were identified: 335 from Scopus, 121 from IEEE, 115 from ACM Digital Library, and 135 from Web of Science.
  • Duplicate removal: Duplicate records were identified and removed. A total of 381 records were removed before screening.
  • Title and Abstract Screening: The screening of titles and abstracts was conducted manually by two independent reviewers (M.P. and M.L.) without any automation tools to assess each record against the inclusion criteria in Section 2.1. Although more labor-intensive, this consensus-based approach aligns with PRISMA 2020 and minimizes the risk of biases. It also improves the trustworthiness of the review’s findings, especially for novel topics, where AI screeners can miss relevant papers [15,17,18].
    A formal consensus protocol was used to ensure consistency and reliability throughout the study: any discrepancies between the reviewers regarding study eligibility were resolved through a discussion meeting, and if a consensus could not be reached, a third author (F.C.) made the final determination. This initial screening resulted in the exclusion of 270 records, primarily because they were out of scope, published outside the specified date range, or focused on models larger than 7 billion parameters.
  • Full-text Screening and Snowballing: For the articles that passed the initial screening, the full texts were retrieved for a final validation of their eligibility. This step ensured that each study met all inclusion criteria. The same consensus protocol described above was applied to this final assessment. To ensure comprehensive coverage, we then applied the Snowball technique to this set. This involved examining the citation lists to identify additional relevant studies for each included article that met our initial criteria. This supplementary search identified an additional 15 articles, which completes the final selection for the review. Although no additional papers were excluded at this stage, full-text screening was essential to confirm eligibility and to allow comprehensive data extraction for subsequent analysis.
  • Data extraction: To systematically address our research questions, we performed a structured data extraction from each included study. Using a predefined spreadsheet with columns for model name, parameter size, architecture, application domain, benchmarks used, optimization techniques, etc., reviewers manually entered the relevant information during the full-text review stage. This curated dataset was then analyzed using a Python script to generate the statistics and visualizations presented in this review (a minimal, hypothetical sketch of this analysis step is shown after this list). The list of extracted papers is provided in Table A1 and in our Codeberg repository: https://codeberg.org/matteo_unicam/SLR_SLMs/src/branch/main/paper_list.md (accessed on 15 July 2025). The following research questions guide this study:
    (a) What are the types of papers, applications, and public availability related to Small Language Models?
    (b) What are the most prevalent architectures and their associated compression and optimization methods in current research?
    (c) What are the most common benchmarks?
    (d) What are the current challenging areas?
In conclusion, a total of 55 records were included in the review prior to snowballing. Following the snowballing criteria (cited papers presenting models published between January 2023 and January 2025 with 7 billion parameters or fewer), an additional 15 records were included. Hence, 70 records were included in the final review. The complete article selection process is illustrated in Figure 1, and Section 2.3 offers an overview of the selected publications.
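The following minimal sketch illustrates the kind of tabulation such an analysis script performs; it is a hypothetical example (assuming the pandas library is available), and the file name and column names are illustrative placeholders rather than the actual schema of the extraction spreadsheet.
import pandas as pd

# Hypothetical sketch: load the extraction spreadsheet and tabulate the fields analyzed in Sections 4-7.
df = pd.read_csv("slm_extraction.csv")  # placeholder file name

paper_types = df["paper_type"].value_counts(normalize=True)            # method- vs. model-focused shares
architectures = df["architecture"].value_counts()                      # decoder-only, encoder-decoder, ...
benchmarks = df["benchmarks"].str.split(";").explode().value_counts()  # one row per benchmark mention

print(paper_types.round(2), architectures.head(), benchmarks.head(10), sep="\n\n")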

2.3. Paper Overview

The papers were classified into two broad groups, method-focused and model-focused publications, as shown in Table A1. This classification provides a structured overview of the current research landscape, highlighting both the development of new models and the ongoing efforts to optimize and apply existing ones in various fields. These two categories are defined as follows:
  • Model-focused publications: These papers introduce a new SLM or a significant modification to an existing model. They typically include details about the model architecture, the training process, and initial evaluation results.
  • Method-focused publications: Focus on advancing the use and understanding of existing methods through experimental studies, technical innovations, and applications. These articles test and compare model performance on specific tasks, introduce new techniques to enhance efficiency, and explore adaptations for new domains, often involving benchmarking and comparative analysis. They contribute to refining and optimizing SLMs across various domains without introducing new models.

3. Small Language Model Background Overview

This section provides a comprehensive overview of SLMs, laying the foundations for the research questions that follow. Then, it examines how these models are evolving in contemporary research.

3.1. Definition

Although the field has not settled on a hard threshold, small models are generally defined in contrast to their large-scale counterparts, which often exceed one hundred billion parameters; DeepSeekV3 [19], for example, contains 671B total parameters (37B active per token). By comparison, the small-model spectrum spans a few million to around seven billion parameters, with seven-billion-parameter systems emerging as a practical sweet spot balancing computational efficiency with knowledge capacity. Research by Allen-Zhu and Li demonstrates that language models can store approximately 2 bits of knowledge per parameter. This means a 7B model can store roughly 14 billion bits, enough to surpass the combined informational content of English Wikipedia and textbooks [20]. The 7B parameter size gained prominence following Meta’s release of the Llama model family [9], which established 7B as its smallest variant and demonstrated that properly designed smaller models could achieve performance comparable to much larger predecessors. Flagship releases such as Llama 2-7B and Mistral 7B illustrate how insights from neural scaling research, coupled with real-world deployment constraints, are driving this strategic shift toward compact yet capable language models [21,22]. Recent research shows that the 7B scale offers an optimal trade-off between capacity and practicality, with studies showing that such models can reduce errors more efficiently than both smaller and larger counterparts [22]. Furthermore, even at this parameter scale, models can be optimized for advanced capabilities such as extended context processing [23], making them versatile for various applications while remaining deployable on consumer hardware.
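Stated as a worked calculation (following the 2 bits-per-parameter estimate above; the gigabyte conversion is ours):
$7 \times 10^{9}\ \text{parameters} \times 2\ \text{bits/parameter} = 1.4 \times 10^{10}\ \text{bits} = 14\ \text{billion bits} \approx 1.75\ \text{GB of stored knowledge.}$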
Since SLMs are less tolerant to noisy inputs, they perform best when trained on clean, high-quality data. In terms of efficiency, SLMs offer substantial advantages, including reduced computational, memory, and storage requirements, making them ideal for resource-limited settings and low-power devices such as edge devices. They improve privacy, security, and response times, making them suitable for applications requiring localized data handling and minimal inference latency. Training SLMs involves techniques such as quantization-aware training and selective architectural choices, with a focus on simple and comprehensible data. SLMs can be fine-tuned to match or exceed the performance of larger models in specific domains, particularly in edge deployments where computational resources are limited.
LLMs exhibit distinct characteristics and capabilities: models such as Llama-3.1 demonstrate “emergent abilities” once they surpass a certain scale threshold, excelling in dialogue, logical reasoning, and programming tasks compared to smaller models. Despite lacking these emergent abilities, SLMs can achieve comparable performance on domain-specific problems. Both, however, face trustworthiness issues, including the risk of producing hallucinations and privacy breaches [12,13].

3.2. Fundamentals

Like their larger counterparts, SLMs are almost invariably built on the Transformer architecture, and the decoder-only design is the most common nowadays. Because a Transformer processes numbers rather than raw text, the input must first pass through a tokenizer that segments the text into discrete units called tokens, each assigned an integer ID. Depending on the tokenization scheme, a token may represent a character, a sub-word fragment, or an entire word.
An embedding layer converts each token ID into a dense vector, capturing token semantics and enabling reasoning beyond simple symbol matching. As Transformers have no innate sense of word order, positional encodings, in the form of fixed-length vectors, provide information about each token’s absolute or relative position in the sequence. Transformers can be categorized into three types: encoder-only (e.g., BERT), decoder-only (e.g., GPT), and encoder–decoder (e.g., the original Transformer, T5). Encoder stacks refine token representations using self-attention and feedforward blocks, while decoder stacks use masked self-attention for autoregressive text generation and may include cross-attention to draw from the encoder’s output in encoder–decoder configurations. Layer Normalization stabilizes learning by normalizing activations, while residual connections allow gradient flow through deep networks by creating shortcuts between layers.
A final ‘unembedding’ layer inverts the initial embedding step, translating the decoder output vectors into a probability distribution over the vocabulary, enabling coherent text generation. In practice, most of a Transformer’s parameters are in its feedforward layers. However, it is the interaction among all of its components, including tokenization, embedding, attention, and normalization, that allows SLMs to achieve high-level competence within a compact footprint.
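To make this pipeline concrete, the following minimal sketch (assuming PyTorch; the sizes, the learned positional scheme, and the single untrained block are illustrative simplifications, not any specific published model) traces a token sequence from integer IDs to next-token probabilities:
import torch
import torch.nn as nn

vocab_size, d_model, n_heads, seq_len = 32000, 512, 8, 16     # illustrative sizes only

token_ids = torch.randint(0, vocab_size, (1, seq_len))        # stand-in for tokenizer output (integer IDs)

embed = nn.Embedding(vocab_size, d_model)                     # token ID -> dense vector
pos = nn.Embedding(seq_len, d_model)                          # learned absolute positions (simplest scheme)
x = embed(token_ids) + pos(torch.arange(seq_len))             # inject word-order information

# One "decoder-only" block: masked self-attention plus a feedforward network.
block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
h = block(x, src_mask=causal_mask)

unembed = nn.Linear(d_model, vocab_size)                      # 'unembedding': hidden vectors -> vocabulary logits
next_token_probs = unembed(h)[:, -1].softmax(dim=-1)          # distribution over the next token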

3.3. Attention Mechanisms

Self-attention forms the computational core of every Transformer. For each token, the model constructs a query, a key, and a value vector. Then, it measures the affinity between the query for that token and the keys of every other token, producing a set of weights that express contextual relevance. The output representation is a weighted sum of the value vectors, so each token is informed by all others, regardless of their distance in the sequence. This single operation allows the encoder to capture long-range dependencies and build rich, context-sensitive features for the entire sentence.
The decoder applies the same principle with one crucial modification: masked self-attention. Before the attention weights are computed, the positions corresponding to future tokens are hidden, so a decoder step can attend only to itself and to earlier tokens. This look-ahead mask enforces the autoregressive property: each new token is generated solely from the known past by preventing information from leaking into the future. Masked self-attention is therefore indispensable for the decoder-only architectures that dominate current SLM designs (e.g., [9,10,24]).
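A minimal single-head sketch of (masked) scaled dot-product attention, assuming PyTorch; the projection matrices and sizes are illustrative rather than taken from any specific model:
import math
import torch

def self_attention(x, w_q, w_k, w_v, causal=False):
    # x: (seq_len, d_model); single head, no batching, for clarity.
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # per-token query, key, value vectors
    scores = q @ k.T / math.sqrt(k.shape[-1])           # affinity between each query and every key
    if causal:                                          # decoder: hide future positions
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = scores.softmax(dim=-1)                    # contextual relevance weights
    return weights @ v                                  # each output is a weighted sum of value vectors

d_model, seq_len = 64, 10
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
encoder_style = self_attention(x, w_q, w_k, w_v)               # bidirectional self-attention
decoder_style = self_attention(x, w_q, w_k, w_v, causal=True)  # masked (autoregressive) self-attention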
In the encoder–decoder Transformer, each decoder block supplements its masked self-attention with a cross-attention module (also called encoder–decoder attention). Here, queries come from the decoder’s previous layer, while keys and values are taken from the ‘memory’ of the encoder. This allows every position in the emerging output to consult the entire input sequence, anchoring each new token both in the already generated context (via masked self-attention) and in the source sentence (via cross-attention). Decoder-only architectures, common in SLMs, omit this stage because they operate strictly autoregressively, but in settings that require faithful sequence-to-sequence mapping, the cross-attention link is indispensable.
Both types of attention are strengthened by multi-head attention. Instead of computing a single set of attention weights, the model projects queries, keys, and values into a series of lower-dimensional subspaces and performs several attention operations in parallel (eight heads in the original design). Each head can specialize; one might track syntactic structure, while another captures long-range semantic ties, so the concatenated output offers a richer view than any single head could provide. Variants such as multi-query and grouped-query attention preserve most of this representational power while trimming memory traffic and latency.
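A shape-level sketch of the difference between standard multi-head attention and its grouped-query variant, assuming PyTorch; head counts and dimensions are illustrative:
import torch
import torch.nn as nn

d_model, n_heads, n_kv_heads, seq_len = 512, 8, 2, 16   # 8 query heads sharing 2 key/value heads
head_dim = d_model // n_heads
x = torch.randn(1, seq_len, d_model)

# Standard multi-head attention: every head gets its own query, key, and value projections.
q = nn.Linear(d_model, n_heads * head_dim)(x)
k = nn.Linear(d_model, n_heads * head_dim)(x)
v = nn.Linear(d_model, n_heads * head_dim)(x)

# Grouped-query attention: fewer key/value heads, each shared by a group of query heads,
# which shrinks the KV cache and memory traffic at inference (multi-query is the n_kv_heads = 1 case).
k_gqa = nn.Linear(d_model, n_kv_heads * head_dim)(x).view(1, seq_len, n_kv_heads, head_dim)
v_gqa = nn.Linear(d_model, n_kv_heads * head_dim)(x).view(1, seq_len, n_kv_heads, head_dim)
k_gqa = k_gqa.repeat_interleave(n_heads // n_kv_heads, dim=2)  # broadcast shared heads to all query heads
v_gqa = v_gqa.repeat_interleave(n_heads // n_kv_heads, dim=2)  # attention then proceeds as in the standard case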
Selecting and configuring these attention mechanisms is therefore a central design decision in SLM development. The goal is to maximize contextual expressiveness while respecting tight budgets for parameter count, inference speed, and energy consumption.

3.4. Feedforward Network

In each layer of the Transformer (both encoder and decoder), the attention sub-layer is followed by a fully connected feedforward network (FFN) applied independently to each position of the sequence. This is often called a position-wise FFN because each token in a sequence is processed independently using the same weights within a single layer, though each layer in the Transformer has its own distinct feedforward network parameters. The feedforward network typically consists of two linear transformations with a non-linear activation in between. These FFN layers often constitute a substantial portion of a Transformer’s total parameters, and hence their design is a critical factor, especially for SLMs aiming for efficiency. For example, the original model’s FFN expands each 512-dimensional token embedding into a 2048-dimensional hidden layer, applies a rectified linear unit (ReLU) activation, and then projects it back to 512 dimensions.
The feedforward block applies a non-linear transformation to each token, refining its features without blending information across positions, which remains the role of the attention layer. Because every Transformer layer has its own feedforward weights, each layer can learn a distinct transformation. The specific design choices within the FFN influence the trade-off between model size, inference speed, and performance for SLMs, so while the original Transformer used ReLU, modern architectures, including many SLMs, often employ different activation functions for better performance or efficiency.
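A position-wise FFN as described above, sketched in PyTorch with the original 512 → 2048 → 512 dimensions (the class name is ours):
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    # Two linear maps with a non-linearity in between, applied independently at every token position.
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)     # expand: 512 -> 2048
        self.act = nn.ReLU()                       # original choice; many SLMs swap in GELU or SwiGLU (Section 5.1)
        self.down = nn.Linear(d_hidden, d_model)   # project back: 2048 -> 512

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        return self.down(self.act(self.up(x)))     # the same weights are reused at every position

out = PositionWiseFFN()(torch.randn(2, 16, 512))   # shape preserved: (2, 16, 512)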

3.5. Layer Normalization and Residual Connections

To stabilize training and help the model converge, Transformers combine residual connections with Layer Normalization at each sub-layer. After each attention or feedforward sub-layer, the Transformer adds the sub-layer’s input to its output (this is the residual or skip connection) and then applies Layer Normalization to this sum. Standard Layer Normalization (LayerNorm) normalizes the activations across the features (the hidden dimensions) for each data point independently, so that the output of each sub-layer has zero mean and unit variance. This technique [26] was introduced to improve training stability in deep networks by normalizing the summed inputs to the neurons of a layer; in Transformers, it ensures a consistent scale of activations throughout the network, which yields more stable gradients and faster convergence during training. However, there is a visible trend in current language model architectures, most apparent in the evolution of SLMs, toward computationally lighter alternatives. Root Mean Square Normalization (RMSNorm) has emerged as a popular choice, featuring in many recent and influential Small Language Model architectures (e.g., [16,24,25]). RMSNorm simplifies the standard LayerNorm calculation by normalizing activations based only on their root mean square, thus omitting the mean-centering step. Residual connections, for their part, mitigate the vanishing gradient problem and enable the direct passage of information between network layers. Combined, residual connections and Layer Normalization (often RMSNorm in SLMs) allow deep stacks of Transformer layers to be trained: residuals let gradients and information flow through, while normalization stabilizes activations, ensuring that each sub-layer processes well-conditioned representations. This design was instrumental to the success of the Transformer, ensuring that multi-head attention and FFN modules can be stacked arbitrarily, even in compact SLM architectures, without training instability.
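A minimal PyTorch sketch of RMSNorm and of the residual-plus-normalization pattern described above (the epsilon and dimensions are illustrative; published variants differ in detail):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Rescale activations by their root mean square; unlike LayerNorm, no mean is subtracted.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))   # learned per-feature scale
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * x / rms

norm, sublayer = RMSNorm(512), nn.Linear(512, 512)    # the linear layer stands in for attention or the FFN
x = torch.randn(2, 16, 512)
x = norm(x + sublayer(x))                             # residual connection, then normalization (post-norm, as in the original Transformer)
# Many recent SLMs instead normalize before the sub-layer ("pre-norm"): x = x + sublayer(norm(x))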

3.6. Parameter Compression and Reduction in Transformers

As Transformer models grow (often reaching billions of parameters), there is a need for techniques to reduce their memory footprint and computational requirements without drastically sacrificing performance.
This need is particularly important for SLMs, which are designed to be lean and efficient. They must perform well under tight constraints, making them suitable for deployment on edge devices, mobile phones, and other systems where speed is paramount. Trimming size and computational needs is not just an afterthought for SLMs. It is central to how they are designed, trained, and tuned from the start. These optimization approaches are key to striking the right balance between capability and resource use, opening up possibilities for advanced language processing where larger models would not fit. Three widely used approaches for compressing Transformer models are knowledge distillation, pruning, and quantization. Each of these uses different principles of model reduction, including depth compression, width compression, and sparse connectivity, to improve efficiency [27].
  • Knowledge distillation: It is a training technique in which a large model (the teacher) transfers its knowledge to a smaller model (the student). Instead of training the small model from scratch purely on the original data, the student model is trained to mimic the behavior of the teacher. This typically involves having the teacher generate “soft” targets (such as probability distributions over classes or even intermediate representations) and training the student to match those outputs. By learning to approximate the teacher’s outputs (and sometimes its internal feature representations), the student can achieve a level of performance closer to the teacher than it would using the original training data alone. Knowledge distillation has been highly successful for Transformers. For example, DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster, yet it retains 97% of BERT’s language understanding performance [28]. This shows that a well-trained student (DistilBERT) can replicate most of the capabilities of a much larger teacher model (BERT) by leveraging the guidance of the teacher during training. Overall, knowledge distillation is a powerful way to compress models: it produces an entirely new smaller model that learns from the large model, rather than simply chopping or compressing the original weights (a minimal sketch of the soft-target training objective appears at the end of this subsection). This approach can be seen as a form of depth compression, since the student typically has far fewer layers than the teacher while preserving much of its knowledge.
  • Pruning: It removes unnecessary or less important parts of a neural network to reduce its size. In Transformer models, network pruning can operate at various levels of granularity from removing individual weight parameters to removing entire components like attention heads, neurons in the feedforward layers, or even whole Transformer blocks. The idea is to identify components that contribute little to model performance (for example, attention heads that pay mostly redundant attention or weights with near-zero importance) and eliminate them, creating a sparser model. Structured pruning (removing larger components like heads or neurons) has the advantage that it can lead to actual speed-ups and memory savings in practice since entire units are dropped. Unstructured pruning (removing arbitrary individual weights) can greatly reduce the parameter count, though the resulting weight matrices become sparse and may need specialized hardware or libraries to realize computational gains. Either way, pruning leverages the observation that large models often have a lot of redundancy. Researchers have found that Transformers can lose many parameters with only minor drops in accuracy, especially if pruning is combined with a fine-tuning or recovery step to restore performance. In summary, pruning directly eliminates redundant model parameters or structures, yielding a smaller (often much sparser) Transformer model without retraining from scratch. In other words, pruning aligns closely with sparse connectivity, as it removes redundant parameters or layers while preserving the overall network structure.
  • Quantization: It reduces the memory and compute requirements of a Transformer by using lower-precision numerical representations for its parameters (and sometimes for activations). Instead of storing weights as 32-bit floating-point numbers, we might use 16-bit or 8-bit integers, for example. By quantizing a full-precision model to 8-bit, the model’s size can be roughly quartered (since 8-bit is 4× smaller than 32-bit). This directly cuts down on memory usage and can accelerate inference on hardware that supports low-precision arithmetic because operations on smaller integers are faster and more energy-efficient. The challenge is to do this without hurting the model’s accuracy. Lower precision means less numerical range and accuracy for representing weights. Techniques like post-training quantization and quantization-aware training (training the model with quantization in mind) are used to maintain performance. Transformers have been shown to tolerate moderate quantization well; for instance, many BERT-like models can be compressed to 8-bit weights with only minimal loss in downstream task accuracy. In practice, quantization offers a trade-off: a slight reduction in model accuracy for a large gain in efficiency. When carefully applied (sometimes with a small amount of fine-tuning on a calibration dataset), quantization can shrink model size and speed up inference, enabling Transformer models to run on resource-constrained devices or with lower latency.
Each of these compression techniques can be used alone or combined for even greater effect. For example, one might first prune a model and then apply quantization, or distill a large model into a smaller one and then quantize it. Recent research in efficient Transformers often explores such combinations to push model sizes to the limit of what can run on edge devices or within strict latency budgets. The overarching goal is to retain the powerful sequence modeling capabilities of the Transformer while trimming excess complexity that is not needed for a given task, making the models smaller, faster, and more efficient.
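As a concrete illustration of the distillation objective described in this subsection, the following sketch (assuming PyTorch; the temperature, weighting, and random logits are placeholders) blends the usual hard-label loss with a soft-target term that matches the teacher’s temperature-softened distribution:
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # standard rescaling for the temperature
    hard = F.cross_entropy(student_logits, labels)  # ordinary supervised loss on the original labels
    return alpha * soft + (1 - alpha) * hard

batch, vocab = 4, 100                               # toy usage with random logits standing in for model outputs
teacher_logits = torch.randn(batch, vocab)          # frozen teacher predictions
student_logits = torch.randn(batch, vocab, requires_grad=True)
labels = torch.randint(0, vocab, (batch,))
distillation_loss(student_logits, teacher_logits, labels).backward()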

4. RQ1: What Are the Types of Papers, Applications, and Public Availability Related to Small Language Models?

In this section, we take a panoramic view of the SLM literature to answer our first research question. We catalog the literature, separate method-oriented work from model-oriented work, trace domain usage, and note whether each contribution is released to the public. By mapping these dimensions, we can see where scientific attention is concentrated, where practical value is already emerging, and where proprietary barriers still slow collective progress.

4.1. Our Findings in Context

Figure 2 and Figure 3 and Table A1 provide a visual summary of the publication landscape for SLMs. The dataset falls into two main categories of research papers. Method-focused publications, which account for 60% of studies (e.g., [29,30]), predominantly explore techniques for improving the performance of SLMs, including optimization methods, fine-tuning approaches, and compression strategies. In contrast, model-focused publications, which make up about 40% of studies (see, e.g., [9,10,31]), introduce or expand SLM architectures, offering new design principles that push the limits of compactness and efficiency in newly deployed models.
SLMs find applications in diverse sectors, particularly in healthcare, education, and law (e.g., [32,33]). The focus on these domains stems from the need for models that can operate under resource constraints, ensure data privacy, and provide accurate domain-specific knowledge. In addition, question answering (QA) and information retrieval tasks are the most common applications explored in research (e.g., [25,34,35]), underscoring the utility of SLMs to efficiently extract domain-specific knowledge.
The dataset also indicates that universities are the primary contributors to research in this field, accounting for about 67% of studies, while companies contribute approximately 33%. This suggests that academia leads open scientific exploration, while companies engage in research primarily for commercial and proprietary advancements (e.g., [10,36]). The United States and China are the largest contributors to research in this domain, collectively accounting for a sizeable share of studies, although a substantial number of European papers have also been published. A particularly insightful finding from our analysis is that while US-based institutions author 41% of all ‘model-focused’ papers, this contribution is overwhelmingly driven by the corporate sector. This suggests a strategic prioritization within the American industrial landscape towards creating new foundational models, a task that appears to be dominated by commercial laboratories rather than academic institutions. Interestingly, despite this corporate dominance, 88% of these models are released as open-source, indicating a commitment to knowledge sharing and collaborative advancement in the field, even among commercial entities that could traditionally be expected to protect their intellectual property. Public availability remains a critical aspect of SLM research. The dataset reveals that around 72% of the models and methods are publicly available, confirming a trend toward open science. About 28% remain closed-source, proprietary releases such as Gemini [36], underscoring the tension between open-source ideals and commercial priorities.

4.2. Theoretical and Practical Implications

Our findings highlight important theoretical and practical aspects of SLM research. We see a clear separation between work that focuses on methods and work that focuses on new models. This split shows two different paths for development: improving current techniques step by step versus building completely new designs. Both paths can lead to better SLM performance. For example, a method-focused publication [29] presents a hybrid SLM-based framework for SQL that achieves state-of-the-art zero-shot NL2SQL performance. In contrast, model-focused publications, such as the Phi-3 model family [10], demonstrate that carefully trained and aligned small-scale models such as phi-3-mini can match or surpass much larger systems across language, reasoning, and vision benchmarks.
On the practical side, using SLMs in sensitive areas like healthcare and law shows why local deployment matters, and local deployment often means limited resources. Smaller models work well on less powerful hardware, keep data private by staying on site, and often fit more smoothly into existing company or institutional systems. For example, the LAPIS framework [32] shows how SLMs can support law enforcement investigations while preserving confidentiality and efficiency. Magnini et al. propose an open-source SLM designed for personal medical assistant chatbots, demonstrating the importance of compact models in providing accessible and privacy-conscious healthcare [33].
The focus on question answering (QA) and information retrieval (IR) tasks indicates a wide demand in both business and society for automated ways to find specific information. SLMs provide a solution that can be affordable and specialized for particular fields. Gemma [25], as an example, is an open model built using technology from larger systems, giving developers practical tools to create these specific QA and IR applications efficiently. This shows how research breakthroughs are making specialized information tools more accessible.
Open-source models also play a role in making it easier for different groups to work together, learn more about what these models can and cannot do, and establish common ways to measure progress. At the same time, private development led by companies can spur new ideas. If these innovations are eventually shared, they could help the field move forward. There are several definitions of open source. For example, Meta Llama 2 is generally considered “open-source” in the sense that its weights and model code are publicly available for research and most commercial uses [9]. However, it is important to note that it uses a custom license, not a standard Open Source Initiative (OSI)-approved license (like MIT or Apache 2.0). The main restriction is that very large companies (those with over 700 million monthly active users) need to request a specific license from Meta for commercial use. So, while widely accessible and often called open-source, it is not strictly open-source by the purest definition due to the custom license and that specific commercial limitation. For most users and developers, it functions like an open-source model. How these public and private efforts interact will likely shape both the pace at which SLM technology develops and the extent to which it is adopted in practice.

4.3. Synthesis and Takeaways for RQ1

In conclusion to Research Question 1, our study demonstrates that the SLM research field has two general directions. Approximately 60% of the works are dedicated to improving current methodologies, while the other 40% propose new model architectures to enhance efficiency and compactness. Such improvements are particularly crucial in fields such as healthcare, education, and law, where models are required to operate under constrained resources, protect sensitive information, and provide domain-specific accuracy. Most of the research is academic (around 67%), often published openly in favor of broader scientific progress, and about a third is industry-driven, typically with a view towards commercial exploitation. While a solid majority of models and methodologies are open-source, some are closed-source. Generally speaking, the area is moving forward quickly, fueled by a mix of open collaboration and closed innovation.

5. RQ2: What Are the Most Prevalent Architectures and Their Associated Compression and Optimization Methods in Current Research?

This section presents a systematic and comprehensive review of predominant architectures, attention mechanisms, positional encoding, normalization methods, feedforward network configurations, tokenization strategies, and prevalent optimization techniques. It synthesizes the selected literature to elucidate current trends and applications.

5.1. Our Findings in Context

Our SLR reveals distinct patterns and preferences in the architectural design and optimization strategies used for SLMs. As summarized in Figure 4 and Figure 5, parameter counts exhibit significant variability, spanning from extremely compact sizes of less than 1 billion to mid-scale levels such as 7B. Models containing fewer than 1 billion parameters are the most frequently observed in the reviewed literature, accounting for approximately 47% of cases (e.g., the efficiency-focused FLAN-T5-small [37], the domain-specific ProtFlash for protein analysis [38], and the explicitly named TeenyTinyLlama [31]). This implies that many research groups prioritize models small enough to run on consumer-grade graphics processing units (GPUs) or even Central Processing Units (CPUs), thus fulfilling the need for accessibility and deployment in resource-constrained settings. For example, for a chatbot operating on a single GPU with limited VRAM, a 1.1B or 1.5B parameter model might be feasible. If the model must run on a typical laptop CPU, an even smaller variant (e.g., under 1B or in the tens of millions of parameters) becomes essential. Interestingly, 23% of the models were found at the 7B scale, indicating a tendency for some models to remain at the lower extreme of the multi-billion parameter spectrum, aiming for an equilibrium between efficiency and ease of use, perhaps representing a sweet spot for achieving substantial capability while remaining manageable compared to larger LLMs.
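As a rough guide for such deployment decisions, the memory needed just to store the weights can be estimated from the parameter count and numerical precision; the back-of-the-envelope sketch below ignores activations, the KV cache, and runtime overhead:
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    # Approximate memory to hold the model weights alone, in gigabytes.
    return n_params * bytes_per_param / 1e9

for n_params, label in [(1.1e9, "1.1B"), (7e9, "7B")]:
    for bytes_per_param, precision in [(2, "fp16/bf16"), (1, "int8"), (0.5, "4-bit")]:
        print(f"{label} @ {precision}: ~{weight_memory_gb(n_params, bytes_per_param):.1f} GB")
# A 1.1B model needs ~2.2 GB in fp16 but ~0.6 GB at 4-bit; a 7B model drops from ~14 GB to ~3.5 GB.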
Regarding the core architecture, the most commonly used design in the reviewed literature is decoder-only; in fact, 53% of the analyzed models use this architecture. A “decoder-only” Transformer is not truly just a decoder because, without an encoder, there is nothing for a cross-attention mechanism to focus on. Instead, its layers consist of only two parts: self-attention with a causal mask (which ensures the model only looks at previous words) and a feedforward network. This setup is mainly used for tasks like generating text and following instructions. Models like GPT and Chinchilla follow this decoder-only design, and its dominance in our dataset, seen in foundational models like Llama 2 [9], compact high-capability models like Gemini Nano [36], and on-device optimized models like MobileLLM [39], marks its effectiveness for the generative and interactive tasks often targeted by SLMs. These models perform well because they predict text sequentially, creating logical and context-sensitive outputs. The encoder–decoder architecture is another common choice, especially for structured tasks such as translation, answering questions, and searching for information, or specialized tasks such as code understanding (as with CodeT5+ [40]) or formula generation (FLAME [41]). An “encoder–decoder” Transformer is generally the same as the original Transformer, with at most minor architectural improvements such as alternative activation functions or a different placement of normalization (e.g., [42]).
Within these architectures, particular component choices demonstrate a balance between established practices and optimization efforts. Traditional multi-head self-attention remains the predominant choice, appearing in 41% of instances (e.g., in models like Blaze-IT [43], TinyStories [44], and the Phi-3 family [10]). That said, alternatives such as Gated Linear Attention (25% of occurrences), Multi-Query Attention, and Flash Attention seek to reduce computational overhead or memory usage in the attention step. These variants often represent incremental improvements or hardware-specific optimizations over the standard quadratic complexity of self-attention. Flash Attention, for example, relies on GPU-friendly kernels and specialized memory management.
To overcome the quadratic scaling bottleneck of attention, some research investigates fundamentally different strategies in addition to these comparatively common optimizations. The application of linear attention in the Small-E model [42] is a notable illustration from the reviewed literature. Linear attention mechanisms aim to reduce computational complexity with respect to sequence length from quadratic to linear. As demonstrated in Small-E, this efficiency makes linear attention well-suited for handling very long sequences or operating under extreme computational constraints, aligning with the goal of creating efficient SLMs for tasks like speech synthesis. Investigating such alternatives demonstrates the ongoing work to push the limits of attention mechanisms’ efficiency, which is essential for enabling capable SLMs in situations with escalating demands or constrained resources. These variants collectively attest to the ongoing quest for more efficient transformations of token representations, particularly as input contexts grow in length, a crucial consideration even for SLMs when dealing with non-trivial sequence lengths.
Complementing these advances in attention, the positional encoding mechanism is important for attention-based models to capture sequence order. Rotary Positional Embeddings (RoPE) is by far the most common scheme in the surveyed literature, appearing in 70% of the papers (e.g., it is used in Google’s Gemma [25], TinyLlama [24], and TeenyTinyLlama [31]), which confirms its wide adoption. RoPE expands on the idea of rotating vector embeddings to encode positional information, enabling better long-sequence handling and relative position awareness. Other approaches (e.g., standard BPE-based positional indexing, learnable positional encoding, and extended variations like LongRoPE) appear at lower frequencies. Although functional, these might offer less inherent capability for handling very long sequences or might introduce more parameters (in the case of learnable embeddings) compared to the rotational mechanism of RoPE. These methods nonetheless remain flexible positional strategies for small models, with the capacity to handle longer sequences while maintaining minimal overhead.
Alongside positional encoding, normalization layers help stabilize training in deep neural networks, and Transformers typically employ LayerNorm as the default. However, Root Mean Square Normalization (RMSNorm) appears in 66% of surveyed models, overtaking standard LayerNorm. Its adoption by important industry models such as Mistral 7B [16], Llama [45], and Gemma [25] shows its perceived advantages in computational efficiency. Its simplicity compared to the standard LayerNorm accounts for this efficiency: it scales and normalizes activations using their root mean square but crucially skips the centering step (subtracting the mean). When creating SLMs for faster response times or deployment on less powerful hardware, this decrease in computation per layer can result in noticeable speed-ups during both training and inference, and sometimes faster convergence.
After stabilizing the attention output via normalization, the signal passes through the feedforward block. This block, which often accounts for a substantial portion of a Transformer’s parameters, is another site of efficiency gains. The Swish-Gated Linear Unit (SwiGLU) is the clear favorite in the collated dataset, occurring in 57% of instances and used by models such as Qwen2 [46], TinyLlama [24], and Orca 2 [47]. This activation function, a gated linear unit built on the Swish (SiLU) activation and proposed as an alternative to the widely used Gaussian Error Linear Unit (GELU), harnesses gating mechanisms to improve representational efficiency. Through element-wise multiplication with a Swish-activated gating pathway, the network can dynamically regulate the flow of information through the feedforward layer. This additional complexity offers advantages over a straightforward GELU: it is thought to improve the model’s ability to identify intricate patterns, and it may help optimize the dense feedforward component, allowing SLMs to perform well with fewer parameters or layers. GELU-Gated Linear Unit (GeGLU) and other variations also appear in various papers; GeGLU is a notable variant that integrates GELU’s smooth activation curve with a gating mechanism.
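A minimal PyTorch sketch of a SwiGLU feedforward block (bias-free, as in many Llama-style implementations; the hidden dimension and class name are illustrative):
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # Gated feedforward block: a Swish/SiLU-activated gate multiplies a parallel linear pathway.
    def __init__(self, d_model=512, d_hidden=1408):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)   # gating pathway (SiLU-activated)
        self.up = nn.Linear(d_model, d_hidden, bias=False)     # value pathway
        self.down = nn.Linear(d_hidden, d_model, bias=False)   # projection back to the model dimension

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))    # element-wise gating of the value pathway

out = SwiGLUFFN()(torch.randn(2, 16, 512))                     # shape preserved: (2, 16, 512)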
While the feedforward block handles internal representation refinement, the fundamental way language is presented to the model is determined by the tokenization strategy. Tokenization often receives less attention in architectural discussions but can affect vocabulary coverage and model size. The SLR shows that Byte-Pair Encoding (BPE) appears in 45% of cases, reflecting its continued popularity for sub-word segmentation, as seen in diverse models like GPT-wee [48], Small-E for speech synthesis [42], and FLAME [41]. BPE’s sub-word approach offers a good balance. Unlike character-level tokenization, which can create very long sequences, BPE keeps the vocabulary size manageable. This, in turn, reduces the number of parameters in the embedding layer while still allowing the model to handle new words by breaking them into known sub-word units.
WordPiece and SentencePiece approaches also appear in a handful of models, illustrating that language modeling tools inherited from BERT-era and multilingual contexts remain in common use. Collectively, these findings emphasize that, even for compact architectures, well-chosen tokenization has a measurable impact on how effectively the model can represent language with fewer parameters.
A significant dimension of this review is the examination of how researchers reduce model footprints and computational demands without losing essential language capabilities. Across the collected papers, three main approaches stand out: knowledge distillation, quantization, and pruning, with additional references to embedding-layer reduction and more traditional pruning strategies. However, each of these methods presents its own set of research challenges, as summarized in the heatmap in Figure 6.
  • Knowledge distillation: It occurs 43% of the time (e.g., employed specifically to impart mathematical reasoning skills [49,50], refine mathematical expertise from weak supervision [30], transfer complex capabilities like self-evaluation from larger models [51], enhance reasoning in knowledge-intensive tasks [52], empower SLMs with insights from teacher models [53], and implicitly in the ‘teaching’ methodology of Orca 2 [47]), suggesting that transferring knowledge from larger teacher models remains a universally popular approach. In essence, a compact student model is trained to match the output (or soft logits) of a more robust ‘teacher.’ This process produces smaller models that can retain a surprisingly high level of linguistic fluency and comprehension if the teacher covers all tasks thoroughly. The distillation process can preserve much of the teacher’s performance, and the method has become a cornerstone for building smaller specialized models in both the research and the production settings.
  • Quantization: It reduces numerical precision (e.g., from 32-bit floating point to 8-bit or even lower), thereby shrinking the memory footprint and often improving inference speed. It occurs 27% of the time in the dataset (e.g., utilized in methods for spatial data discovery [54], lightweight model calibration [55], and combined with Low-Rank Adaptation (LoRA) for greater efficiency [56]), making it the second most frequent strategy after knowledge distillation. Advances in integer-only arithmetic and mixed-precision strategies have improved the viability of quantized models, but researchers investigating quantization still have to balance trade-offs between accuracy drops and compression ratio. These techniques are of particular interest for on-device applications, where memory and processing limitations are the most severe.
  • Pruning and parameter-efficient fine-tuning: While extremely effective in certain contexts, pruning large Transformer networks can be more complex than in simpler feedforward architectures, particularly if one aims to preserve multi-head attention fidelity (e.g., as investigated by techniques like EFTNAS [57] or within the concept of elastic language models [35]). Unlike pruning, which reduces memory and compute requirements by removing parameters outright, Low-Rank Adaptation (LoRA) takes a different approach by freezing most of the pre-trained model parameters and introducing learnable low-rank update matrices; it occurs 18% of the time within this reviewed set. Although LoRA does not serve as a traditional compression method such as pruning, it achieves efficiency by limiting parameter updates rather than eliminating them during fine-tuning. While originally designed for large models, LoRA has gained traction in the SLM community for adapting models to specific domains. For instance, it has been used in the LQ-LoRA fine-tuning framework [56], for specialized applications like robot navigation (FASTNav [58]), and in mixture-of-task adapters for multitask learning [59] (a minimal sketch of the low-rank update follows this list).
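To make the LoRA idea concrete, the following sketch (assuming PyTorch; the rank, scaling, and wrapper class are illustrative, not a specific published implementation) freezes a pre-trained linear layer and adds a trainable low-rank update:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen base weight W plus a trainable low-rank update (B @ A); only A and B are fine-tuned.
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # only A and B: 2 * 8 * 512 = 8192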

5.2. Theoretical and Practical Implications

There are important implications for the theory and practice of language modeling in the patterns of SLM architectures and optimization methods visible in Figure 7. The importance of efficiency as a primary design driver is arguably the most obvious implication. Models with fewer than one billion parameters are clearly preferred (e.g., [31,37,38]), coupled with the adoption of computationally cheaper components like RMSNorm (e.g., [16,25,45]) and SwiGLU (e.g., [24,46,47]). This shows that resource limitations are a primary factor driving development rather than merely a secondary one, allowing for deployment beyond large-scale computing infrastructure in both theory and practice.
The prevalence of the decoder-only architecture (e.g., [9,36,39]) emphasizes the current practical focus on generative tasks, instruction following, and conversational abilities. This implies a theoretical preference for autoregressive modeling as adequate for a large number of target SLM applications, possibly giving generation fluency precedence over the complex input–output mapping features occasionally connected to encoder–decoder models (e.g., [40,41]). Furthermore, the widespread use of knowledge distillation [30,49,51] shows that researchers are practically leaning on, and even theoretically accepting, the idea of tapping into the power already built into larger models. In other words, many studies transfer the rich, detailed reasoning abilities from big models to smaller ones, allowing these compact systems to perform better than if they were trained from scratch.
In real-world applications, particularly for on-device use, techniques such as quantization (e.g., [54,55,60]) are also essential. Quantization enables SLMs to operate on tiny devices such as microcontrollers [60] or mobile phones (as demonstrated by MobileLLM [39]), by reducing memory requirements. This helps extend sophisticated language capabilities to the very edge of computing, which is especially crucial in situations where fast response, privacy, or offline operation are important.
In addition, the growing popularity of parameter-efficient fine-tuning methods such as LoRA [56,58] highlights the need for models that can be quickly adjusted to specific tasks. LoRA allows researchers to fine-tune SLMs for very targeted applications, either for robotics (as shown in FASTNav [58]) or for improved performance in multitask settings [59] without the high costs of retraining. Combining these techniques, as in the case of LQ-LoRA [56], reflects a smart engineering approach that balances several demands at once.

5.3. Synthesis and Takeaways for RQ2

The most popular designs are decoder-only models, typically ranging from under 1 billion to around 7 billion parameters, which achieve a useful balance between text generation capability and computational efficiency. Components such as RoPE, RMSNorm, and SwiGLU are frequently used in these models because they provide robust performance at reasonable computational cost. Optimization is not an afterthought. Methods like LoRA enable effective, specialized fine-tuning; quantization makes it possible to deploy these models in resource-constrained environments; and knowledge distillation allows smaller models to inherit the capabilities of larger ones. Taken as a whole, these trends suggest that future SLMs will be both efficient and compact, with the aim of being used on a variety of platforms and devices in real-world scenarios.

6. RQ3: What Are the Most Common Benchmarks?

As standardized assessments that gauge progress in natural language comprehension, reasoning, code generation, question answering, and other areas, benchmarks are crucial to the creation and assessment of language models. Natural language processing has been transformed by the rapid development of Large Language Models, so it is critical to monitor model performance against trustworthy benchmarks. The effectiveness of SLMs in reasoning, knowledge retrieval, mathematical problem solving, and other tasks is measured by well-established assessments that have become standard-bearers. Although many benchmarks are available, it remains unclear which are the most widely used, why particular tasks predominate, and how usage changes over time. To address these questions, we offer a thorough examination of benchmark usage based on the extracted data, assembling references to different tasks and assessments from current research. By examining occurrence frequencies, categorizing benchmarks by task domain, and highlighting notable trends, we illuminate the evolving landscape of SLM evaluation and discuss potential limitations of our approach, proposing avenues for future research.

6.1. Our Findings in Context

The extracted data reveal a clear hierarchy in the popularity of benchmarks, reflecting key areas of interest in the development and evaluation of SLMs. Figure 8 illustrates the top 10 most frequently used benchmarks. The most frequently cited benchmark is Massive Multitask Language Understanding (MMLU), which appears in studies such as TinyLlama [24], Gemini [36], and LQ-LoRA [56]. Its prominence, accounting for about 5.45% of mentions, indicates that it is an essential starting point for evaluating the depth of an SLM's expertise across academic and professional fields. It is used both in papers proposing new methodologies and in those introducing new models, suggesting that it serves as a general-purpose measure of core language understanding and knowledge retention, essential for models regardless of their particular architecture or training strategy.
With 14 mentions (4.24%), Grade School Math 8K (GSM8K) is the second most popular benchmark. Its frequent occurrence, especially in papers focused on mathematical reasoning such as those evaluating Phi-3 [10], Qwen2 [46], or techniques to distill mathematical skills [49], highlights the community's strong interest in assessing and improving the step-by-step reasoning abilities of SLMs. This emphasis is in line with initiatives to advance SLMs from basic pattern matching to more intricate problem solving, a feature that is frequently cited as a differentiator for more sophisticated models.
HellaSwag and BBH (Big-Bench Hard) each received 12 mentions (3.64%). While HellaSwag's presence emphasizes the ongoing significance of assessing common-sense reasoning and natural language inference, which are frequently evaluated in general-capability models, BBH's usage suggests an emphasis on difficult multistep reasoning tasks that call for combining various skills. This trend is further supported by benchmarks such as the AI2 Reasoning Challenge (ARC), HumanEval, and WinoGrande, each of which received 11 mentions (3.33%). For example, HumanEval, used in nearly all code-generation papers, is the standard test for programming competence (e.g., CodeT5+ [40], Combining Small and Large LMs for NL2SQL [29]). WinoGrande and ARC both help evaluate reasoning and common-sense skills. A comprehensive list of all identified benchmarks, their associated task types, and their percentage of occurrence is provided in Table 2.
Based on the reviewed papers, the most widely used benchmarks can be roughly categorized into groups that reflect the variety of skills researchers seek to develop and assess in SLMs:
  • Knowledge and reading-comprehension: Benchmarks such as MMLU, Natural Questions (NQ), SQuAD, TriviaQA, and HotpotQA test how well models retrieve, process, and synthesize textual information. These emphasize factual accuracy and contextual reasoning.
  • Reasoning and logic: Tasks such as GSM8K (math word problems), MultiArith, MATH (advanced math), and ARC evaluate arithmetic, logical deductions, or multistep problem solving. Researchers use these to assess whether smaller models can manage complex reasoning pipelines.
  • Common sense and social understanding: Tests like OpenbookQA, PIQA, SIQA, and WinoGrande measure a model's grasp of everyday knowledge and social cues.
  • Code and programming skills: Specialized benchmarks (HumanEval, MBPP, CodeXGLUE) assess code generation, debugging, and analysis. These are critical for applications such as AI-assisted coding tools, where precision and logic are essential.
  • Multitask evaluation: Frameworks like Big-Bench Hard (BBH) and MMLU combine diverse tasks such as language understanding, math, and reasoning into unified benchmarks. They provide a holistic view of model strengths and weaknesses, highlighting capabilities single-task tests might overlook.

6.2. Theoretical and Practical Implications

The field currently views complex mathematical reasoning and broad knowledge recall as crucial litmus tests for SLM capabilities, as evidenced by the heavy emphasis on MMLU and GSM8K. This emphasis probably comes from the need to show that SLMs can match the broad knowledge and reasoning abilities of larger models despite their smaller size, which is essential for their adoption in a variety of applications, such as specialized QA systems or education. The widespread use of GSM8K, in particular, suggests a theoretical interest in comprehending and simulating the cognitive process of sequential reasoning in smaller architectures, which could lead to more reliable complex problem-solving on devices with limited resources.
This theoretical emphasis on reasoning and general knowledge is complemented by a more practical focus in the broader evaluation landscape. The diversity of benchmarks across studies emphasizes how useful it is to assess SLMs on a range of tasks related to their intended uses. For example, the need for reliable code generation is directly addressed by the use of HumanEval in software development papers (e.g., CodeT5+ [40]) or NL2SQL [29]. In the same way, although benchmarks for common sense or domain-specific knowledge are used less often, they reflect the creation of specialized SLMs for fields where context and domain-specific nuances are crucial, such as healthcare [61] or law [32]. This practical diversification of evaluation suggests that a single benchmark score is insufficient; focused testing is necessary to determine fitness-for-purpose.
Additionally, a significant practical shift is indicated by the emerging but growing use of benchmarks pertaining to Responsible AI (RAI) aspects like fairness or toxicity (e.g., ToxiGen or CRASS, cited in articles like Orca 2 [47], Gemma [25], and Making SLMs Better [59]). Evaluation frameworks must go beyond performance alone, as approaches like those in Orca 2 seek to teach more sophisticated reasoning, and as SLMs like Gemma are designed with responsible deployment in mind. This pattern suggests a growing understanding that accuracy alone is not enough for SLM deployment; safety, alignment, and fairness are also necessary, necessitating more complex evaluation procedures. This may create tension between the newer, more targeted benchmarks required to validate these important practical features and the more traditional, general-purpose ones.

6.3. Synthesis and Takeaways for RQ3

Newly proposed or lesser-known benchmarks appear once or twice, reflecting experimentation or very recent developments. For example, EvaluatePlus, AlignBench, and IFEval target increasingly specific aspects of model alignment, interpretability, or fairness, as explored when aligning SLMs via chain-of-thought reasoning (e.g., [46]). Tools such as the RAI measurement framework investigate transparency and accountability. Meanwhile, multi-hop question answering tasks, such as 2WikiMultiHopQA, provide a more in-depth look at chain-of-thought or multistep reasoning. As SLMs evolve, future studies may rely more on these emerging datasets, particularly if they prove adept at capturing new forms of complexity or social risk. Responsible AI evaluation related to bias, toxicity detection, or fairness is starting to increase, as seen with ToxiGen in Orca 2 [47], but it is still less frequently mentioned than more established tests. It is also interesting to note the declining popularity of traditional benchmarks that once dominated natural language processing (NLP) research, and the rise of newer options like SQuADv2 and multi-domain tests such as DROP (e.g., [57,62]). This trend shows a shift in the interest of the community toward more challenging and specialized benchmarks that test deeper reasoning, numerical skills, and a wider range of knowledge.

7. RQ4: What Are the Current Challenging Areas?

Our examination of the gathered data indicates that while SLMs exhibit increasing capabilities and efficiency benefits, they encounter challenges that are currently the focus of ongoing research. We scanned each paper for any explicit statement of a problem related to our five challenge categories. If a paper explicitly described a challenge, it received one vote for that category; votes were then converted into percentages of the total number of papers. The following prevalences were obtained: generalization (25%), data availability (21%), hallucinations (20%), scalability (19%), and interpretability (15%). Understanding these hurdles is crucial for charting the future development and reliable deployment of SLMs.

7.1. Our Findings in Context

  • Generalization: SLMs such as the Llama series [45] offer an attractive mix of speed and frugal resource use. The downside is a more limited storage of world knowledge. This limitation shows up most clearly when the task demands broad factual coverage: on MMLU, for example, SLMs trail models trained on massive book corpora, and their learning curve on common-sense-reasoning suites tends to flatten sooner. A comparable challenge arises when the simplified BERT variants are fine-tuned for individual languages or tasks [43]. Because many subtleties never appear in the pre-training data, the models can misapply learned patterns when they stray into unfamiliar domains. Two common remedies are proving effective. First, knowledge distillation pipelines pass high-level representations from a large teacher network to a lightweight student [53]. Second, targeted data augmentation enlarges the training set precisely where coverage is thin [63]. Together, these strategies broaden the generalization reach of SLMs without inflating their parameter counts.
  • Data availability: Beyond generalization, the most persistent hurdle for SLMs is the shortage of high-quality training material. Due to their limited parameter budgets, SLMs overfit quickly when data are sparse, noisy, or unbalanced, conditions typical of specialist or low-resource domains. To expand the evidence base, researchers are turning to automatic augmentation pipelines [63] and fully synthetic corpora. A flagship example is TinyStories: a collection of child-like narratives written entirely by an LLM that enabled sub-10-million-parameter models to learn fluent story structure, provided the output was rigorously filtered [44]. The same approach, if left unchecked, can introduce artifacts and bias, underscoring the need for robust data pipelines and explicit curation standards, especially in safety-critical arenas such as healthcare [33,64], finance, and multilingual deployments.
  • Hallucinations: Roughly one-fifth of the failure cases reported in recent SLM studies involve the generation of factually incorrect or internally inconsistent statements. The current work tackles the issue on three fronts:
    - Evaluation: New benchmarks and metrics are being designed expressly to surface hallucinations, complementing traditional task scores.
    - Training: Knowledge-distillation pipelines that transfer ‘truthfulness’ signals from a larger teacher have been shown to suppress hallucination rates in the student model.
    - Decoding: Insights into knowledge overshadowing (the dominance of frequent but inaccurate associations) have inspired alternative sampling schemes that prioritize verifiable facts over popularity.
    Although the problem is usually highlighted in large-scale models, smaller architectures are no less vulnerable: their lean embedding spaces and truncated context windows leave them prone to filling gaps with plausible-sounding fabrications. In fact, the stakes increase as SLMs migrate to resource-constrained settings such as mobile handsets or edge servers, where downstream checks may be limited. For example, MobileLLM 350M [39] performs well on specific tasks like API calls, matching the accuracy of the much larger Llama-v2 7B. However, in open-ended conversations or on questions requiring deep knowledge, its limited capacity can lead it to generate plausible but inaccurate responses, a form of hallucination.
    One way to counter the issue is to make the model spell out its reasoning (a minimal chain-of-thought prompting sketch follows this list). For example, after Llama-2 13B was adapted to the chain-of-thought demonstrations of Llama-2-70B, its accuracy on CommonsenseQA improved from 63.4% to 85.9%, substantially reducing unsupported answers [65]. Similarly, in the LM-Guided-CoT framework, a lightweight Flan-T5-Small (80M parameters) that generates rationales enables a frozen Flan-T5-XXL (11B) to raise its F1 score on 2WikiMultiHopQA, with improved performance in various experimental settings [66]. These concrete gains illustrate why reasoning-first prompts, particularly chain-of-thought, are increasingly used to curb hallucinations.
  • Scalability: Another SLM research concern is scalability, namely raw throughput and memory pressure. Scalability means keeping an SLM fast and lightweight enough to stream tokens in real time, remain stable across heterogeneous inputs, and plug into existing back-ends without blowing past latency or memory budgets. The hard limits are tokens per second and the RAM consumed by parameters and activations. Algorithmic efficiency is the present-day solution to the problem. Classic compression tricks, such as structured pruning [57], sub-4-bit quantization [56], and knowledge distillation [49,50,52,53], shrink the footprint without crippling accuracy. Researchers have also found another way to address this through the “compute-optimal inference” approach that we discuss in the next section.
  • Interpretability: Roughly fifteen percent of the shortcomings reported for SLMs concern our limited ability to see why a given answer emerges, an issue that becomes mission-critical in medicine [33], finance, and other regulated settings. In theory, a leaner network should be more transparent; in practice, aggressive compression (quantization [56], structured pruning [57]), gating layers such as SwiGLU, and lightweight tuning modules like LoRA [56] can tangle the internal signal path almost as thoroughly as scale does. Classic probes, attention heat-maps, and gradient-based attributions certainly run faster on SLMs, yet they still leave important causal chains in shadow. Current research therefore follows a dual track. On one track, architectures that enforce modularity from the outset, or adopt inherently legible forms such as certain Mixture-of-Experts variants, make it easier to trace how evidence flows toward a conclusion [10]. On the other, NLP toolkits are being retuned to the quirks of compressed models, while new benchmarks rank systems on the faithfulness of explanations alongside task accuracy [60]. Because SLMs will often run on personal or edge devices, users and domain experts must, in the long term, be able to inspect, audit, and, when needed, correct the model’s reasoning.
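As referenced in the hallucination discussion above, the sketch below illustrates a generic chain-of-thought prompt; the prompt wording and the generate_fn interface are illustrative placeholders rather than the exact setups of the cited studies.

```python
def build_cot_prompt(question: str) -> str:
    """Wrap a question with a worked example so the model emits its reasoning
    before the final answer, which tends to reduce unsupported claims."""
    example = (
        "Q: A pack holds 6 pencils. How many pencils are in 3 packs?\n"
        "Reasoning: Each pack has 6 pencils, and 3 * 6 = 18.\n"
        "Answer: 18\n\n"
    )
    return example + f"Q: {question}\nReasoning:"

def answer_with_cot(question: str, generate_fn) -> str:
    """generate_fn is any text-completion callable (e.g., a small local model)."""
    completion = generate_fn(build_cot_prompt(question))
    # Keep only the text after the final "Answer:" marker, if present.
    return completion.split("Answer:")[-1].strip()

# Toy usage with a stub generator standing in for an SLM.
stub = lambda prompt: "Each box has 4 apples, and 5 * 4 = 20.\nAnswer: 20"
print(answer_with_cot("A box holds 4 apples. How many apples are in 5 boxes?", stub))
```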

7.2. Theoretical and Practical Implications

The obstacles that constrain SLMs rarely surface in isolation; instead, they interact in ways that shape both the science of representation learning and the pragmatics of deployment. On the theoretical front, the balance between parameter count, generalization, data availability, and hallucination control is now a central question in scaling law research. Recent evidence suggests that judicious curricula or architectural innovations can allow a sub-billion-parameter network to match the analytical depth of much larger systems [47]. At the same time, studies of synthetic corpora show that “what” a model sees can matter more than “how much” it sees, an insight that is especially salient in low-resource settings [44].
These trade-offs translate directly into deployment risks. Limited generalization narrows the range of user scenarios that SLMs can serve, while data scarcity slows progress in niche domains. Hallucinations undermine trust and can carry legal or ethical weight in high-stakes applications such as healthcare [33] and law enforcement [32]. Scalability constraints determine whether a model can operate on edge devices or meet stringent latency budgets [39,60,64], and persistent opacity remains a barrier.

7.3. Synthesis and Takeaways for RQ4

SLMs promise efficiency and broad accessibility, but their success will depend on resolving five interlocking challenges highlighted in this survey: generalization, data scarcity, hallucination control, scalability, and interpretability. The prevalence of these challenges among different SLM application areas is further illustrated in Figure 9. Generalization emerges as the most frequently cited obstacle, with data availability and factual reliability (hallucinations) close behind. Scalability and interpretability receive slightly fewer mentions, yet remain indispensable for dependable, real-world deployment.
Research communities are tackling these hurdles on multiple fronts. Architectural advances, such as retrieval-augmented generation [67], mixture-of-experts routing, and related designs, seek to stretch reasoning capacity without inflating parameter counts. Training pipelines combine distillation [50], explanation-oriented objectives [47], and adapter-style fine-tuning (LoRA) to refine performance under tight resource budgets. Parallel efforts in synthetic data production and careful corpus curation [44] mitigate the data bottleneck, while specialized benchmarks and interpretability toolkits sharpen our ability to diagnose progress.
Together, the initiatives cataloged in this review show how researchers are working to give SLMs the robustness, reliability, and transparency needed for real-world impact across many domains. The next section looks beyond our original review window and examines a new wave of papers that were published too recently to be included or that were omitted by stricter inclusion filters. This analysis highlights emerging ideas and trends that directly address the solution strategies being developed for each of the five key challenges identified in this review, illuminating where the field is heading.

8. Future Directions and New Solutions for SLM Challenges

With the key obstacles facing SLMs now mapped in our SLR, we can ask the following question: what are researchers doing about them? To answer this, this section transitions from a systematic review to a targeted prospective analysis. We examined a curated selection of papers published after our cut-off or on pre-print archives, identified through a purposive search for their direct relevance to the challenges we identified and their potential impact. Stepping outside our formal dataset in this structured way lets us see where the momentum is gathering, the strategies being tested, and the directions in which the field is likely to move as it pushes SLM capabilities past the limits identified in the systematic analysis. Table 3 offers a comprehensive, summarized overview, mapping each core challenge to its corresponding solution strategies.

8.1. Enhancing Generalization

The ability of a model to generalize turns out to be less a matter of raw size than of smart design; recent work shows that judicious architectural choices, well-paced curricula, and finely targeted tuning can equip even compact networks with unexpectedly broad skill sets.
  • Architectural and training innovations: Moving beyond vanilla Transformers, researchers are exploring architectural tweaks that help small models generalize. One approach (already used in LLMs) is retrieval-augmented SLMs, which equip the model with a retriever to fetch relevant text from an external corpus (a minimal retrieval-augmented sketch follows this list). By injecting pertinent facts into the context, even a small model can perform competitively on knowledge-intensive tasks, effectively generalizing beyond its fixed parameters [68]. For example, MiniRAG introduces a retrieval-augmented generation system designed for extreme simplicity and efficiency, enabling small models to integrate external knowledge effectively [69].
    Another idea is modular or Mixture-of-Experts architectures, where different parts of the model handle different types of data or sub-tasks, expanding the range of what a single SLM can do. On the training side, multitasking and instruction tuning have shown promise: rather than training on one narrow task, SLMs are being trained or fine-tuned on diverse collections of tasks (spanning Q&A, reasoning, translation, etc.), which improves their zero-shot and few-shot generalization.
    An important insight is that scaling up is not the only route to strong reasoning performance; with the right training regimen, models in the 1–7B range can achieve competitive reasoning ability. Techniques like structured reasoning training (e.g., training on chain-of-thought explanations) or post-training compression (distilling knowledge from a larger model) can yield robust reasoning in SLMs. These findings challenge the notion of an inherent scale threshold for generalization, instead pointing to smarter training as a way forward [70,71].
  • Fine-tuning strategies and knowledge distillation: To bridge the performance gap to LLMs, a common theme is knowledge transfer from larger models. Knowledge distillation has been widely used, where a large teacher model’s outputs (or even its intermediate representations) guide the training of the smaller model. This can improve the small model’s generalization on the teacher’s domain without increasing model size [50]. For instance, the Orca series of models demonstrated that by training on the explanation traces of GPT-4 (not just its final answers), a 13B model could emulate much of the larger model’s reasoning ability (e.g., [21,47]).
    Refining explanations with step-by-step solutions crafted by a teacher aids SLMs in developing reasoning skills and extrapolating to new, unseen problems. Relatedly, fine-tuning techniques involve progressive learning, initially training an SLM on simpler tasks before advancing to more complex ones. Additionally, contrastive tuning enhances the differentiation of embeddings. When these strategies are integrated with data-focused methods (as previously mentioned) and compression, they have the potential to boost efficiency while maintaining performance.
    In summary, a small model that is well-initialized and then “educated” by a larger model’s knowledge and explanations can be generalized in ways that naive training would not achieve. Future research is extending this to new modalities and investigating how far we can push the student–teacher paradigm for SLMs.
  • Domain adaptation and continual learning: Generalization is especially tested when an SLM is deployed in a new domain or evolving environment. Here, lightweight adaptation techniques are crucial. Recent work underscores the effectiveness of adapter modules and LoRA for domain shifts: by inserting small learned layers into a frozen model, SLMs can quickly personalize or specialize to a new domain (e.g., medical text, legal documents) without forgetting their original capabilities. Such adapters have minimal impact on computation but allow the model to capture domain-specific patterns. Moreover, researchers advocate for continual learning approaches, where SLMs incrementally incorporate new data over time instead of relying on one-off fine-tuning [11]. This is particularly relevant for on-device models that can learn from user data. One proposed future direction is meta-learning: training SLMs with algorithms that make them inherently quick learners, so that given a new domain, they adapt in just a few gradient steps [72]. While current SLMs may still show domain-specific overfitting (tuning on a narrow dataset can hurt performance outside that domain), ongoing research is tackling this via regularization and smarter data scheduling. The goal is an SLM that can maintain a core of general knowledge while flexibly absorbing new information when needed, essentially narrowing the generalization gap with larger models. Continued research on cross-domain evaluation benchmarks and transfer learning techniques is expected to ensure SLMs can be reliably deployed across a range of scenarios [68].
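As noted in the first bullet above, retrieval augmentation can be illustrated with a very small sketch; it assumes scikit-learn for a toy TF-IDF retriever and a placeholder generate_fn, whereas real systems such as MiniRAG rely on far more capable retrievers and indexes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "RMSNorm normalizes activations using only the root mean square.",
    "LoRA freezes base weights and learns low-rank update matrices.",
    "Quantization stores weights in 8-bit or lower precision.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query."""
    vec = TfidfVectorizer().fit(corpus + [query])
    doc_m, q_m = vec.transform(corpus), vec.transform([query])
    scores = cosine_similarity(q_m, doc_m)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def rag_answer(query: str, generate_fn) -> str:
    """Inject retrieved passages into the prompt before generation."""
    context = "\n".join(retrieve(query))
    return generate_fn(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# Toy usage with a stub generator standing in for an SLM.
stub = lambda prompt: "It freezes the pre-trained base weights. (stub answer)"
print(rag_answer("What does LoRA freeze?", stub))
```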

8.2. Overcoming Data Scarcity

Because small models often work in environments where data are scarce, current research prioritizes making every example count. The emphasis has shifted from ‘more tokens’ to ‘better tokens’, pairing carefully filtered corpora with creative ways to tap auxiliary sources or synthetic text to stretch limited budgets.
  • Synthetic data generation: A trend is to use large models to generate artificial training examples for smaller models. Synthetic text produced with LLMs has improved the performance of SLMs in low-resource settings by augmenting or substituting real data [73]. For example, TinyStories is a fully synthetic corpus of childlike stories that successfully trains very small (<10 M) models to generate fluent, factual narratives (e.g., [21,44]). Researchers are now calling for standardized pipelines for synthetic data curation, using techniques such as automated filtering and prompt engineering to ensure high-quality, diverse training data (a minimal sketch of such a pipeline follows this list). This interdisciplinary effort, blending generative AI with data management, is viewed as crucial to reliably improving Small Language Model knowledge and robustness.
  • Low-resource adaptation: Another line of work explores making the most of limited data through efficient adaptation methods. Parameter-efficient fine-tuning (e.g., adapters and LoRA modules) allows SLMs to specialize to new domains or languages with minimal training data; for example, with only 1 GB of text plus a small knowledge graph, adapter-based tuning yields gains in language modeling and downstream tasks for low-resource languages. In addition, these small multilingual models (e.g., mBERT, XLM-R) match or outperform the complete fine-tuning of a larger model while updating far fewer parameters [74]. In fact, a well-tuned 1B SLM can compete with massive LLMs (like GPT-4 or Llama) on underrepresented languages or specific tasks. This suggests that aligning model capacity to the available data (rather than simply scaling up) leads to better efficiency. Continued research is exploring meta-learning and transfer learning to further improve adaptation with extremely sparse data, as well as cross-lingual and cross-domain transfer as an interdisciplinary bridge (e.g., leveraging knowledge graphs or linguistic resources alongside text).
  • Interdisciplinary benchmarks: The community has started developing innovative benchmarks to drive data-efficient learning. Some works propose new evaluation tasks that simulate real-world low-resource scenarios or require models to learn from synthetic data. For example, TinyGSM introduced a generated dataset of grade school math problems with solutions, and a 1.3B SLM fine-tuned on it reached 81.5% accuracy, surpassing models 30 times larger and even rivaling its GPT-3.5 data generator. Such results spur the creation of benchmarks that test how well small models learn reasoning or domain knowledge from minimal, machine-generated data [21].
    Interdisciplinary collaboration can also enrich SLM training (e.g., with cognitive science to design curricula like TinyStories, or with curators of the knowledge base to provide factual data). Going forward, researchers emphasize the need for more robust evaluation frameworks and shared datasets to ensure that improvements in Small Language Model training methodologies can be measured consistently.
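A minimal sketch of the synthetic-data loop described in the first bullet of this subsection, assuming a hypothetical teacher_generate callable standing in for a large-model API and a deliberately crude keyword filter; production pipelines such as the one behind TinyStories apply much stricter curation.

```python
import json

def make_prompt(topic: str) -> str:
    return (f"Write a short, simple story about {topic} "
            f"using vocabulary a young child would know.")

def passes_filter(text: str, banned=("violence", "lorem")) -> bool:
    """Crude quality gate: require a non-trivial length and no banned terms."""
    return len(text.split()) > 20 and not any(b in text.lower() for b in banned)

def build_synthetic_corpus(topics, teacher_generate, path="synthetic.jsonl") -> int:
    """Query the teacher model once per topic and keep only filtered outputs."""
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for topic in topics:
            story = teacher_generate(make_prompt(topic))
            if passes_filter(story):
                f.write(json.dumps({"topic": topic, "text": story}) + "\n")
                kept += 1
    return kept

# Toy usage with a stub teacher in place of a real LLM endpoint.
stub_teacher = lambda prompt: "Once upon a time " + "a small fox explored the garden " * 5
print(build_synthetic_corpus(["a lost kitten", "a rainy day"], stub_teacher))
```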

8.3. Mitigating Hallucinations

Fabricated or nonsensical output remains a risk. New studies tackle the problem from two angles: sharper diagnostics that spot hallucinations more reliably, and training or decoding tweaks, ranging from knowledge distillation to reasoning-first prompts, that measurably cut their frequency.
  • Evaluation and benchmarking: A first step is to reliably detect and quantify hallucinations. SLMs can sometimes produce plausible but unverified reasoning steps, leading to subtle factual errors. However, existing fact-checking methods often fail in multistep or generative tasks [68]. To address this, researchers are creating specialized benchmarks and metrics for hallucination. The object hallucination or DiaHalu benchmark, which focuses on dialogue-level hallucination evaluation [75], has been used to measure hallucination rates in model outputs [76]. Another example is OnionEval [77], a multi-layer evaluation framework with a context influence score to assess how varying context lengths impact the factual accuracy of SLMs. This revealed that while SLMs excel at factual recall, they struggle with context-heavy reasoning, often leading to hallucinations. Solutions such as prompting SLMs to reason step by step (e.g., chain-of-thought prompts) have shown marked success in reducing such hallucinations. These trends imply that future research may involve the incorporation of explicit reasoning processes into the design to enhance faithfulness or the integration of models with external knowledge bases or structured databases for real-time fact-checking. This would ensure that SLMs’ assertions are verified against reliable data before a response is finalized.
  • Distillation of knowledge and truthfulness: As mentioned above in the generalization section, knowledge distillation is also effective at tackling hallucinations. In fact, empirical results in 2025 show that distillation can reduce hallucination rates without degrading overall performance [78]. For example, in a summarization task, a 2.5B parameter student model trained on teacher explanations produced summaries with far fewer fabricated details than a conventionally fine-tuned model. In the future, we expect teacher–student frameworks (e.g., large models ‘teaching’ small ones to say ‘I don’t know’ when uncertain) and rationalized training to become key in aligning Small Language Models’ output with truth [79].
  • Knowledge-aware generation strategies: Researchers are also tackling hallucinations by adjusting how SLMs represent and generate knowledge. A 2025 study identified knowledge overshadowing, in which popular or context-dominant facts interfere with recall of less obvious but correct information, as a cause of hallucinations. It formalized a quantitative scaling law, showing that hallucination frequency increases predictably with the logarithm of knowledge popularity, context length, and model size. Based on this analysis, a novel decoding method, called Contrastive Decoding (CODE) [80], was proposed to amplify underrepresented facts during generation. CODE achieved substantial gains in factual accuracy on targeted hallucination benchmarks by preemptively countering the dominance of misleading context. The phenomenon of knowledge overshadowing and methods such as CODE are especially relevant for smaller language models: due to their limited parameter count and reduced representational capacity, SLMs are more susceptible to being dominated by popular or contextually dominant information. These advances point to future research on knowledge-balanced architectures (e.g., retrieval-augmented SLMs or dynamic context weighting) that can proactively minimize hallucinations through a better understanding of what the model does not know.
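For intuition only, the sketch below shows a generic contrastive-decoding step that boosts tokens suppressed by a dominant context; it is not the CODE algorithm of [80], and the two logit vectors, the beta weight, and the toy vocabulary are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_next_token(full_logits: torch.Tensor,
                           dominant_only_logits: torch.Tensor,
                           beta: float = 1.0) -> int:
    """Score tokens by how much their probability rises relative to a pass that
    sees only the dominant (potentially misleading) cue, then pick the best one."""
    log_p_full = F.log_softmax(full_logits, dim=-1)
    log_p_dom = F.log_softmax(dominant_only_logits, dim=-1)
    # Boost tokens that the overshadowing cue suppresses.
    scores = log_p_full + beta * (log_p_full - log_p_dom)
    return int(torch.argmax(scores))

# Toy example over a 5-token vocabulary: token 3 is the correct but
# underrepresented fact; token 1 dominates when only the popular cue is used.
full = torch.tensor([0.1, 2.0, 0.0, 1.8, -1.0])   # logits with the full context
dom = torch.tensor([0.1, 2.5, 0.0, 0.2, -1.0])    # logits with the dominant cue only
print(contrastive_next_token(full, dom))           # -> 3
```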

8.4. Improving Scalability

Advances in pruning, ultra-low-bit quantization, and adaptive inference are putting capable language models on everyday hardware. Meanwhile, smarter training regimes help small networks cross the line between a tiny footprint and high-level performance.
  • Model pruning and quantization: As discussed earlier, these are classic techniques for shrinking model size; quantization, in particular, has advanced to 4-bit or even 2-bit weights for LLMs [81]. Four-bit precision can reduce the memory footprint by around 70% with negligible performance loss, and pushing models to ultra-low bit widths (with careful calibration to avoid accuracy drops) means that much larger architectures become deployable under small-model resource constraints. In the future, we might see hybrid 8-bit/4-bit SLM deployments on edge devices [11]. This would bring powerful language capabilities to smartphones and IoT hardware, representing a convergence of model compression research and hardware-aware design.
  • Compute-optimal inference: A striking development in 2025 is the use of Test-Time Scaling (TTS), where additional computation is used at inference (through techniques such as iterative decoding, majority voting, or tree search) to increase accuracy (a minimal voting sketch follows this list). With an optimal TTS strategy, extremely small models can outperform giants on challenging tasks [82]. For example, experiments found a 1B parameter model solving math problems better than a 405B parameter model when allowed more iterative reasoning and voting at test time. This flip in performance, achieved by trading extra inference cycles for reduced model size, suggests a new research direction: compute-optimal inference.
  • Knowledge distillation and compression: Knowledge distillation can enhance many aspects of SLMs, from generalization to scalability. For example, the Neural-Symbolic Collaborative Distillation (NesyCD) framework decouples general and specialized knowledge by combining neural and symbolic representations. In this framework, a large model teaches an SLM broad reasoning skills, while task-specific rare knowledge is distilled into an explicit, human-readable symbolic knowledge base. By offloading niche facts to this knowledge base and teaching only general skills to the small student model, a Qwen2 7B SLM distilled with this method surpasses OpenAI’s 175B GPT-3.5 on reasoning tasks and nearly matches a 70B Llama-3 model. This underscores that clever training paradigms can compress critical knowledge and strategies from very large models into much smaller ones, improving scalability [83].
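As referenced in the compute-optimal inference bullet above, one simple test-time scaling strategy is self-consistency voting: sample several reasoning paths and keep the majority answer. The sketch below assumes a stochastic generate_fn with a seed parameter as a stand-in for a sampled SLM; it illustrates the idea, not the specific strategies evaluated in [82].

```python
from collections import Counter

def majority_vote_answer(question: str, generate_fn, n_samples: int = 8) -> str:
    """Sample several independent reasoning paths from the same small model
    and return the most frequent final answer (self-consistency voting)."""
    answers = []
    for i in range(n_samples):
        completion = generate_fn(question, seed=i)      # stochastic decoding
        answers.append(completion.split("Answer:")[-1].strip())
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a stub "model" that answers 18 most of the time.
def stub(question, seed=0):
    return "Reasoning: 3 packs of 6 pencils.\nAnswer: " + ("18" if seed % 4 else "21")

print(majority_vote_answer("How many pencils are in 3 packs of 6?", stub))  # -> 18
```

The trade-off is explicit: each extra sample costs an additional forward pass, so the accuracy gained by voting is bought with inference-time compute rather than with more parameters.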

8.5. Advancing Interpretability and Explainability

Progress in understanding SLM decisions follows a twin track: one line of research builds transparency into the model itself through modular layouts, symbolic hybrids, or training constraints; the other refines post hoc tools and benchmarks so that we can probe, visualize, and score what the model is thinking once it is trained.
  • Architectural transparency: One promising direction is to design SLMs that are interpretable by construction. Rather than treating explainability as an afterthought (post hoc probing of a trained model), researchers are inventing training objectives that directly promote modular, disentangled representations in SLMs, for example adding a clusterability loss during training to encourage the model to organize its neurons into semi-independent ‘circuits’ [84]. By penalizing excessive entanglement, the resulting network develops more modular subnetworks that can be analyzed in isolation. In experiments on small Transformers, this approach yielded models that split their computations into distinct clusters, each responsible for different aspects of the task. In practice, SLMs were limited to learning simpler disjoint functions, making it easier to trace input/output pathways. Along similar lines, as discussed in the Generalization section, another promising way to make SLMs more understandable is to build them so that their inner workings are naturally clearer, using a Mixture-of-Experts (MoE) setup; MoE-X, for example, is a model designed specifically for easier interpretation. Each expert is encouraged to focus on its own distinct sub-task without overlap, making it much easier to pinpoint what each expert is doing. As a result, the contribution of each expert to the model output can be inspected, preserving performance while making decision-making clearer and more transparent [85].
  • Explaining and evaluating SLM decisions: In addition to building inherently interpretable models, another active area is developing techniques to explain the output behavior of any given Small Language Model. Many standard explainability tools in NLP (such as saliency maps, attention weight analysis, and counterfactual generation) need to be adapted to the small-model regime. Currently, most benchmarks focus on accuracy (as we saw in the SLR), but we see initial efforts to create explanation-sensitive evaluations. For example, some question answering datasets now require a reference rationale, and models are scored on rationale correctness in addition to the answer. SLMs could particularly benefit from these, as they encourage training for faithful reasoning. Another future direction is to leverage human-in-the-loop evaluations for SLM interpretability: because SLMs can be deployed at scale (for example, on personal devices), end users or domain experts could directly inspect or adjust the model’s reasoning.
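As an example of the post hoc tools mentioned above, the sketch below computes a gradient-times-input saliency score over a toy embedding-plus-linear classifier standing in for an SLM; the model, dimensions, and pooling choice are all illustrative assumptions rather than a method from the reviewed papers.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for a small language model.
vocab, dim, n_classes = 100, 32, 2
embed = nn.Embedding(vocab, dim)
head = nn.Linear(dim, n_classes)

def token_saliency(token_ids: torch.Tensor) -> torch.Tensor:
    """Gradient-times-input saliency: how much each token's embedding
    contributes to the predicted class score."""
    emb = embed(token_ids).detach().requires_grad_(True)
    logits = head(emb.mean(dim=0, keepdim=True))      # mean-pool, then classify
    logits[0, logits.argmax()].backward()             # gradient of the top score
    return (emb.grad * emb).sum(dim=-1).abs()          # one score per input token

scores = token_saliency(torch.tensor([5, 17, 42, 8]))
print(scores)   # higher values = tokens the toy model relied on more
```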
To provide a synthetic overview of how the most common techniques address the identified challenges, Table 4 maps each method to the corresponding issues, emphasizing those most recurrent in the literature.

9. Limitations of the Study

While this systematic review was conducted with methodological rigor following the PRISMA guidelines, several limitations should be considered when interpreting its findings. These limitations define the boundaries of our analysis and offer avenues for future research.

9.1. Scope and Definitional Boundaries

Our review defined SLMs as models with a cap of 7 billion parameters, providing a concrete inclusion criterion to ensure reproducibility. However, the definition of an “SLM” remains fluid and contested, and as the field evolves, this threshold may become outdated, potentially excluding significant models just beyond this boundary. Similarly, our search was limited to articles published between January 2023 and January 2025. Given the exceptionally rapid innovation in language models, important papers appearing after our search cutoff may have been omitted. Although we address emerging trends from more recent studies in Section 8, our quantitative analysis is confined strictly to the specified time frame.

9.2. Search and Selection Bias

Our search strategy focused on four major academic databases that primarily index peer-reviewed literature, which introduces a bias toward formally published work. In fields like computer science and AI, where much cutting-edge research first appears on pre-print platforms such as arXiv, this approach may underrepresent recent innovations that have not yet undergone peer review. Adding to this, we limited our review to articles in English, which excludes a significant body of research published in other languages. Although our results highlight a high volume of research output from institutions in China, studies published in Mandarin and other non-English languages were not included, which reduces the breadth of global perspectives.

9.3. Analysis and Interpretation Limitations

We did not conduct a formal risk-of-bias assessment for each of the 70 included studies. Due to the heterogeneity of paper types, ranging from empirical benchmarks to theoretical proposals, applying a single standardized tool was not feasible. As a result, our analysis is based on the findings reported by the original authors. This limitation is closely related to another source of subjectivity: the thematic categorization of papers. Assigning studies to categories, such as “model” versus “method,” or mapping them to specific applications and challenges, required reviewer judgment. Although we followed a consensus protocol, these classifications remain inherently interpretive.
These limitations, while important, also highlight clear directions for future work. Future reviews could expand the scope to include pre-prints, incorporate non-English literature through multilingual collaboration, or focus on a more homogeneous subset of papers to allow for a formal risk-of-bias assessment.

10. Conclusions

This SLR offers a data-driven analysis of current and future trends in SLMs. Adopting a reproducible methodological approach, we conducted an extensive search of four major databases (Scopus, IEEE Xplore, Web of Science, and ACM Digital Library) to obtain broad and comprehensive coverage of the literature. Inclusion and exclusion criteria were established, limiting the studies to English-language articles published from January 2023 to January 2025, inclusive, and considering models with up to 7 billion parameters. A strict selection process involved preliminary database searches, duplicate removal, title and abstract screening, full-text analysis, and additional application of the snowball method for the detection of pertinent citations. We conducted an SLR centered on four key research questions. Our work maps the SLM ecosystem, clarifies its current state of the art, and uncovers promising directions for future research.

10.1. Divergent Contributions of Academia and Industry

We noticed a distinct division between the types of publications, with approximately 60% addressing methodologies and 40% introducing novel models. This division reflects the dual approach of the research community: refining existing techniques while also crafting new architectures. Universities drive the research (67% of publications), and the strong trend towards open-source availability (72%) indicates a firm commitment to shared scientific progress. While academia leads open scientific exploration, companies contribute approximately 33% of studies, primarily for commercial and proprietary advancements. About 28% of models and methods remain closed-source, reflecting these proprietary advances, such as Gemini [36]. This highlights an ongoing tension between open-source ideals and the private sector’s need for competitive advantage.
Geographically, the United States and China are the largest contributors to research in this domain, together accounting for a significant portion of all studies. A substantial number of papers are published in Europe as well. This research effort has fueled a wide range of real-world applications: SLMs find diverse uses, particularly in healthcare, education, and law. Across these domains, the most common tasks are question answering (QA) and information retrieval (IR), underscoring the utility of SLMs for efficiently extracting domain-specific knowledge.

10.2. Emerging Designs for Efficient Model Architecture

We reported the predominant adoption of decoder-only models (53%), particularly for conversational and generative use cases, with models having fewer than 1 billion parameters being the most common (47%). We also discovered strong component preferences, such as RoPE positional encoding (70%), RMSNorm (66%), and SwiGLU activation functions (57%), which point to industry-wide convergence on efficient design choices.
In terms of specific attention mechanisms, traditional multi-head self-attention remains predominant (41%), but alternatives like Gated Linear Attention (25%) and Multi-Query Attention are gaining traction to reduce computational overhead and memory usage.
Complementing these architectural trends, model compression and fine-tuning strategies further push efficiency. Knowledge distillation sits at the forefront (43%), distilling large teacher models into smaller and faster students. Quantization follows (27%), reducing numeric precision to shrink the model size, while parameter-efficient tuning methods such as LoRA (18%) enable adaptation with minimal additional weights. Together, these practices form a clear plan for building and deploying high-performance, resource-lean language models.

10.3. The Emphasis on Reasoning and Responsible AI in SLM Benchmarks

Our analysis indicated MMLU as the most frequently referenced benchmark (5.45% of citations), followed closely by GSM8K (4.24%). These benchmarks have become the de facto tools for assessing SLMs’ performance.
MMLU’s prominence signifies its role as an essential starting point for evaluating the depth of an SLM’s expertise across a variety of academic and professional fields. It serves as a general purpose measure of core language understanding and knowledge retention, essential regardless of a model’s specific architecture or training strategy. The frequent use of GSM8K highlights a strong interest in improving the step-by-step reasoning abilities of SLMs. This aligns with the broader goal of advancing models from basic pattern matching to intricate problem-solving.
Moreover, other prevalent benchmarks include HellaSwag and Big-Bench Hard (BBH); both of them received 3.64% of mentions. HellaSwag emphasizes commonsense reasoning and natural language inference, while BBH focuses on difficult multistep reasoning tasks that require combining various skills. HumanEval (3.33% of mentions) is crucial for assessing coding proficiency, directly linking SLM evaluation to software development applications. This broad evaluation landscape emphasizes the importance of assessing SLMs across tasks relevant to their intended uses; for example, supporting the deployment of domain-specific SLMs in fields like healthcare or law.
We also observed a trend for more specialized benchmarks for testing Responsible AI domains, such as ToxiGen and CRASS, which indicates a growing sophistication in addressing ethical considerations such as fairness and toxicity detection in addition to performance metrics. This signifies an increasing understanding that accurate SLM deployment is not enough; safety, alignment, and fairness are also necessary, necessitating a more complex evaluation procedure.

10.4. Research Challenges and Clear Solution Pathways

Our analysis revealed that meaningful challenges remain, with generalization issues appearing in 25% of papers, followed by data availability problems (21%), hallucinations (20%), scalability limitations (19%), and interpretability (15%). These problems do not appear in isolation, but compound in complex ways that affect both theoretical research directions and practical considerations.
To address these issues, we expanded our analysis to articles outside the data extracted for the SLR. We report solutions that we found particularly promising (as shown in Table 3). These solutions could serve as a basis for future research directions specifically aimed at tackling the five core challenges facing SLMs today.
  • Generalization: SLMs face limitations in broad factual coverage and can misapply learned patterns in unfamiliar domains. Solutions involve architectural and training innovations, such as retrieval-augmented SLMs that fetch external facts [69], modular or Mixture-of-Experts architectures, and multitask or instruction tuning on diverse tasks. Furthermore, domain adaptation and continual learning techniques like adapter modules and LoRA allow SLMs to quickly specialize to new domains with minimal computational impact, aiming to narrow the generalization gap with larger models.
  • Data Availability: SLMs struggle with sparse, noisy, or unbalanced data due to their limited parameter budgets. Current research prioritizes “better tokens” over “more tokens” by utilizing synthetic data generation from Large Language Models (LLMs) to augment or substitute real data, like the TinyStories corpus [44].
  • Hallucinations: Approximately one-fifth of reported failures in SLM studies involve the generation of factually incorrect or inconsistent statements, particularly in open-ended conversational scenarios or when requiring extensive encyclopedic knowledge. Solutions include creating specialized benchmarks and metrics for hallucination detection to complement traditional task scores. Additionally, knowledge-aware generation strategies like Contrastive Decoding (CODE) [80] amplify underrepresented facts during generation to counteract the dominance of popular but misleading information, which is particularly relevant for SLMs due to their limited capacity.
  • Scalability: This challenge concerns maintaining fast, lightweight, and stable SLM performance under tight latency and memory budgets. Solutions involve model pruning and quantization, with advancements allowing for ultra-low-bit weights (e.g., 4-bit or 2-bit), reducing memory footprint, and enabling deployment on resource-constrained devices like smartphones and IoT hardware. Furthermore, knowledge distillation frameworks like Neural-Symbolic Collaborative Distillation (NesyCD) [83] improve scalability by decoupling general and specialized knowledge, offloading niche facts to external knowledge bases and focusing on teaching general skills to SLMs.
  • Interpretability: Around 15% of reported shortcomings in SLMs relate to the limited ability to understand why a specific answer is generated, which is critical in high-stakes fields like medicine and finance. One promising avenue is to design models with built-in modularity—for example, using Mixture-of-Experts (MoE) architectures like MoE-X [85], where each expert specializes in a distinct sub-task. By ensuring these experts do not overlap in their focus, we can trace the model’s reasoning back to the responsible component.
This SLR contributes to the community by providing a data-driven analysis of the state of the art of SLMs from various perspectives, quantifying challenges, and presenting potential future research directions. Our findings accentuate that SLMs represent not only downsized versions of their larger counterparts, but a distinct research direction with unique challenges, opportunities, and practical applications.

Supplementary Materials

The PRISMA checklist can be downloaded at: https://codeberg.org/matteo_unicam/SLR_SLMs/src/branch/main/PRISMA_CHECKLIST_SLR.pdf, accessed on 15 July 2025.

Author Contributions

Conceptualization M.P.; methodology, M.P. and M.L.; investigation, M.L.; writing—original draft preparation, M.P. and M.L.; writing—review and editing, M.P., M.L., and F.C.; supervision, M.P. and F.C.; funding acquisition, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

Our work was partially funded by the European Union—NextGenerationEU, Mission 4, Component 1, under the Italian Ministry of University and Research (MUR) National Innovation Ecosystem grant ECS00000041-VITALITY-CUP J13C22000430001.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Acknowledgments

Matteo Leonesi’s work was funded by Cosmari SpA.

Conflicts of Interest

Marco Piangerelli was employed by Vici & C S.p.A. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1. List of papers extracted and analyzed, categorized by type of paper.
Paper TitleType
AcKnowledge: A Small Language Model for Open-Domain Question Answering [62]model
Snowballed: TinyStories: How Small Can Language Models Be and Still Speak Coherent English? [44]model
GPT-wee: How Small Can a Small Language Model Really Get? [48]model
FLAME: A Small Language Model for Spreadsheet Formulas [41]model
Exploring Transformers as Compact, Data-efficient Language Models [86]model
Expanding the Vocabulary of BERT for Knowledge Base Construction [87]model
On Elastic Language Models [35]model
Snowballed: Gemma: Open models based on gemini research and technology [25]model
Snowballed: Phi-3 technical report: A highly capable language model locally on your phone [10]model
Development of Language Models for Continuous Uzbek Speech Recognition System [88]model
Snowballed: MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases [39]model
Snowballed: Qwen2 technical report [46]model
LAPIS: Language Model-Augmented Police Investigation System [32]model
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis [42]model
Snowballed: CodeT5+: Open Code Large Language Models for Code Understanding and Generation [40]model
Deciphering the protein landscape with ProtFlash, a lightweight language model [38]model
Snowballed: TinyLlama: An Open-Source Small Language Model [24]model
Snowballed: Mistral 7B [16]model
Snowballed: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning [37]model
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese [31]model
Snowballed: The Surprising Power of Small Language Models [89]model
Snowballed: Orca 2: Teaching Small Language Models How to Reason [47]model
Snowballed: Llama 2: Open Foundation and Fine-Tuned Chat Models [9]model
Snowballed: The Falcon Series of Open Language Models [90]model
Snowballed: Gemini: A Family of Highly Capable Multimodal Models [36]model
Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model [91]model
Snowballed: Llama: Open and Efficient Foundation Language Models [45]model
Blaze-IT and Flare-IT: Lightweight BERT Models for Italian [43]model
Towards Pareto Optimal Throughput in Small Language Model Serving [92]method
Leveraging Small Language Models for Text2SPARQL tasks to improve the resilience of AI assistance [93]method
LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses [55]method
Enhancing Small Language Models via ChatGPT and Dataset Augmentation [63]method
LQ-LORA: Low-Rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning [56]method
Making Small Language Models Better Multi-task Learners with Mixture-of-Task-Adapters [59]method
Modeling Overregularization in Children with SLMsmethod
Open-source Small Language Models for personal medical assistant chatbots [61]method
Towards a Small Language Models powered chain-of-reasoning for open-domain question answering [94]method
Tiny Language Models Enriched with Multimodal Knowledge from Multiplex Networks [95]method
Test-Time Self-Adaptive Small Language Models for Question Answeringmethod
Spatial Data Discovery Using Small Language Model [96]method
Specializing Small Language Models Towards Complex Style Transfer via Latent Attribute Pre-Training [97] | method
Teaching Small Language Models to Reason for Knowledge-Intensive Multi-Hop Question Answering [98] | method
Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models [51] | method
Open-Source or Proprietary Language Models? An Initial Comparison on the Assessment of an Educational Task [99] | method
FASTNav: Fine-tuned Adaptive Small-language-models Trained for Multi-point Robot Navigation [58] | method
Hybrid Small Language Models and LLM for Edge-Cloud Collaborative Inference [100] | method
Aligning Large and Small Language Models via Chain-of-Thought Reasoning [65] | method
An emulator for fine-tuning Large Language Models using Small Language Models [101] | method
BLU-SynTra: Synergies and Trade-offs Between SDGs Using Small Language Models [102] | method
Can Small Language Models be Good Reasoners for Sequential Recommendation? [103] | method
Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought [66] | method
Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science? [67] | method
Chain-of-Thought in Neural Code Generation: From and for Lightweight Language Models [104] | method
COCONUT: Contextualized Commonsense Unified Transformers for Graph-Based Commonsense Augmentation of Language Models [105] | method
Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL [29] | method
Could Small Language Models Serve as Recommenders? Towards Data-centric Cold-start Recommendation [106] | method
Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers [60] | method
Teaching Small Language Models to Reason [72] | method
Distilling Mathematical Reasoning Capabilities into Small Language Models [49] | method
Distilling Multi-Step Reasoning Capabilities into Smaller Language Model [50] | method
Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights [53] | method
EFTNAS: Searching for Efficient Language Models in First-Order Weight-Reordered Super-Networks [57] | method
Enhancing SLMs via ChatGPT and Dataset Augmentation [63] | method
Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting [64] | method
Exploring Domain Robust Lightweight Reward Models based on Router Mechanism [34] | method
ZARA: Improving Few-Shot Self-Rationalization for Small Language Models [107] | method
FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models [108] | method
From Complex to Simple: Unraveling the Cognitive Tree for Reasoning with Small Language Models [109] | method
From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision [30] | method
Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks [52] | method
Small Language Model Can Self-correct [110] | method
Table A2. List of abbreviations used throughout this paper.
Abbreviation | Full Form
7B | 7 billion parameters
AI | Artificial Intelligence
API | Application Programming Interface
BBH | Big-Bench Hard
BERT | Bidirectional Encoder Representations from Transformers
Code Gen. | Code Generation
Content Gen. | Content Generation
CPU | Central Processing Unit
FFN | Feedforward Network
GELU | Gaussian Error Linear Unit
GeGLU | GELU–Gated Linear Unit
GPT | Generative Pre-trained Transformer
GPU | Graphics Processing Unit
GSM8K | Grade School Math 8K
IR | Information Retrieval
IoT | Internet of Things
LayerNorm | Layer Normalization
LLMs | Large Language Models
LoRA | Low-Rank Adaptation
MMLU | Massive Multitask Language Understanding
MoE | Mixture-of-Experts
NesyCD | Neural-Symbolic Collaborative Distillation
NL2SQL | Natural Language to Structured Query Language
NLP | Natural Language Processing
PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QA | Question Answering
QA/IR/Class. | Question Answering/Information Retrieval/Classification
RAI | Responsible AI
Reason/Math | Reasoning and Mathematics
ReLU | Rectified Linear Unit
RMSNorm | Root Mean Square Normalization
RoPE | Rotary Positional Embeddings
SLR | Systematic Literature Review
SQL | Structured Query Language
SQuAD | Reading-comprehension Question Answering
SQuAD-IT | Italian Reading-comprehension Question Answering
SQuADv2 | Reading-comprehension Question Answering
SuperGLUE | Advanced Natural Language Understanding benchmark
SVAMP | Algebraic Word Problems (reasoning)
SwiGLU | Swish-Gated Linear Unit
T5 | Text-to-Text Transfer Transformer
TriviaQA | Open-domain trivia Question Answering
TTS | Test-Time Scaling
UK | United Kingdom
USA | United States of America
VRAM | Video Random Access Memory
WinoGrande | Coreference commonsense reasoning

References

  1. Brants, T.; Popat, A.C.; Xu, P.; Och, F.J.; Dean, J. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL); Eisner, J., Ed.; Association for Computational Linguistics: Prague, Czech Republic, 2007; pp. 858–867. [Google Scholar]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [PubMed]
  3. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://api.semanticscholar.org/CorpusID:49313245 (accessed on 15 July 2025).
  4. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar] [CrossRef]
  5. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2024, arXiv:2402.06196. [Google Scholar] [PubMed]
  6. Hadi, M.U.; Qureshi, R.; Shah, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; Mirjalili, S.; Shah, M.; et al. A Survey on Large Language Models: Applications, Challenges, Limitations, and Practical Usage. Authorea Prepr. 2023, 1, 1–26. [Google Scholar] [CrossRef]
  7. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 3641289. [Google Scholar] [CrossRef]
  8. Schick, T.; Schütze, H. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; pp. 2339–2352. [Google Scholar] [CrossRef]
  9. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  10. Abdin, M.; Aneja, J.; Behl, H.; Bubeck, S.; Eldan, R.; Gunasekar, S.; Harrison, M.; Hewett, R.J.; Javaheripi, M.; Kauffmann, P.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar] [CrossRef]
  11. Lu, Z.; Li, X.; Cai, D.; Yi, R.; Liu, F.; Zhang, X.; Lane, N.D.; Xu, M. Small Language Models: Survey, Measurements, and Insights. arXiv 2024, arXiv:2409.15790. [Google Scholar] [CrossRef]
  12. Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. arXiv 2024, arXiv:2411.03350. [Google Scholar] [CrossRef]
  13. Nguyen, C.V.; Shen, X.; Aponte, R.; Xia, Y.; Basu, S.; Hu, Z.; Chen, J.; Parmar, M.; Kunapuli, S.; Barrow, J.; et al. A Survey of Small Language Models. arXiv 2024, arXiv:2410.20011. [Google Scholar] [PubMed]
  14. Garg, M.; Raza, S.; Rayana, S.; Liu, X.; Sohn, S. The Rise of Small Language Models in Healthcare: A Comprehensive Survey. arXiv 2025, arXiv:2504.17119. [Google Scholar] [CrossRef]
  15. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71. Available online: https://www.bmj.com/content/372/bmj.n71.full.pdf (accessed on 15 July 2025). [CrossRef] [PubMed]
  16. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
  17. Lieberum, J.L.; Toews, M.; Metzendorf, M.I.; Heilmeyer, F.; Siemens, W.; Haverkamp, C.; Böhringer, D.; Meerpohl, J.J.; Eisele-Metzger, A. Large language models for conducting systematic reviews: On the rise, but not yet ready for use—A scoping review. J. Clin. Epidemiol. 2025, 181, 111746. [Google Scholar] [CrossRef] [PubMed]
  18. Gartlehner, G.; Affengruber, L.; Titscher, V.; Noel-Storr, A.; Dooley, G.; Ballarini, N.; König, F. Single-reviewer abstract screening missed 13 percent of relevant studies: A crowd-based, randomized controlled trial. J. Clin. Epidemiol. 2020, 121, 20–28. [Google Scholar] [CrossRef] [PubMed]
  19. DeepSeek Team. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar]
  20. Allen-Zhu, Z.; Li, Y. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. arXiv 2024, arXiv:2404.05405. [Google Scholar] [CrossRef]
  21. Subramanian, S.; Elango, V.; Gungor, M. Small Language Models (SLMs) Can Still Pack a Punch: A survey. arXiv 2025, arXiv:2501.05465. [Google Scholar]
  22. Li, C.; Wang, W.; Hu, J.; Wei, Y.; Zheng, N.; Hu, H.; Zhang, Z.; Peng, H. Common 7B Language Models Already Possess Strong Math Capabilities. arXiv 2024, arXiv:2403.04706. [Google Scholar] [CrossRef]
  23. Wu, C.; Song, Y. Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing. arXiv 2025, arXiv:2505.08651. [Google Scholar] [CrossRef]
  24. Zhang, P.; Zeng, G.; Wang, T.; Lu, W. TinyLlama: An Open-Source Small Language Model. arXiv 2024, arXiv:2401.02385. [Google Scholar]
  25. Gemini Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295. [Google Scholar] [CrossRef]
  26. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  27. Tang, Y.; Wang, Y.; Guo, J.; Tu, Z.; Han, K.; Hu, H.; Tao, D. A Survey on Transformer Compression. arXiv 2024, arXiv:2402.05964. [Google Scholar] [CrossRef]
  28. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
  29. Fan, J.; Gu, Z.; Zhang, S.; Zhang, Y.; Chen, Z.; Cao, L.; Li, G.; Madden, S.; Du, X.; Tang, N. Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL. Proc. VLDB Endow. 2024, 17, 2750–2763. [Google Scholar] [CrossRef]
  30. Lin, Q.; Xu, B.; Huang, Z.; Cai, R. From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision. In Advanced Intelligent Computing Technology and Applications; Springer Nature: Singapore, 2024; pp. 251–262. [Google Scholar] [CrossRef]
  31. Corrêa, N.K.; Falk, S.; Fatimah, S.; Sen, A.; De Oliveira, N. TeenyTinyLlama: Open-source tiny language models trained in Brazilian Portuguese. Mach. Learn. Appl. 2024, 16, 100558. [Google Scholar] [CrossRef]
  32. Kim, H.; Kim, D.; Lee, J.; Yoon, C.; Choi, D.; Gim, M.; Kang, J. LAPIS: Language Model-Augmented Police Investigation System. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, New York, NY, USA, 21–25 October 2024; CIKM ’24. pp. 4637–4644. [Google Scholar] [CrossRef]
  33. Magnini, M.; Aguzzi, G.; Montagna, S. Open-source small language models for personal medical assistant chatbots. Intell.-Based Med. 2025, 11, 100197. [Google Scholar] [CrossRef]
  34. Namgoong, H.; Jung, J.; Jung, S.; Roh, Y. Exploring Domain Robust Lightweight Reward Models based on Router Mechanism. In Proceedings of the Findings of the Association for Computational Linguistics: ACL; ACL: Bangkok, Thailand, 2024; pp. 8644–8652. [Google Scholar] [CrossRef]
  35. Zhang, C.; Wang, B.; Song, D. On Elastic Language Models. ACM Trans. Inf. Syst. 2024, 42, 3677375. [Google Scholar] [CrossRef]
36. Gemini Team. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2024, arXiv:2312.11805. [Google Scholar]
  37. Longpre, S.; Hou, L.; Vu, T.; Webson, A.; Chung, H.W.; Tay, Y.; Zhou, D.; Le, Q.V.; Zoph, B.; Wei, J.; et al. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv 2023, arXiv:2301.13688. [Google Scholar] [CrossRef]
  38. Wang, L.; Zhang, H.; Xu, W.; Xue, Z.; Wang, Y. Deciphering the protein landscape with ProtFlash, a lightweight language model. Cell Rep. Phys. Sci. 2023, 4, 101600. [Google Scholar] [CrossRef]
  39. Liu, Z.; Zhao, C.; Iandola, F.; Lai, C.; Tian, Y.; Fedorov, I.; Xiong, Y.; Chang, E.; Shi, Y.; Krishnamoorthi, R.; et al. MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. arXiv 2024, arXiv:2402.14905. [Google Scholar]
  40. Wang, Y.; Le, H.; Gotmare, A.D.; Bui, N.D.Q.; Li, J.; Hoi, S.C.H. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. arXiv 2023, arXiv:2305.07922. [Google Scholar]
  41. Joshi, H.; Ebenezer, A.; Sanchez, J.C.; Gulwani, S.; Kanade, A.; Le, V.; Radiček, I.; Verbruggen, G. FLAME: A small language model for spreadsheet formulas. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. AAAI’24/IAAI’24/EAAI’24. [Google Scholar] [CrossRef]
  42. Lemerle, T.; Obin, N.; Roebel, A. Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Kos Island, Greece, 1–5 September 2024; pp. 3420–3424. [Google Scholar] [CrossRef]
  43. Russo, F.; Filannino, M. Blaze-IT: A lightweight BERT model for the Italian language. In Proceedings of the CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Venice, Italy, 30 November–2 December 2023; Volume 3596. [Google Scholar]
  44. Eldan, R.; Li, Y. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv 2023, arXiv:2305.07759. [Google Scholar] [CrossRef]
  45. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  46. Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
  47. Mitra, A.; Corro, L.D.; Mahajan, S.; Codas, A.; Simoes, C.; Agarwal, S.; Chen, X.; Razdaibiedina, A.; Jones, E.; Aggarwal, K.; et al. Orca 2: Teaching Small Language Models How to Reason. arXiv 2023, arXiv:2311.11045. [Google Scholar] [CrossRef]
  48. Bunzeck, B.; Zarrieß, S. GPT-wee: How Small Can a Small Language Model Really Get? In Proceedings of the CoNLL 2023-BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore, 6 December 2023; pp. 35–46. [Google Scholar]
  49. Zhu, X.; Li, J.; Liu, Y.; Ma, C.; Wang, W. Distilling mathematical reasoning capabilities into Small Language Models. Neural Netw. 2024, 179, 106594. [Google Scholar] [CrossRef] [PubMed]
  50. Yim, Y.; Wang, Z. Distilling Multi-Step Reasoning Capabilities into Smaller Language Model. In Proceedings of the ACM International Conference Proceeding Series; ACM: New York, NY, USA, 2024; pp. 530–535. [Google Scholar] [CrossRef]
  51. Liu, W.; Li, G.; Zhang, K.; Du, B.; Chen, Q.; Hu, X.; Xu, H.; Chen, J.; Wu, J. Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024, Mexico City, Mexico, 16–21 June 2024; Volume 1, pp. 6748–6763. [Google Scholar] [CrossRef]
  52. Kang, M.; Lee, S.; Baek, J.; Kawaguchi, K.; Hwang, S.J. Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks. Adv. Neural Inf. Process. Syst. 2023, 36, 48573–48602. [Google Scholar]
  53. Ballout, M.; Krumnack, U.; Heidemann, G.; Kühnberger, K.U. Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2024; Volume 14762 LNCS, pp. 32–46. [Google Scholar] [CrossRef]
  54. Thakur, S.K.; Tyagi, N. Spatial Data Discovery Using Small Language Model. In Proceedings of the 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT), Uttar Pradesh, India, 9–10 February 2024; Volume 5, pp. 899–905. [Google Scholar]
55. Liu, X.; Khalifa, M.; Wang, L. LitCab: Lightweight Language Model Calibration over Short- and Long-Form Responses. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  56. Guo, H.; Greengard, P.; Xing, E.P.; Kim, Y. LQ-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  57. Munoz, J.P.; Zheng, Y.; Jain, N. EFTNAS: Searching for Efficient Language Models in First-Order Weight-Reordered Super-Networks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May 2024; Calzolari, N., Kan, M.Y., Hoste, V., Lenci, A., Sakti, S., Xue, N., Eds.; pp. 5596–5608. [Google Scholar]
58. Chen, Y.; Han, Y.; Li, X. FASTNav: Fine-Tuned Adaptive Small-Language-Models Trained for Multi-Point Robot Navigation. IEEE Robot. Autom. Lett. 2025, 10, 390–397. [Google Scholar] [CrossRef]
59. Xie, Y.; Wang, C.; Yan, J.; Zhou, J.; Deng, F.; Huang, J. Making Small Language Models Better Multi-Task Learners with Mixture-of-Task-Adapters. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM 2024), Yucatán, Mexico, 4–8 March 2024; pp. 1094–1097. [Google Scholar] [CrossRef]
  60. Scherer, M.; Macan, L.; Jung, V.J.B.; Wiese, P.; Bompani, L.; Burrello, A.; Conti, F.; Benini, L. Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 4009–4020. [Google Scholar] [CrossRef]
  61. Haga, A.; Sugawara, S.; Fukatsu, A.; Oba, M.; Ouchi, H.; Watanabe, T.; Oseki, Y. Modeling Overregularization in Children with Small Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; ACL: Bangkok, Thailand, 2024; pp. 14532–14550. [Google Scholar] [CrossRef]
62. Das, S.; Chatterji, S.; Mukherjee, I. AcKnowledge: Acquired Knowledge Representation by Small Language Model without Pre-training. In Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024); Association for Computational Linguistics: Bangkok, Thailand, 2024. [Google Scholar]
  63. Pieper, T.; Ballout, M.; Krumnack, U.; Heidemann, G.; Kühnberger, K.U. Enhancing Small Language Models via ChatGPT and Dataset Augmentation. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2024; Volume 14763 LNCS, pp. 269–279. [Google Scholar] [CrossRef]
  64. Zhang, X.; Li, S.; Yang, X.; Tian, C.; Qin, Y.; Petzold, L.R. Enhancing small medical learners with privacy-preserving contextual prompting. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  65. Ranaldi, L.; Freitas, A. Aligning Large and Small Language Models via Chain-of-Thought Reasoning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; Graham, Y., Purver, M., Eds.; pp. 1812–1827. [Google Scholar]
  66. Lee, J.; Yang, F.; Tran, T.; Hu, Q.; Barut, E.; Chang, K.W.; Su, C. Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024-Main Conference Proceedings, Torino, Italy, 20–25 May 2024; pp. 2835–2843. [Google Scholar]
  67. Liu, S.; Yu, Z.; Huang, F.; Bulbulia, Y.; Bergen, A.; Liut, M. Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science? In Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1, New York, NY, USA, 5–10 July 2024; ITiCSE 2024. pp. 388–393. [Google Scholar] [CrossRef]
  68. Patil, A. Advancing Reasoning in Large Language Models: Promising Methods and Approaches. arXiv 2025, arXiv:2502.03671. [Google Scholar]
  69. Fan, T.; Wang, J.; Ren, X.; Huang, C. MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation. arXiv 2025, arXiv:2501.06713. [Google Scholar]
  70. Srivastava, G.; Cao, S.; Wang, X. Towards Reasoning Ability of Small Language Models. arXiv 2025, arXiv:2502.11569. [Google Scholar]
  71. Hsieh, C.Y.; Li, C.L.; Yeh, C.K.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.Y.; Pfister, T. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv 2023, arXiv:2305.02301. [Google Scholar]
  72. Magister, L.C.; Mallinson, J.; Adamek, J.; Malmi, E.; Severyn, A. Teaching Small Language Models to Reason. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 2, pp. 1773–1781. [Google Scholar] [CrossRef]
  73. Nadas, M.; Diosan, L.; Tomescu, A. Synthetic Data Generation Using Large Language Models: Advances in Text and Code. arXiv 2025, arXiv:2503.14023. [Google Scholar]
  74. Gurgurov, D.; Vykopal, I.; van Genabith, J.; Ostermann, S. Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages. arXiv 2025, arXiv:2502.10140. [Google Scholar]
  75. Li, J.; Cheng, X.; Zhao, W.X.; Nie, J.Y.; Wen, J.R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. arXiv 2023, arXiv:2305.11747. [Google Scholar]
  76. Yang, Z.; Luo, X.; Han, D.; Xu, Y.; Li, D. Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key. arXiv 2025, arXiv:2501.09695. [Google Scholar]
  77. Sun, C.; Li, Y.; Wu, D.; Boulet, B. OnionEval: An Unified Evaluation of Fact-conflicting Hallucination for Small-Large Language Models. arXiv 2025, arXiv:2501.12975. [Google Scholar]
  78. Nguyen, H.; He, Z.; Gandre, S.A.; Pasupulety, U.; Shivakumar, S.K.; Lerman, K. Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation. arXiv 2025, arXiv:2502.11306. [Google Scholar]
  79. Lewis, A.; White, M.; Liu, J.; Koike-Akino, T.; Parsons, K.; Wang, Y. Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents. arXiv 2025, arXiv:2502.19545. [Google Scholar]
  80. Kim, J.; Kim, H.; Kim, Y.; Ro, Y.M. CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models. arXiv 2024, arXiv:2406.01920. [Google Scholar]
  81. Giagnorio, A.; Mastropaolo, A.; Afrin, S.; Penta, M.D.; Bavota, G. Quantizing Large Language Models for Code Generation: A Differentiated Replication. arXiv 2025, arXiv:2503.07103. [Google Scholar]
  82. Liu, R.; Gao, J.; Zhao, J.; Zhang, K.; Li, X.; Qi, B.; Ouyang, W.; Zhou, B. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling. arXiv 2025, arXiv:2502.06703. [Google Scholar]
  83. Liao, H.; He, S.; Xu, Y.; Zhang, Y.; Liu, K.; Zhao, J. Neural-Symbolic Collaborative Distillation: Advancing Small Language Models for Complex Reasoning Tasks. arXiv 2025, arXiv:2409.13203. [Google Scholar] [CrossRef]
  84. Golechha, S.; Chaudhary, M.; Velja, J.; Abate, A.; Schoots, N. Modular Training of Neural Networks aids Interpretability. arXiv 2025, arXiv:2502.02470. [Google Scholar]
  85. Yang, X.; Venhoff, C.; Khakzar, A.; de Witt, C.S.; Dokania, P.K.; Bibi, A.; Torr, P. Mixture of Experts Made Intrinsically Interpretable. arXiv 2025, arXiv:2503.07639. [Google Scholar]
  86. Fields, C.; Kennington, C. Exploring Transformers as Compact, Data-efficient Language Models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), Singapore, 6–7 December 2023; Jiang, J., Reitter, D., Deng, S., Eds.; pp. 521–531. [Google Scholar] [CrossRef]
  87. Yang, D.; Wang, X.; Celebi, R. Expanding the Vocabulary of BERT for Knowledge Base Construction. In Proceedings of the LM-KBC’23: Knowledge Base Construction from Pre-trained Language Models, Challenge at ISWC 2023, Athens, Greece, 6–10 November 2023; Volume 3577. [Google Scholar]
  88. Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of Language Models for Continuous Uzbek Speech Recognition System. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef] [PubMed]
89. Bhosale, M.; Microsoft Research. Phi-2: The Surprising Power of Small Language Models. Microsoft Res. Blog 2023, 1, 3. [Google Scholar]
  90. Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, É.; Hesslow, D.; Launay, J.; Malartic, Q.; et al. The Falcon Series of Open Language Models. arXiv 2023, arXiv:2311.16867. [Google Scholar] [CrossRef]
  91. Koska, B.; Horvath, M. Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models, FLLM 2024, Dubai, United Arab Emirates, 26–29 November 2024; pp. 587–592. [Google Scholar] [CrossRef]
  92. Recasens, P.G.; Zhu, Y.; Wang, C.; Lee, E.K.; Tardieu, O.; Youssef, A.; Torres, J.; Berral, J.L. Towards Pareto optimal throughput in small language model serving. In Proceedings of the Workshop on Machine Learning and Systems; ACM: New York, NY, USA, 2024; pp. 144–152. [Google Scholar]
  93. Brei, F.; Frey, J.; Meyer, L.P. Leveraging small language models for Text2SPARQL tasks to improve the resilience of AI assistance. In Proceedings of the D2R2’24: Third International Workshop on Linked Data-driven Resilience Research 2024, Hersonissos, Greece, 27 May 2024; Volume 3707. [Google Scholar]
  94. Roh, J.; Kim, M.; Bae, K. Towards a small language model powered chain-of-reasoning for open-domain question answering. ETRI J. 2024, 46, 11–21. [Google Scholar] [CrossRef]
  95. Fields, C.; Natouf, O.; McMains, A.; Henry, C.; Kennington, C. Tiny language models enriched with multimodal knowledge from multiplex networks. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Singapore, 6–7 December 2023; pp. 47–57. [Google Scholar]
  96. Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J.C. Test-time self-adaptive small language models for question answering. arXiv 2023, arXiv:2310.13307. [Google Scholar]
  97. Xu, R.; Huang, Y.; Chen, X.; Zhang, L. Specializing small language models towards complex style transfer via latent attribute pre-training. In ECAI 2023; IOS Press: Amsterdam, The Netherlands, 2023; pp. 2802–2809. [Google Scholar]
  98. Li, X.; He, S.; Lei, F.; JunYang, J.; Su, T.; Liu, K.; Zhao, J. Teaching Small Language Models to Reason for Knowledge-Intensive Multi-Hop Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; ACL: Bangkok, Thailand, 2024; pp. 7804–7816. [Google Scholar]
  99. Sterbini, A.; Temperini, M. Open-Source or Proprietary Language Models? An Initial Comparison on the Assessment of an Educational Task. In Proceedings of the 2024 21st International Conference on Information Technology Based Higher Education and Training (ITHET), Paris, France, 6–8 November 2024; pp. 1–7. [Google Scholar]
100. Hao, Z.; Jiang, H.; Jiang, S.; Ren, J.; Cao, T. Hybrid SLM and LLM for Edge-Cloud Collaborative Inference. In Proceedings of the Workshop on Edge and Mobile Foundation Models (EdgeFM ’24); ACM: New York, NY, USA, 2024; pp. 36–41. [Google Scholar] [CrossRef]
  101. Mitchell, E.; Rafailov, R.; Sharma, A.; Finn, C.; Manning, C.D. An emulator for fine-tuning large language models using small language models. In Proceedings of the 12th International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  102. Bergeron, L.; Francois, J.; State, R.; Hilger, J. BLU-SynTra: Distinguish Synergies and Trade-offs between Sustainable Development Goals Using Small Language Models. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing; Chen, C.C., Liu, X., Hahn, U., Nourbakhsh, A., Ma, Z., Smiley, C., Hoste, V., Das, S.R., Li, M., Ghassemi, M., et al., Eds.; Association for Computational Linguistics: Torino, Italia, 2024; pp. 21–33. [Google Scholar]
  103. Wang, Y.; Tian, C.; Hu, B.; Yu, Y.; Liu, Z.; Zhang, Z.; Zhou, J.; Pang, L.; Wang, X. Can Small Language Models be Good Reasoners for Sequential Recommendation? In Proceedings of the ACM Web Conference 2024, New York, NY, USA, 13–17 May 2024; WWW ’24. pp. 3876–3887. [Google Scholar] [CrossRef]
  104. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-Thought in Neural Code Generation: From and For Lightweight Language Models. arXiv 2024, arXiv:2312.05562. [Google Scholar] [CrossRef]
  105. Park, J.H.; Lee, M.; Kim, J.; Lee, S. Coconut: Contextualized Commonsense Unified Transformers for Graph-Based Commonsense Augmentation of Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; ACL: Bangkok, Thailand, 2024; pp. 5815–5830. [Google Scholar] [CrossRef]
  106. Wu, X.; Zhou, H.; Shi, Y.; Yao, W.; Huang, X.; Liu, N. Could Small Language Models Serve as Recommenders? Towards Data-centric Cold-start Recommendation. In Proceedings of the ACM Web Conference 2024, New York, NY, USA, 13–17 May 2024; WWW ’24. pp. 3566–3575. [Google Scholar] [CrossRef]
  107. Chen, W.L.; Yen, A.Z.; Wu, C.K.; Huang, H.H.; Chen, H.H. ZARA: Improving few-shot self-rationalization for small language models. arXiv 2023, arXiv:2305.07355. [Google Scholar]
108. Xiao, R.; Dong, Y.; Zhao, J.; Wu, R.; Lin, M.; Chen, G.; Wang, H. FreeAL: Towards Human-Free Active Learning in the Era of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 14520–14535. [Google Scholar] [CrossRef]
  109. Yan, J.; Wang, C.; Zhang, T.; He, X.; Huang, J.; Zhang, W. From Complex to Simple: Unraveling the Cognitive Tree for Reasoning with Small Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; ACL: Bangkok, Thailand, 2023; pp. 12413–12425. [Google Scholar] [CrossRef]
  110. Han, H.; Liang, J.; Shi, J.; He, Q.; Xiao, Y. Small language model can self-correct. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 18162–18170. [Google Scholar]
Figure 1. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram illustrating the study selection process. The entire process was guided by a single core inclusion criterion: papers must focus on Small Language Models (SLMs) with up to 7 billion parameters. This rigorous filtering, which began with 706 initial records and was supplemented by 15 from snowballing, resulted in the final set of 70 included studies.
Figure 2. Distribution of the 70 analyzed SLM studies across key dimensions. This figure illustrates the landscape of Small Language Model research. (Top-Left) The geographical distribution shows that research is predominantly led by institutions in the USA and China. (Top-Right) The primary application areas are question answering (QA), information retrieval (IR), and classification, followed by reasoning and math-related tasks. (Bottom-Left) The majority of SLM research is made publicly available, promoting open science. (Bottom-Right) Research is conducted primarily by academic institutions rather than commercial entities. These trends highlight a research landscape characterized by open, academic-led exploration focused on practical, information-centric tasks.
Figure 3. Distribution of SLM research across domains and temporal growth. (Left) Paper distribution across specific application domains, with healthcare leading. (Right) Publication growth from 2023 to 2024, showing nearly doubled output.
Figure 4. The distribution of model parameter sizes reveals a bimodal focus. The majority of research is concentrated on models under 1 billion parameters, emphasizing efficiency. A second significant cluster exists at the 7 billion parameter mark, a size often favored by commercial entities to balance performance and manageability.
Figure 5. The evolving landscape of SLMs from 2023 to 2024. The timeline reveals a strong focus on two primary development tracks: highly efficient models under 1 billion parameters and powerful models in the 4–7 billion parameter range, with a majority around 7B parameters. The marked increase in new models between 2023 and 2024 shows the field’s rapid growth.
Figure 6. Research challenges presented by SLM compression techniques. The data show that knowledge distillation is the technique most frequently associated with research challenges, particularly generalization and scalability.
Figure 7. Prevalence of key architectural and optimization techniques in SLM research. This figure breaks down the design choices found in the analyzed studies. (Top-Left) The decoder-only architecture is the dominant choice, reflecting a focus on generative tasks. (Top-Right) Knowledge distillation is the most popular method for model compression, followed by quantization and Low-Rank Adaptation (LoRA), highlighting a trend toward transferring capabilities from larger models. (Bottom-Left) While standard multi-head attention remains common, more efficient variants like grouped-query attention are gaining traction. (Bottom-Right) RMSNorm is favored over standard LayerNorm, indicating a preference for computational efficiency in normalization layers. Collectively, these trends show a convergence in efficient, generative-focused designs.
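To make the normalization trade-off noted in Figure 7 concrete, the sketch below contrasts a minimal RMSNorm module with PyTorch’s built-in LayerNorm. This is an illustrative example only, not code taken from any reviewed model: the class name, hidden size, and epsilon are arbitrary choices, and production implementations may differ in detail. The point it shows is that RMSNorm rescales by the root mean square alone, skipping LayerNorm’s mean subtraction and bias term.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root-mean-square of the activations.

    Unlike LayerNorm, it performs no mean subtraction and learns no bias,
    which removes one reduction pass and a few element-wise ops per layer.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain only, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # inverse root-mean-square over the last (feature) dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

x = torch.randn(2, 16, 512)        # (batch, sequence, hidden) -- illustrative sizes
print(RMSNorm(512)(x).shape)       # torch.Size([2, 16, 512])
print(nn.LayerNorm(512)(x).shape)  # same shape, but with mean-centering and bias
```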
Figure 8. Usage frequency of the top 10 SLM evaluation benchmarks, distinguished by paper type. ‘Model’ papers rely heavily on broad, established benchmarks like MMLU to validate general capabilities, whereas ‘method’ papers show more moderate usage of benchmarks targeting specific skills like GSM8K.
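As a hedged illustration of how scores on answer-based benchmarks such as GSM8K are commonly reported, the following sketch computes exact-match accuracy over final numeric answers. The parsing heuristic and the toy predictions are assumptions made for demonstration; they are not the official evaluation scripts used by the reviewed papers, which typically apply more careful answer extraction.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in a response; GSM8K-style answers end with a numeric result."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    hits = sum(extract_final_number(p) == extract_final_number(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Toy example: the first prediction matches the reference, the second does not.
preds = ["She has 3 + 4 = 7 apples. The answer is 7.", "The total cost is $18."]
refs = ["7", "20"]
print(exact_match_accuracy(preds, refs))  # 0.5
```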
Figure 9. Visualizing the landscape of SLM research challenges. This heatmap reveals that while all application areas face hurdles, question answering, information retrieval, and classification (QA, IR, Class.) are the most problematic domains.
Table 1. A comparative analysis of Small Language Models (SLMs) and Large Language Models (LLMs).
Feature | SLMs | LLMs
Parameter Range | ∼1 M to 7B | ≥100B
Examples | Mistral 7B, Llama-2 7B | Llama-3.1 405B
Performance | Competitive on specific tasks | Emergent capabilities (reasoning, logic)
Training Costs | Lower (less energy, fewer resources) | High (energy-intensive, expensive)
Inference Speed | Fast; real-time on edge devices | Slower; requires server-class hardware
Deployment | Mobile, edge, IoT | Cloud, data center
Use Case Fit | Domain-specific fine-tuning | Broad-domain, general-purpose
Architecture | Transformer (often decoder-only) | Transformer (decoder-heavy, sometimes mixed)
Availability | Many open-source models | Often proprietary or restricted
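A back-of-the-envelope calculation grounds the deployment rows of Table 1: weight memory scales linearly with parameter count and bits per parameter. The helper below is an illustrative sketch (weights only, ignoring the KV cache and activations); the model sizes follow the parameter ranges in the table, while the bit-widths reflect common fp16 and 4-bit quantization settings rather than figures measured in the reviewed studies.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone (excludes KV cache and activations)."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> decimal GB

for name, n in [("1B SLM", 1e9), ("7B SLM", 7e9), ("100B LLM", 100e9)]:
    fp16 = weight_memory_gb(n, 16)
    int4 = weight_memory_gb(n, 4)
    print(f"{name:>9}: ~{fp16:5.1f} GB at fp16, ~{int4:5.1f} GB at 4-bit")

# Roughly: 1B -> 2.0 / 0.5 GB, 7B -> 14.0 / 3.5 GB, 100B -> 200.0 / 50.0 GB,
# which is why 7B-and-below models fit on consumer GPUs or edge devices once quantized.
```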
Table 2. Benchmarks analyzed and task types.
Benchmark | Type of Task
BBH | Multistep reasoning
MMLU | Multitask language understanding
Natural Questions (NQ) | Open-domain question answering
SQuADv2 | Reading-comprehension QA
GSM8K | Mathematical reasoning
MultiArith | Arithmetic reasoning
OpenbookQA | Factual QA
PIQA | Physical commonsense reasoning
SIQA | Social commonsense reasoning
ARC | Scientific QA
BoolQ | Yes/No question answering
Humaneval | Code generation/debugging
MATH | Competition-style mathematics
MBPP | Code generation (intro)
QuAC | Conversational QA
TriviaQA | Open-domain trivia QA
WinoGrande | Coreference commonsense reasoning
SQuAD | Reading-comprehension QA
Anthropic Helpful–Harmless | Alignment/safety preference
HellaSwag | Commonsense reasoning/NLI
COPA | Causal commonsense reasoning
Date | Temporal reasoning
LAMBADA | Broad-context language modeling
RACE | Exam reading-comprehension QA
SciQ | Science QA
MULTI STS-B | Semantic textual similarity
SQuAD-IT | Italian reading-comprehension QA
UD Italian ISDT | Dependency parsing
WikiNER | Named-entity recognition
XGLUE NC | News classification
MTEB | Embedding evaluation (mixed tasks)
AGIEval | Standardised-exam QA
CRASS | Logical reasoning
DROP | Discrete numerical reasoning
MS-MARCO | Passage retrieval/ranking
MT-bench | Multi-turn dialogue evaluation
QMSum | Meeting summarisation
ToxiGen | Toxic-language detection
NDCG@10 | Ranking metric (IR)
2WikiMultiHopQA | Multi-hop QA
HotpotQA | Multi-hop QA
TydiQA | Multilingual QA
CodeSearchNet | Code-to-text retrieval
CodeXGLUE | Code understanding/generation (suite)
JavaCorpus | Code language-modeling
MathQA-Python | Math reasoning with code execution
PY150 | Code modeling
CommonSenseQA | Commonsense QA
NumerSense | Numerical commonsense reasoning
QASC | Science QA (abductive)
QuaRTz | Qualitative reasoning
RiddleSense | Riddle-style commonsense reasoning
GeoQuery | Semantic parsing → SQL
KaggleDBQA | Text-to-SQL
Spider | Text-to-SQL (complex)
ML-100K | Recommender systems
TAPE | Protein modeling
MLPerf Tiny | Edge-device ML performance
AlpacaEval | Instruction-following eval
ASDiv | Algebraic word problems
SVAMP | Algebraic word problems (reasoning)
ANLI | Adversarial NLI
CQA | Commonsense QA (variant)
RAI | Responsible AI
E-SNI | Explainable NLI
GLUE | General NLU benchmark
BLUE | Biomedical NLU benchmark
HeadQA | Medical QA (Spanish)
MedMCQA | Medical multiple-choice QA
MedQA | Medical-exam QA
SHP | Helpfulness preference pairs
BC5CDR | Biomedical NER
CoNLL03 | Named-entity recognition
MR | Sentiment analysis
SST-2 | Sentiment analysis
SUBJ | Subjectivity classification
TREC | Question-type classification
Math23K | Math word problems
BLIMP | Linguistic phenomena probing
MSGS | Morphological segmentation
WikiSQL | Text-to-SQL
StrategyQA | Strategic multi-hop QA
QAMPARI | List-type open-domain QA
WikiQA | Open-domain QA
C4 | Language-model perplexity (pre-train)
SNLI | Natural-language inference
ROUGE-L | Summarization metric
G-Eval | LLM-as-judge eval metric
RealToxicity | Toxic-language detection
GPQA | Graduate-level physics QA
TruthfulQA | Truthfulness QA
AlignBench | Alignment evaluation benchmark
Arena-Hard | Hard preference-ranking eval
EvalPlus | Code-generation evaluation
IFEval | Instruction-following eval
LiveCodeBench | Live coding eval
MMLU-Pro | Professional-level MMLU
MixEval | Mixed-instruction eval
NeedleBench | Retrieval “needle-in-haystack” eval
Theorem QA | Theorem-proving QA
LibriTTS (test) | Speech-synthesis (TTS) test set
GYAFC | Style transfer (formal - informal)
MAWPS | Math word problems
SuperGLUE | Advanced NLU benchmark
Audio ASR | Automatic speech recognition
AudioCaps | Audio captioning
ChartQA | Visual QA (charts)
MMBench | Multimodal eval benchmark
MMMU | Multimodal multitask understanding
ScienceQA | Multimodal science QA
COMVE | Commonsense validation
E-SNL | Explainable NLI
ECQA | Explainable commonsense QA
SBIC | Social-bias identification
Table 3. An overview mapping the five core challenges in SLM research to their corresponding focus areas and proposed solution strategies. The table provides a structured breakdown of the techniques being developed to overcome these key obstacles in the field.
Challenge | Focus Area | Solutions Proposed
Scalability | Model Size Reduction | Weight pruning (sparsity), Ultra-low-bit quantization (4-bit/2-bit), Hardware-aware calibration
Scalability | Inference Efficiency | Test-Time Scaling (TTS), Iterative decoding, Majority-voting strategies
Scalability | Knowledge Transfer | Neural-Symbolic Collaborative Distillation, Skill-knowledge decoupling, External knowledge bases
Interpretability | Model Transparency | Cluster-ability loss objectives, Disentangled MoE experts, Non-overlapping experts
Interpretability | Decision Explanation | Explanation validation, Transparency, Interpretability, Modularity, Reasoning benchmarks, Human-in-the-loop evaluation
Hallucinations | Evaluation Methods | OnionEval unified framework, DiaHalu dialogue benchmark, Context-influence scoring
Hallucinations | Truthfulness Training | Teacher–student distillation, “I don’t know” uncertainty prompts, Rationalized training
Hallucinations | Generation Control | Knowledge-overshadowing analysis, Contrastive Decoding (CODE), Fact amplification
Generalization | Architecture Design | Mixture-of-Experts, Retrieval-augmented generation, Multi-task instruction tuning
Generalization | Knowledge Distillation | Teacher–student and Chain-of-thought distillation, Orca explanation tuning
Generalization | Domain Adaptation | Meta-learning, Domain-specific adapters, LoRA-based adaptation
Data Scarcity | Synthetic Data | LLM-generated corpora, Automated quality filtering, Pipeline-driven corpora
Data Scarcity | Low-Resource Learning | Parameter-efficient fine-tuning (PEFT), Cross-lingual transfer, Meta-learning with sparse data
Data Scarcity | Benchmark Creation | Knowledge-base integration, Linguistic resource enrichment, Low-resource scenarios
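Several of the adaptation strategies in Table 3 rely on LoRA-style low-rank updates. The sketch below shows the core idea, assuming PyTorch: the pretrained weight is frozen and only a rank-r update BA is trained. The class name, rank, and scaling are illustrative defaults, not the API of any particular library (such as the PEFT package) or a configuration used in the reviewed studies.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update: y = base(x) + x A^T B^T * scale."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 8192, versus 262,656 in the frozen dense layer
```

The appeal for SLM domain adaptation is that only the two small matrices are stored per task, so many domain-specific adapters can share one frozen backbone.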
Table 4. Mapping of optimization techniques to the five core challenges in Small Language Model research.
Technique | Generalization | Data Avail. | Hallucinations | Scalability | Interpretability
Knowledge Distillation [49,50,51,52,53]
LoRA/Adapter Tuning [56,58,59]
Quantization (8-bit, 4-bit) [54,55,56,60]
Chain-of-Thought Training [47,65,71]
Synthetic Data Generation [44,63,73]
Contrastive Decoding (CODE) [78]
Pruning/Sparse Models [57]
Mixture-of-Experts (MoE) [10,85]
Modular Architectures [84]
RAI Benchmarks/Rationalized Evaluation [47,60,75]
Note: Techniques highlighted in green are the most frequently adopted across the literature.
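Since knowledge distillation recurs across Table 4 and Figure 7 as the most widely adopted optimization technique, a minimal sketch of the standard softened-logit objective may be useful. The temperature, weighting, and vocabulary size below are illustrative assumptions; individual papers (e.g., chain-of-thought or explanation distillation) combine this term with task-specific losses rather than using it verbatim.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend of soft-target KL (teacher -> student) and hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(4, 32000, requires_grad=True)  # SLM logits over an assumed 32k vocabulary
teacher = torch.randn(4, 32000)                      # logits from a larger frozen teacher
labels = torch.randint(0, 32000, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```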
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
