Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions

Lertyosbordin, Chacharin; Iamruttanawong, Krittitee

doi:10.3390/app16125754

Open AccessArticle

Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions

by

Chacharin Lertyosbordin

^1,*,†

and

Krittitee Iamruttanawong

²

¹

National Defence Studies Institute, Bangkok 10400, Thailand

²

Bangkok Christian College, Bangkok 10500, Thailand

^*

Author to whom correspondence should be addressed.

^†

Current Address: School of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand.

Appl. Sci. 2026, 16(12), 5754; https://doi.org/10.3390/app16125754

Submission received: 30 April 2026 / Revised: 26 May 2026 / Accepted: 29 May 2026 / Published: 8 June 2026

(This article belongs to the Special Issue Practical Applications of Large Language Models in Natural Language Processing)

Download

Browse Figures

Versions Notes

Abstract

The deployment of small language models (SLMs) at the edge raises a fundamental behavioral question: under a fixed word budget, how do these models prioritize semantic content? This paper investigates content selection in two representative edge SLMs—gemma3n-e4b and Llama-3.2-3B—under systematic word-budget compression when generating person descriptions in English and Thai. A 2 × 2 (model × language) factorial design yielded 9360 observations across 18 profiles, 13 budget levels (1–25 words, odd-step intervals), and 10 trials per cell. Output richness was quantified via an Information Density Score (IDS), a binary annotation-based metric capturing seven attributes: name, occupation, gender, education, income, and two personality traits. Results confirm strong, generally increasing IDS–budget relationships across all conditions (Pearson r = 0.663–0.903). In English, gemma3n-e4b reached every key threshold exactly four words ahead of Llama-3.2-3B (B50: w = 7 vs. 11; B70: w = 13 vs. 17; B80: w = 17 vs. 21). In Thai, both models converged at B50 (w = 9), but gemma retained a four-word lead at B70 (w = 13 vs. 17), while Llama never reached B80 (max IDS = 0.796 at w = 25). Occupation was the invariant anchor across all conditions (retention ≥ 0.969). Models diverged on secondary attributes: gemma suppressed Gender (4.8% English, 24.2% Thai), while Llama deprioritized Education (23.6% English, 15.2% Thai; 0.267 at w = 25 in Thai). In Thai, gemma’s personality traits overtook Income in the salience hierarchy—a reordering absent in English. Gemma showed 2.5–3.8× higher within-trial consistency than Llama in both languages. These findings indicate that content selection under budget pressure is determined primarily by model architecture rather than linguistic context, and that Thai reduces but does not eliminate the cross-architecture compression efficiency gap.

Keywords:

small language model (SLM); edge deployment; content selection; word-budget compression; cross-lingual behavior

1. Introduction

The rapid miniaturization of neural language models has enabled their deployment directly on mobile devices, embedded systems, and other edge hardware, bypassing the cloud infrastructure on which large-scale language models depend [1]. This shift from cloud-hosted to on-device inference carries significant practical implications: edge deployment preserves user privacy, reduces latency, and enables operation in bandwidth-constrained or offline environments [2,3]. At the same time, it introduces a class of generation constraint that has received comparatively little systematic study—the word budget, a hard upper limit on the length of generated text imposed by display area, bandwidth quota, battery consumption, or downstream pipeline capacity.

Under a word budget, a language model can no longer include every piece of relevant information; it must implicitly or explicitly select what to include and what to omit. This selection process is not random. It reflects an internal prioritization—a content salience hierarchy—that is encoded in the model’s weights and shaped by its pre-training and instruction-tuning data. Understanding this hierarchy is important for any application in which the model generates person descriptions, medical summaries, biographical sketches, job-candidate profiles, or any structured description where completeness and coverage matter [4,5].

The study of information salience in language models is an emerging research direction. Trienes et al. [5] recently provided the first behavioral analysis of information salience in large language models, demonstrating that LLMs systematically prioritize certain types of information over others. Ravaut et al. [6] examined how LLMs utilize context in summarization tasks, and Lin and Zeldes [7] introduced entity-level salience evaluation across diverse text genres. These contributions establish that LLM content selection is structured and measurable—but they focus almost exclusively on large, cloud-hosted models and on the English language. The behavior of compact, edge-deployable SLMs under hard word-budget constraints, particularly in non-English languages, remains largely uncharacterized.

Two SLMs merit particular attention in this context. Google’s gemma3n-e4b [8] is a multimodal edge model based on the MatFormer nested architecture with Per-Layer Embedding caching, providing an effective 4-billion-parameter profile while running within the memory and computational constraints of consumer mobile hardware. Meta’s Llama-3.2-3B [9] is a 3.21-billion-parameter instruction-tuned model in the Llama 3 lineage, explicitly designed and optimized for on-device inference, with multilingual support that includes Thai. Both models represent the frontier of publicly available edge SLMs as of 2025, and their behavioral comparison provides an opportunity to examine how architectural differences translate into differences in content selection strategy.

Choosing Thai as the second language alongside English is motivated by three considerations. First, Thai presents a range of orthographic and morphological properties—no word-boundary spaces, tonal phonology, and a rich system of person reference through kinship terms and pronouns—that differ sharply from English and that may interact with budget constraints in language-specific ways [10,11]. Second, existing work on multilingual LLM summarization is heavily skewed toward European languages and Mandarin Chinese [12,13], leaving typologically distinct languages such as Thai comparatively under-represented in benchmarks. Third, there is active practical demand for Thai-language NLP applications in privacy-preserving and edge-deployed contexts [14,15], and understanding how SLMs behave in Thai under resource constraints directly informs the design of such systems.

The core behavioral question this paper addresses is the following: when an edge SLM is given only a small number of words in which to describe a person, what does it say, and what does it leave out? We operationalize this question through the Information Density Score (IDS), a binary-attribute-coverage metric grounded in information-theoretic principles [16], which counts how many of seven predefined person attributes (name, occupation, gender, education, income, and two personality traits) appear in a generated summary. By systematically varying the word budget from 1 to 25 words in 13 odd- numbered steps and collecting 10 independent trials per profile per budget level, we obtain a dense empirical picture of each model’s compression behavior across 18 distinct person profiles in both English and Thai.

The cross-lingual, cross-model design also provides the opportunity to ask a second question that is central to multilingual NLP: do the behavioral signatures of budget-constrained content selection generalize across languages? Recent work has shown that multilingual LLMs often process non-English input by translating to English internally [17,18], that language-specific latent structures affect cross-lingual performance [19], and that models exhibit different levels of stability across languages on the same task [20]. Whether these cross-lingual phenomena affect the content selection hierarchies of edge SLMs under budget compression is a question this paper is positioned to answer.

2. Research Questions

This study is organized around five research questions.

RQ1: How does word budget affect the Information Density Score of edge SLM-generated person descriptions in English and Thai, and what is the functional form of the IDS–budget relationship?
RQ2: What minimum word budget is required for each model and each language to cross the 50%, 70%, and 80% IDS thresholds (B50, B70, B80), and how do these thresholds compare across model-language combinations?
RQ3: What attribute salience hierarchy does each model exhibit under word-budget compression—that is, which semantic attributes are retained first and which are consistently omitted—and does this hierarchy differ between English and Thai?
RQ4: How does within-trial and between-profile consistency of content selection differ between models and between languages?
RQ5: To what extent do the content selection behaviors of gemma3n-e4b and Llama-3.2-3B converge or diverge across English and Thai, and what do these differences reveal about the role of model architecture versus linguistic context in determining content selection strategy?

3. Research Objectives

Corresponding to the research questions above, this study has five objectives.

O1: To characterize the statistical relationship between word budget and IDS in each of four model-language conditions using Pearson and Spearman correlation, one-sample t-tests against the IDS = 0.5 baseline at each budget level (Bonferroni-corrected), and adjacent-budget step significance analysis.
O2: To identify the minimum word budget threshold for achieving B50, B70, and B80 mean IDS in each condition, and to quantify the cross-condition gap in compression efficiency.
O3: To compute the overall and budget-stratified attribute retention rates for all seven semantic attributes in each condition, to rank attributes by salience, and to identify the structural ceiling imposed by the lowest-retention attribute in each condition.
O4: To measure within-cell trial consistency via within-cell standard deviation, and inter-profile variation via profile-level mean IDS, in each condition.
O5: To produce a systematic cross-model, cross-language comparison of content selection behavior and to interpret divergences in terms of model architecture, training data, and linguistic properties of the target language.

4. Contributions

This paper makes the following contributions to the literature.

First behavioral study of edge SLM content selection under word-budget compression. Prior work on information salience in LLMs [5,7] does not address the compressed-output, hard-budget regime that characterizes edge deployment.
First cross-lingual analysis of SLM person-description compression behavior. By including both English and Thai, and by collecting the same 2340-run experimental design in each language, we provide the first controlled comparison of how an edge SLM’s content selection hierarchy shifts—or does not shift—across a typologically distant language pair.
Cross-architecture replication. By conducting the same experiment on gemma3n-e4b and Llama-3.2-3B, two architecturally distinct edge SLMs from different developer organizations, we provide the first direct cross-model comparison of compression behavior in both English and Thai simultaneously.
Quantitative B-score framework. The B50/B70/B80 threshold framework introduced here provides a practical, interpretable metric for characterizing minimum budget requirements, analogous to the recall-at-k metrics used in information retrieval.
Open dataset and analysis code. All 9360 generated summaries and annotation data are released publicly [21] to support reproducibility and future benchmarking of edge SLM compression behavior.

5. Literature Review

This section provides a theoretical background on text summarization and content selection in language models, the capabilities and behavioral characteristics of small language models at the edge, cross-lingual variation in language model behavior, and the linguistic properties of Thai that are relevant to person-description generation, establishing the context for the research gaps addressed in this study.

5.1. Text Summarization, Content Selection, and the Problem of What Gets Left Out

Automatic text summarization has been transformed by large language models. The comprehensive survey by Zhang et al. [4] traces this evolution from rule-based extractive systems, through neural abstractive pipelines, to the present era in which instruction-tuned LLMs generate summaries in a zero-shot regime across domains and languages. LLMs bring two decisive advantages over earlier systems: a vastly broader semantic prior from large-scale pretraining, and instruction-following capability that allows the model to adapt summary style, length, and focus to explicit user directives. Yet Zhang et al. [4] also document a failure mode: despite extensive instruction tuning, LLMs frequently violate explicit word-count constraints and produce outputs that are fluent but semantically incomplete.

The standard evaluation paradigm for summarization quality measures surface-level similarity between a generated summary and a reference: ROUGE scores n-gram overlap, and BERTScore measures contextual embedding proximity [4]. These metrics conflate lexical fidelity with semantic completeness and are blind to the question of which attributes a summary covers. Minaee et al. [1] note that this evaluation gap has become one of the central methodological tensions of the field. This gap between “quality of expression” and “completeness of content” motivates the binary attribute-coverage approach of the Information Density Score used in the present study, which is grounded in the information-theoretic principle that informational content is a function of how many distinguishable states a message resolves [16].

The clearest evidence that LLM content selection is structured comes from the behavioral analysis of Trienes et al. [5], the work most directly related to the present study. Trienes et al. demonstrate that LLMs exhibit consistent, model-specific salience hierarchies: when generating or reducing text, they systematically retain certain types of information (named entities, main events, causal relations) while dropping others (quantitative hedges, secondary attributes, sociodemographic details). Crucially, however, their study characterizes large, cloud-hosted LLMs and operates exclusively in English. Whether the same structured selectivity appears in compact, edge-deployable models constrained to hard word budgets—and whether it survives the shift to a typologically distinct language—remains an open question that the present study addresses.

Ravaut et al. [6] add a positional dimension: their analysis shows that LLMs exhibit input-positional bias, preferentially retaining information from the beginning and end of the input document. In the present study, we control for this confound by structuring the input profile as a flat, order-balanced list in which no attribute has a systematic positional advantage.

Entity-level salience analysis has been studied by Lin and Zeldes [7] across 12 English genres through the GUMsley benchmark. Their key finding is that models over-retain statistically frequent entity types and under-retain semantically central but less frequent ones. The near-perfect Occupation retention and persistently low Education retention that we observe across all four conditions is consistent with this frequency-salience coupling. The structured evaluation approach of Gero et al. [22], who show in the clinical domain that decomposing evaluation into named attribute categories is more reliable and actionable than holistic scoring, provides the direct methodological precedent for the IDS framework.

The stakes of attribute omission extend beyond measurement methodology into consequential decision-making. Deroy et al. [23] demonstrate that LLMs summarizing legal judgments systematically omit minority holdings and procedural details—attributes that are legally consequential but narratively less salient. More pointedly, Seshadri et al. [24] show that even small asymmetries in how demographic attributes are represented in LLM-generated candidate profiles produce statistically significant shifts in subsequent human evaluation scores in hiring contexts.

5.2. Small Language Models at the Edge: Capabilities, Constraints, and Behavioral Gaps

The growing deployment of language models on mobile and embedded hardware has produced a distinct model class—small language models (SLMs)—that differ from cloud-hosted LLMs not only in scale but in the computational and behavioral trade-offs imposed by the edge environment. Lu et al. [2] provide a systematic survey and measurement study of SLMs, defining this class operationally by the requirement that inference fit within consumer hardware constraints. Their benchmark reveals a non-linear capability–scale relationship: the degradation in instruction-following, multilingual, and complex-reasoning capabilities accelerates steeply as parameter count decreases below approximately 3 billion. Wang et al. [3] extend this characterization to the techniques used to build SLMs and document their differential effects on multilingual capability.

Meta’s Llama 3.2 [9], released in September 2024, comprises 1B and 3B parameter variants built on the standard Llama transformer architecture and optimized for mobile inference. Multilingual support including Thai is a stated design objective [9]. Google’s gemma3n-e4b [8] represents a fundamentally different architectural philosophy, grounded in the MatFormer (nested Transformer) design of Kudugunta et al. [25]. Rather than deploying a single fixed-size model, the MatFormer architecture embeds multiple sub-models of decreasing computational cost within a single shared weight matrix by nesting Transformer FFN blocks. A single universal model thereby yields hundreds of distinct submodels without retraining or distillation. For gemma3n-e4b specifically, Per-Layer Embedding (PLE) caching pre-loads the most frequently accessed embedding layers into faster memory, reducing memory bandwidth requirements during inference.

Beyond the two models studied here, recent work establishes that SLMs in the 1B–7B parameter range can match much larger models on specific well-defined tasks. Xie et al. [26] demonstrate with the InfiR framework that compact SLMs achieve competitive reasoning performance through structured training curricula, while Mansha [27] shows that LoRA-based fine-tuning of LLaMA-3.2-3B on a single consumer GPU remains feasible. Xu et al. [28] find that Llama-3.2-3B-Instruct achieves news summarization quality competitive with 70B models on standard metrics—yet their analysis simultaneously reveals that SLMs show higher variance on length compliance and content completeness. The SlimLM system of Pham et al. [29] demonstrates that on-device document assistance is feasible on mid-range mobile hardware, while healthcare [30,31] and privacy-preserving NER [32] applications document the range of real-world contexts in which SLM edge deployment is motivated.

A specific concern for multilingual edge deployment is the effect of model quantization. Marchisio et al. [33] demonstrate that quantizing multilingual LLMs produces differential degradation across language pairs, with low-resource and typologically distant languages experiencing larger performance drops.

5.3. Cross-Lingual Behavior: What Changes When the Language Changes

The assumption that a multilingual model’s behavioral characteristics transfer uniformly across languages has been systematically challenged. Schut et al. [17] show that the intermediate layer representations of multilingual LLMs for non-English inputs converge toward English representational geometry before being mapped back to the target language in the output layer. Zhao et al. [18] identify language-specific attention heads through PLND neuron detection: while higher-level semantic processing is English-centric, lower-level morphological and syntactic processing engages language-specific circuits. This two-level structure predicts a specific pattern in our data—semantic-level content selection (the choice of what to describe) should be relatively stable across English and Thai, while surface-level encoding (the choice of how to express it) should show language-specific variation.

Lim et al. [19] show that language-specific latent processes can actively impede cross-lingual transfer, particularly for languages that are structurally and typologically distant from English. Zhang et al. [34] add a prompt-level consideration: the language of the instruction prompt can modulate content coverage. Nemkova et al. [20] document that instruction-tuned models exhibit different demographic coverage patterns in humanitarian NLP across languages. Zhao et al. [35] show directly that gender bias in LLMs differs systematically across languages and does not scale uniformly with cross-lingual performance degradation. Shi et al. [36] and Cotterell et al. [37] together establish the baseline expectation that multilingual reasoning capability is present in these models but uneven across languages. Goldman et al. [38] further show through the ECLeKTic benchmark that effective knowledge availability varies across languages, and Gupta et al. [39] confirm that multilingual LLMs are not genuinely multilingual thinkers.

5.4. Thai Natural Language Processing: Why Thai Is a Meaningful Test Case

Thai is written in scriptio continua—continuous script with no inter-word spaces—which means that word boundary detection is a prerequisite for all downstream NLP [10]. The PyThaiNLP toolkit [15] documents additional layers of Thai orthographic complexity. For the purposes of person-description generation, the segmentation challenge is directly relevant to Name retention: Thai personal names consist of multiple syllables, and a model that fails to segment the name as a single named entity may decompose it into its constituent morphemes.

Beyond orthography, Thai grammar is radically analytic: there is no morphological marking for tense, number, or grammatical gender. Gender is encoded optionally through personal pronouns (เขา for he/she/they, เธอ for she, or kinship terms), and its inclusion is always a pragmatic choice rather than a grammatical requirement. This structural optionality is the likely explanation for the cross-lingual asymmetry in Gender retention.

Thai is also a radical pro-drop language: subject noun phrases may be omitted whenever the referent can be recovered from discourse context [40]. Intratat [41] provides corpus evidence that 43.14% of subject positions are filled by zero anaphors in written Thai—a rate that makes null reference the dominant strategy. Pathanasin and Aroonmanakun [42] demonstrate through Centering Theory analysis that zero anaphors in Thai target texts outnumber overt references in corresponding English source texts, and that this asymmetry is strongest when the most salient discourse entity (the topic of a person description) persists across turns. For the present study, these properties have direct consequences for Name retention: a person’s name, once introduced, is the canonical most-salient entity in a person description, and subsequent references in a discourse-coherent Thai text will tend to be zero anaphors rather than repeated mentions of the name [41,42].

The transformer-based QA system of Phakmongkol and Vateekul [43] demonstrates that sequence-to-sequence models handle Thai grammar adequately for comprehension tasks, but at approximately 70% F1—notably lower than 90%+ reported for English. The Thai capability benchmarks of Kim et al. [11] reveal that cultural entities—occupation titles, social role descriptors, personality attributions—are systematically underrepresented in the training data of most multilingual models. The practical relevance of Thai-language edge NLP is made concrete by Thetbanthad et al. [14], whose application of LLMs to Thai PDPA-compliant PII redaction on pharmaceutical labels operates on exactly the set of personal attributes that forms the IDS attribute set of the present study.

5.5. Toward a Behavioral Characterization of Edge SLM Content Selection

The literature reviewed across the preceding four sections converges on a single diagnostic question: when an edge SLM is given a hard word budget and asked to describe a person, what does it include, and why? This question is not answered by the LLM summarization literature [4,5,6,7], which studies large cloud models in English without hard budget variation; not by the SLM literature [2,3,8,9,26,27,28,29,30,31,32], which characterizes average benchmark performance rather than attribute-level behavioral fingerprints; not by the cross-lingual literature [13,17,18,19,20,34,35,36,38], which characterizes large multilingual models; nor is it addressed by technical reports on modern SLM architectures [44] or studies focusing on practical downstream edge frameworks [45], as neither systematically investigates content selection hierarchies under word-budget compression; and not by the Thai NLP literature [10,11,14,15,43], which characterizes analysis tasks rather than generation under resource constraints. The Thai linguistics research reviewed in Section 5.4—pro-drop grammar [40], the corpus prevalence of zero anaphora [41], and the discourse-structural conditions under which subject NPs are suppressed in translation [42]—additionally indicates that the behavioral profile of an edge SLM generating Thai person descriptions reflects structurally licensed differences in reference strategy.

The attribute-level behavioral analysis conducted in this study is also motivated by two bodies of work on the consequences of LLM content selection. An et al. [46] demonstrate through a large-scale audit in PNAS Nexus that LLMs in resume evaluation tasks reproduce societal biases by differentially retaining professional attributes while underrepresenting sociodemographic attributes, and that this selective retention is amplified rather than attenuated under summary compression. Seshadri et al. [24] reinforce this point: the content that an LLM-based system chooses to omit from a person profile under length pressure has measurable downstream effects on how that person is evaluated.

The Information Density Score adopted as the primary measurement instrument connects these practical concerns to the theoretical framework of Shannon information theory [16]: a person description that covers k out of 7 attribute dimensions resolves k independent dimensions of uncertainty about the described individual, and IDS = k/7 directly quantifies this informational completeness. The present study is, to the authors’ knowledge, the first to combine (a) systematic word-budget variation across 13 levels, (b) attribute-level IDS measurement, (c) two architecturally distinct edge SLMs, and (d) both English and Thai as output languages, in a single controlled factorial design.

6. Methodology

This section details the experimental design, participant materials, model configurations, text generation procedure, information density measurement framework, annotation protocol, and statistical analysis plan employed to investigate content selection behavior under word-budget compression across two edge SLMs and two languages.

6.1. Experimental Design Overview

This study employs a 2 × 2 fully crossed factorial design with model (gemma3n-e4b, Llama-3.2-3B) and language (English, Thai) as the between-condition factors. Within each of the four conditions, the word budget is treated as a within-condition independent variable systematically varied across 13 levels. The dependent variable is the Information Density Score (IDS) of each generated output, defined formally in Section 6.5.

Each condition comprises 2340 observations derived from 18 person profiles × 13 budget levels × 10 independent trials, yielding 9360 total observations across the full study. The design is balanced: every profile appears at every budget level, and every budget level receives the same number of trials.

Figure 1 provides a high-level overview of the complete experimental pipeline from input preparation through statistical analysis, organized as five sequential stages: (1) input profiles in both languages, (2) prompt instruction with word-budget parameter, (3) SLM text generation across 13 budget levels and 10 trials per cell, (4) binary attribute annotation and IDS computation, and (5) statistical analysis yielding the attribute retention hierarchy per model–language condition.

6.2. Person Profile Construction

6.2.1. Profile Set

Eighteen fictional Thai person profiles were constructed to serve as the generation targets. Each profile specifies exactly seven attributes: (1) name, (2) occupation, (3) gender, (4) educational level, (5) monthly income (in Thai Baht), (6) first personality trait, and (7) second personality trait. The profiles were designed to achieve diversity across all seven dimensions. All profiles were originally constructed in Thai and independently translated into English by a bilingual researcher, with back-translation verification to ensure semantic equivalence. The full profile set is released in the study’s public repository [21].

6.2.2. Budget Level Selection

The word budget levels used in this study are 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, and 25 words—thirteen levels at odd-numbered intervals from 1 to 25. The minimum budget of 1 word tests the most compressed possible generation regime. The maximum budget of 25 words was selected to exceed the word count at which the model’s IDS first plateaus in preliminary generation trials conducted prior to the main experiment. To illustrate the feasibility of the upper bound, a 25-word English description can in principle include all seven attributes in natural language (e.g., “Somchai, a male electrical engineer with a bachelor’s degree, earns 40,000 baht monthly and is known for being meticulous and compassionate”—21 words, seven attributes present).

6.3. Model Selection and Technical Specifications

Two edge SLMs were selected to represent distinct points in the current edge-deployment design space. Technical specifications of the two edge SLMs shown in Table 1.

The gemma3n-e4b (Google, 2025) [8] is based on the MatFormer (Matryoshka Transformer) architecture [25,44], a nested design in which a single set of parameters simultaneously instantiates multiple sub-models of decreasing depth and width. At inference time, the model routes computation through the largest sub-model that fits within the available memory budget. The Per-Layer Embedding (PLE) caching mechanism pre-loads the embedding matrices for frequently accessed token types into faster memory tiers. The effective parameter count in the e4b variant is approximately 4 billion parameters; the runtime memory footprint is approximately 2 GB [8].

Llama-3.2-3B-Instruct (Meta, 2024) [9] is the instruction-tuned variant of the 3.21-billion-parameter Llama 3.2 model family. The architecture follows the standard Llama transformer design with grouped-query attention (GQA), rotary positional embeddings (RoPE), and SwiGLU activation functions. Unlike gemma3n-e4b, Llama-3.2-3B does not employ a nested computation design; all 3.21 billion parameters participate in every forward pass. The model’s instruction tuning incorporates supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) [9].

6.4. Text Generation Procedure

6.4.1. Prompt Design

Each generation call was structured as a two-turn chat prompt conforming to the instruction-tuning format of both models: a system message delivering the generation instruction with the word-budget parameter, and a user message delivering the input person profile as a prose paragraph.

The English-condition system message template was: “Read the following sentence and summarize it into a sentence, strictly limited to {w} words.”

The Thai-condition system message template was: “จงอ่านวลีต่อไปนี้แล้วสรุปเป็นประโยคจำกัดความยาว {w} คำเท่านั้น”.

The user message in both conditions was the full person profile presented as a prose paragraph containing all seven attributes in natural language. An example is shown in Figure 2.

6.4.2. Generation Settings

Both models were run under identical settings across all conditions. A sampling temperature of 0.5 was applied throughout. A fixed maximum token limit of 150 was applied to all generation calls regardless of the target word budget. No nucleus sampling parameter (top-p) and no repetition penalty beyond the model default were applied. Each (profile, budget) cell received 10 independent runs; inter-trial variation within a cell reflects genuine sampling stochasticity.

6.5. Information Density Score: Definition and Operationalization

6.5.1. Conceptual Grounding

The Information Density Score is designed to measure how many of the seven available information dimensions about a person are represented in a generated summary, independently of how that information is expressed. The conceptual foundation is Shannon’s notion of resolvable uncertainty [16]: each attribute that a summary mentions resolves one dimension of uncertainty about the described person. IDS = k/7, where k is the number of attributes present, directly quantifies this dimension-coverage ratio.

This formulation makes IDS orthogonal to surface quality metrics: a summary can be perfectly fluent while covering few attributes (low IDS), or slightly awkward while covering all seven attributes (high IDS). The present study is explicitly concerned with what the model chooses to say, not how well it expresses it.

6.5.2. Attribute Operationalization

The seven attributes and their annotation criteria are defined as follows. Each attribute is scored 1 (present) if any of the specified surface cues appear in the generated output, and 0 (absent) otherwise. An example of how the seven attributes appear in a sample profile is shown in Figure 2.

Attribute 1—Name: The output is scored 1 if any form of the person’s name appears: full name, given name alone, family name alone, or nickname as specified in the profile. A score of 0 is assigned if the person is referred to only by pronoun or generic descriptor.
Attribute 2—Occupation: A score of 1 is assigned if the person’s job title, professional role, or workplace function is mentioned. Near-synonymous role descriptors are accepted.
Attribute 3—Gender: A score of 1 is assigned if any of the following appear: (a) a gendered pronoun that unambiguously refers to the target person; (b) an explicit statement of gender; or (c) a gendered social or professional title.
Attribute 4—Education: A score of 1 is assigned if any mention of educational attainment, degree level, educational institution, or field of academic study appears. The job title “doctor” alone does not satisfy this criterion.
Attribute 5—Income: A score of 1 is assigned if a specific salary figure, income range, income tier, or explicit description of financial status appears.
Attribute 6—Trait 1: A score of 1 is assigned if the first personality trait specified in the profile, or a clear synonym or paraphrase of it, appears as a description of the person’s character.
Attribute 7—Trait 2: The same scoring criteria as Attribute 6 apply, directed to the second personality trait specified in the profile.

6.5.3. IDS Computation

For a generated output O with attribute presence vector A = (a₁, a₂, …, a₇) where each a_i ∈ {0, 1}, the IDS is computed as:

I D S (O) = \frac{1}{7} \sum_{i} a_{i}

IDS takes values in the set {0, 1/7, 2/7, 3/7, 4/7, 5/7, 6/7, 1}, corresponding to {0.000, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1.000}.

The B-score thresholds (B50, B70, B80) are derived from the IDS–budget curve as:

B_{k} = \min {w ∣ mean IDS (w) \geq k}, where k \in {0.5,0.7,0.8} .

These thresholds answer the engineering design question—“what minimum word budget is required to achieve mean coverage level k?”—rather than the primary measurement question.

6.6. Statistical Analysis Plan

All statistical analyses were conducted in Python (version 3.10) using the SciPy and Pandas libraries. The statistics output included:

Correlation analysis: Pearson’s r and Spearman’s ρ between word budget and IDS were computed across all 2340 observations per condition.
One-sample t-tests at each budget level: For each of the 13 budget levels, a one-sample t-test was conducted testing H₀: mean IDS = 0.5. A Bonferroni correction was applied for the 13 simultaneous tests, yielding a corrected significance threshold of α = 0.05/13 = 0.003846.
Adjacent-budget step significance tests: For each consecutive pair of budget levels, an independent-samples t-test was conducted testing H₀: mean IDS(w₁) = mean IDS(w₂). The same Bonferroni-corrected threshold of α = 0.003846 was applied.
B-score threshold identification: For each condition, the B50, B70, and B80 thresholds were identified as Bₖ = min{w: group mean IDS(w) ≥ k}.
Within-cell consistency: For each of the 234 profile × budget cells per condition, the standard deviation of IDS across the 10 trials was computed.
Profile-level variation: The mean IDS averaged across all 13 budget levels and 10 trials (130 observations per profile) was computed for each of the 18 profiles.

6.7. Generation and Annotation Pipeline

The complete experimental pipeline is illustrated in Figure 1. For each of the four model–language conditions, data collection proceeded as a fully enumerated triple loop over 18 profiles, 13 budget levels, and 10 independent trials. At each iteration, a two-turn chat prompt was constructed, submitted to the model with temperature = 0.5 and max_tokens = 150, and the resulting output text was stored together with its profile identifier, budget level, trial index, and condition label. Following generation, each output was independently scored by the language-matched native annotator across the seven binary attributes. All scored records were appended to a master dataset, yielding 9360 records in total. Profile order was fixed across all trials within a condition to ensure reproducibility. Output word counts were verified post-generation; all outputs conformed to the target word budget. The full implementation is available at [21].

6.8. Hardware and Software Environment

All text generation was performed on the hardware and software configuration reported in Table 2. Both models were loaded and served locally through LM Studio 0.4.9 using its OpenAI-compatible REST API endpoint; generation calls were submitted programmatically via Python scripts targeting this local endpoint.

7. Results

This section reports the empirical findings of the 2 × 2 (model × language) experiment across all 9360 observations. Section 7.1 addresses the functional form of the IDS–budget relationship (RQ1); Section 7.2 addresses minimum budget thresholds (RQ2); Section 7.3 addresses attribute salience hierarchies (RQ3); Section 7.4 addresses within-trial consistency and profile-level variation (RQ4); and Section 7.5 synthesizes the cross-model and cross-lingual comparisons (RQ5). All t-tests are one-sample tests against the chance-level baseline of IDS = 0.5, with Bonferroni correction applied over 13 budget levels (α = 0.05/13 = 0.003846). Effect sizes are reported as Cohen’s d = (M − 0.5)/SD for one-sample tests and d = ΔM/SD_pooled for adjacent-budget comparisons.

7.1. IDS–Budget Relationships (RQ1)

7.1.1. Overall Magnitude and Monotonicity

Across all four model-language conditions, the Information Density Score increased monotonically with word budget (Figure 3). The IDS–budget correlation was strong and statistically significant at p < 2.2 × 10⁻¹⁶ in every condition (Table 3). The Pearson and Spearman coefficients were nearly equal in each condition, confirming that the relationship is approximately rank-linear rather than driven by distributional outliers. However, the strength of this relationship varied substantially across conditions. The notably lower correlation for Llama-3.2-3B in Thai (r = 0.663, ρ = 0.670) indicates that word budget is a substantially less reliable predictor of output information density in Thai for this model than in any other condition.

7.1.2. Per-Budget Mean IDS

Table 4 presents the mean IDS at each of the 13 budget levels for all four conditions; the corresponding curves are plotted in Figure 3. All four conditions start well below IDS = 0.5 at w = 1 and rise through the budget range, but their trajectories differ in shape, steepness, and plateau onset. Three structural features are especially notable. First, the two models’ curves are nearly identical at w = 1 in English (gemma: 0.159; Llama: 0.143) but diverge immediately thereafter. Second, in Thai, Llama-3.2-3B begins at a substantially higher IDS at w = 1 (0.427) than gemma3n-e4b (0.143), a reversal that is inverted by w = 5. Third, gemma3n-e4b plateaus visibly earlier than Llama-3.2-3B in both languages. Asterisked (*) values indicate the first budget meeting the B50, B70, and B80 thresholds within each condition.

7.1.3. Steepest Gain Intervals and Inflection Points

The four conditions differ in the location of their steepest per-word gain. For gemma3n-e4b in English, the fastest gain occurs at the w = 3 to 5 step (ΔIDS = +0.169, d = +1.79, p < 0.001). For Llama-3.2-3B in English, the fastest gain is at w = 1 to 3 (ΔIDS = +0.129, d = +2.97, p < 0.001). For gemma3n-e4b in Thai, the fastest gain is also the w = 1 to 3 step (ΔIDS = +0.211, d = +3.19, p < 0.001)—the single largest adjacent-step gain across the entire study—driven by simultaneous early saturation of Occupation and both personality traits at w = 3. For Llama-3.2-3B in Thai, the steepest monotone gain is at w = 7 to 9 (ΔIDS = +0.074); the w = 1 to 3 step is non-monotone (ΔIDS = −0.048, reported in Section 7.1.4).

The inflection point—defined as the first budget level at which mean IDS is significantly above 0.5 after Bonferroni correction—occurs at w = 9 for gemma3n-e4b in English, at w = 11 for Llama-3.2-3B in English, and at w = 9 for both models in Thai.

7.1.4. Non-Monotone Steps

Despite the globally monotone trend, three non-monotone adjacent-budget steps were observed. For gemma3n-e4b in English, the w = 23 to 25 step produced a slight IDS decrease (ΔIDS = −0.011, d = −0.17, n.s.). For gemma3n-e4b in Thai, the w = 13 to 15 step produced a small IDS decrease (ΔIDS = −0.006, d = −0.05, n.s.), driven by a Name-retention dip at this budget level. For Llama-3.2-3B in Thai, the w = 1 to 3 step produced an IDS decrease of ΔIDS = −0.048 (d = −0.30, n.s. after Bonferroni correction), reflecting a distinctive early-budget pattern in Thai in which the model’s single-word output at w = 1 incidentally covers multiple attributes simultaneously. All three non-monotone steps were non-significant after Bonferroni correction.

7.2. Minimum Budget Thresholds: B-Score Analysis (RQ2)

Table 5 presents the minimum word budget required to achieve mean IDS at or above the 50%, 70%, and 80% thresholds (B50, B70, B80) in each condition. The B90 threshold was not reached by any of the four conditions within the tested budget range.

In English, gemma3n-e4b requires exactly four fewer words than Llama-3.2-3B to reach every threshold: B50 at w = 7 versus w = 11, B70 at w = 13 versus w = 17, and B80 at w = 17 versus w = 21. This uniform four-word gap across all three thresholds indicates a systematic and proportional compression efficiency advantage for gemma3n-e4b in English, rather than an isolated effect confined to any particular density level.

In Thai, the 4-word rule is partially preserved but modified by the language context. Both models share the B50 threshold at w = 9, suggesting equivalent minimum capacity to achieve majority-attribute coverage in Thai. From B70 onward, however, the 4-word gap re-emerges: gemma3n-e4b reaches B70 at w = 13 while Llama-3.2-3B requires w = 17. At B80, the gap widens sharply: gemma3n-e4b crosses B80 at w = 17, while Llama-3.2-3B never reaches mean IDS ≥ 0.80 within the tested range.

7.3. Attribute Salience Hierarchies (RQ3)

7.3.1. Overall Retention Rates

Table 6 presents the overall mean attribute retention rate—averaged across all 13 budget levels and all 18 profiles—for each of the seven semantic attributes in all four conditions; the values are visualized in Figure 4. The structural floor—the lowest-retention attribute—differs between the two models: Gender is the structural floor for gemma3n-e4b in both languages, and Education is the structural floor for Llama-3.2-3B in both languages.

7.3.2. Occupation as the Invariant Content Anchor

Occupation was the highest-retention attribute in all four conditions without exception, with overall retention ranging from 0.969 (both models in Thai) to 0.994 (gemma3n-e4b in English). At w = 1—the most extreme compression condition—gemma3n-e4b in English produced Occupation in 94.4% of single-word outputs, and Llama-3.2-3B in English produced Occupation in 90.0% of single-word outputs. Occupation retention exceeded 0.90 in all four conditions at every budget from w = 3 onward (lowest value: 0.911 for Llama-3.2-3B in Thai at w = 5), and reached or exceeded 0.969 in all four conditions at the overall (budget-averaged) level. This places Occupation consistently at the top of the attribute salience hierarchy across model, language, and budget.

7.3.3. Model-Specific Suppression: Gender (Gemma) and Education (Llama)

Each model consistently deprioritizes a different attribute across both languages. For gemma3n-e4b, Gender is the structural floor attribute: overall retention of 0.048 in English and 0.242 in Thai. At the maximum budget of w = 25, gemma3n-e4b’s Gender retention reaches only 0.139 in English and 0.356 in Thai. For Llama-3.2-3B, Education is the structural floor attribute: overall retention of 0.236 in English and 0.152 in Thai. Llama’s Education retention peaks at 0.800 in English at w = 25 but reaches only 0.267 in Thai at w = 25.

These suppression patterns are cross-linguistically stable for each model: Gemma consistently suppresses Gender in both English and Thai, and Llama consistently suppresses Education in both languages. The cross-model double dissociation is equally consistent: Llama’s overall Gender retention in Thai is 0.708 (ranking third), compared with Gemma’s 0.242 (ranking last); Gemma’s overall Education retention in English is 0.644, compared with Llama’s 0.236.

7.3.4. Budget-Stratified Attribute Retention

Table 7 presents attribute retention rates at five representative budget levels—Table 7a: w = 5 and w = 9; Table 7b: w = 13, w = 19, and w = 25—to illustrate how the salience hierarchy evolves as budget increases; Figure 5 shows the complete per-budget trajectory for all seven attributes across all four conditions.

Several patterns in Table 7 warrant specific attention. At the low-budget level of w = 5, gemma3n-e4b has committed to Occupation and is already accumulating Income and Trait scores, but Name is entirely absent from its English outputs (0.000) and near-absent from its Thai outputs (0.061). Llama-3.2-3B, by contrast, produces Name in more than half its English outputs at this budget (0.528) and in more than two-thirds of its Thai outputs (0.711). Furthermore, gemma3n-e4b in Thai shows a pronounced Trait 1 priority at w = 5–9 (Trait 1 = 0.961 at w = 5 in Thai), a pattern that does not occur with similar strength in English (Trait 1 = 0.533 at w = 5 in English). Finally, Education remains near zero for Llama-3.2-3B in both languages across the entire low-to-mid budget range.

7.4. Within-Trial Consistency and Profile-Level Variation (RQ4)

7.4.1. Within-Cell Consistency

Table 8 presents the mean within-cell standard deviation—the mean SD of IDS across 10 independent trials for the same (profile, budget) cell—for all four conditions.

The gemma3n-e4b is consistently more reproducible than Llama-3.2-3B across both languages. In English, Llama’s mean within-cell SD of 0.0735 is 2.51 times that of gemma’s 0.0293. In Thai, this ratio widens to 3.80, with Llama SD = 0.1279 versus gemma SD = 0.0337. gemma3n-e4b maintains near-equivalent consistency across languages, while Llama-3.2-3B’s consistency degrades substantially when generating in Thai. In English, five outputs of Llama-3.2-3B were classified as refusals or hallucinations (0.21% of 2340 runs), all scored IDS = 0.000. No refusals or hallucinations were observed for gemma3n-e4b in either language.

7.4.2. Profile-Level Variation

Table 9 presents the minimum, maximum, and range of profile-level mean IDS for each condition.

Profile-level variation is moderate and comparable across conditions. The range of profile means spans approximately 0.15–0.24 IDS units, indicating that profile content difficulty has a real but bounded effect on output attribute coverage. Profile 14 yields the highest mean IDS in both English conditions, suggesting that its attribute composition is inherently more compressible. Across all conditions, profile-level variation is substantially smaller than the between-budget variation, which spans approximately 0.70 IDS units from w = 1 to plateau.

7.5. Cross-Model and Cross-Lingual Comparison (RQ5)

7.5.1. The English 4-Word Rule

In English, the B-score analysis reveals a precise and uniform four-word compression efficiency advantage for gemma3n-e4b over Llama-3.2-3B at every density threshold—B50 at w = 7 versus w = 11, B70 at w = 13 versus w = 17, and B80 at w = 17 versus w = 21. The regularity of this gap across three different density levels spanning a 14-word range indicates that the compression efficiency difference is not confined to any single density level. Alongside its earlier threshold-crossing, gemma3n-e4b reaches an effective plateau in English by w = 21, with IDS unchanged from w = 21 to w = 23 (both 0.852). Llama-3.2-3B, by contrast, continues to improve through w = 25 (reaching 0.878 in English, the highest mean IDS achieved by any condition at any budget within the tested range).

7.5.2. Thai: B50 Convergence and Widening Gap at Higher Thresholds

In Thai, both models share the same B50 threshold at w = 9, a convergence that does not occur in English. This convergence is produced by different attribute profiles: gemma3n-e4b reaches B50 in Thai by combining early Occupation saturation (1.000 at w = 5, the earliest budget level reported in Table 7a) with rapid Trait 1 and Trait 2 accumulation, while Llama-3.2-3B reaches B50 in Thai via early and consistent Name and Gender retention combined with Occupation. From B70 onward the two models diverge decisively.

7.5.3. Attribute Salience: Convergence and Divergence

Both models agree on Occupation primacy in both languages. Beyond this single convergence point, the models diverge sharply. First, gemma3n-e4b suppresses Gender in both languages (0.048 EN, 0.242 TH) while Llama-3.2-3B retains Gender at moderate-to-high rates (0.430 EN, 0.708 TH). Second, Llama-3.2-3B suppresses Education in both languages (0.236 EN, 0.152 TH) while gemma3n-e4b retains it at substantially higher rates (0.644 EN, 0.457 TH). Third, gemma3n-e4b in Thai produces a salience order in which both personality traits (Trait 1 = 0.905, Trait 2 = 0.865) substantially outrank personal Name (0.430), while Llama-3.2-3B does not show this pattern.

7.5.4. Consistency: Results Summary

The gemma3n-e4b’s within-cell consistency advantage over Llama-3.2-3B is present in both languages and wider in Thai: 2.51 times in English and 3.80 times in Thai. gemma3n-e4b’s mean within-cell SD increases by only 0.0044 from English to Thai (0.0293 to 0.0337), while Llama-3.2-3B’s increases 1.74-fold across the same language shift (0.0735 to 0.1279).

7.5.5. Consolidated Cross-Condition Summary

Table 10 consolidates the key metrics from all four conditions for direct comparison. Three cross-condition contrasts are immediately apparent. First, gemma3n-e4b achieves a higher overall mean IDS than Llama-3.2-3B in both languages (EN: 0.635 vs. 0.588; TH: 0.646 vs. 0.608), reflecting its earlier B-score thresholds and lower structural floor. Second, Llama-3.2-3B shows a pronounced within-model dissociation in budget-responsiveness: its English condition yields the strongest Pearson r across all four conditions (0.903), while its Thai condition yields the weakest (0.663), a contrast absent in gemma3n-e4b (EN: 0.878; TH: 0.866). Third, floor attributes are model-specific and cross-linguistically stable—Gender for gemma3n-e4b (0.048 EN; 0.242 TH) and Education for Llama-3.2-3B (0.236 EN; 0.152 TH)—while consistency favors gemma3n-e4b by a factor of 2.51× in English and 3.80× in Thai.

8. Discussion

This section interprets the empirical findings reported in Section 7 in relation to the five research questions and situates them within the existing literature.

8.1. Budget as the Primary Driver of Content Selection Behavior

The strong monotonic IDS–budget correlations observed across all four conditions (r = 0.663–0.903) confirm that word budget is a primary and measurable driver of information density in edge SLM outputs. This finding directly replicates and extends the core insight of Trienes et al. [5], who demonstrated that LLM content selection is structured and measurable, and of Ravaut et al. [6], who showed that models are systematically selective about which portions of an input they draw upon.

The functional relationship between budget and IDS is not merely linear. In all four conditions, the marginal gain per word is highest at the lowest budget levels and decreases as budget increases, consistent with a diminishing-returns curve in which the model first commits to its highest-priority attributes and then adds lower-priority ones incrementally. This front-loaded compression structure is most pronounced for gemma3n-e4b.

The three non-monotone adjacent-budget steps observed in the data warrant brief discussion. For gemma3n-e4b in English, the slight IDS decrease at w = 23 to 25 (ΔIDS = −0.011) occurs well within the plateau zone and is attributable to natural stochastic variation in Name retention. For gemma3n-e4b in Thai, the IDS dip at w = 13 to 15 (ΔIDS = −0.006) may be connected to the orthographic complexity of Thai personal names in the context of scriptio continua writing [10]. For Llama-3.2-3B in Thai, the w = 1 to 3 IDS decrease (ΔIDS = −0.048) reflects a structurally different phenomenon: at w = 1, the model incidentally covers Name, Gender, and Income as a side-effect of its single-token Thai output, whereas at w = 3 the model initiates a more deliberate structured description. All three non-monotone steps were non-significant after Bonferroni correction.

8.2. Compression Efficiency and the 4-Word Rule

The uniform 4-word gap at every B-score threshold in English is one of the most structurally regular findings in this study. The regularity of the same gap across three thresholds spanning a 14-word range argues against a profile-specific or threshold-specific explanation and points toward a systematic architectural difference.

The gemma3n-e4b’s MatFormer nested architecture [8] and Per-Layer Embedding caching [44] represent a fundamentally different computational strategy from the standard transformer used in Llama-3.2-3B [9]. The MatFormer design, in which a single set of parameters instantiates multiple sub-models of decreasing depth and width, may produce a generation process that is more tightly coupled to the word-count constraint. Wang et al. [3] note that SLM capability-size trade-offs are non-linear, and the present result suggests that this non-linearity extends to the dimension of compression efficiency.

The four-word rule is partially preserved in Thai but erased at B50, where both models tie at w = 9. This language-conditional preservation is consistent with the interpretation that the architectural efficiency gap operates at the level of semantic compression strategy, which is broadly language-independent. It is also important to note that while gemma3n-e4b achieves each B-score threshold earlier, Llama-3.2-3B achieves a higher raw IDS at the maximum tested budget in English (0.878 versus 0.841 at w = 25). Xu et al. [28] similarly observed that SLM performance on standard quality metrics can converge at high budgets even when compression behavior differs substantially at lower budgets. We note that the four-word rule rests on three threshold observations (B50, B70, B80); a formal statistical test of gap uniformity (e.g., permutation test) would further strengthen this claim and is left for future work.

8.3. Occupation Primacy as a Universal Content Anchor

The near-universal prioritization of Occupation across all four conditions—retention exceeding 0.969 in every condition, reaching 1.000 at most mid-to-high budget levels, and dominating the w = 1 output in both models and both languages—constitutes the most robust behavioral regularity in this study. This finding is consistent with the information-theoretic principle underlying the IDS framework: of all seven attributes in a person description, Occupation has the highest discriminative information value, since knowing a person’s professional role substantially reduces uncertainty about their other characteristics [16].

From the perspective of entity salience research, this finding aligns with the observation of Lin and Zeldes [7] that entity types with high semantic connectivity in a document tend to receive disproportionately high salience scores. Gero et al.’s [22] decomposed attribute evaluation framework for clinical text similarly found that professional role descriptors were the most consistently included attribute across compression conditions.

The cross-linguistic stability of Occupation primacy—identical retention rates of 0.969 in both models in Thai—suggests that Occupation’s primacy reflects a learned functional heuristic that is encoded deeply enough in both architectures to transfer robustly across languages.

8.4. Attribute Suppression Hierarchies and Their Implications

The model-specific structural floors—Gender for gemma3n-e4b and Education for Llama-3.2-3B, each constituting near-zero retention in at least one condition—represent the most theoretically significant behavioral contrast between the two architectures. These suppression patterns are not budgetary artifacts: at the maximum tested budget of w = 25, gemma3n-e4b’s Gender retention reaches only 0.139 in English and 0.356 in Thai, while Llama-3.2-3B’s Education retention reaches only 0.800 in English and 0.267 in Thai. The persistence of these patterns across both languages and across the full budget range suggests that the suppression is encoded in the model’s weights rather than arising from a language-specific generation failure.

From a fairness perspective, these suppression patterns carry implications that extend beyond technical characterization. An et al. [46] demonstrated that automated resume evaluation systems suppress demographic attribute representation in ways that produce systematically inequitable outcomes. Seshadri et al. [24] showed that subtle differences in how demographic attributes are represented in LLM outputs produce allocational bias in hiring contexts. The present findings document a structurally similar phenomenon: when gemma3n-e4b generates a person description under budget constraints, it consistently omits Gender information regardless of how much budget is available.

Zhao et al.’s [35] finding that gender bias patterns in LLMs are not uniform across languages is directly corroborated by the present data: Gender retention for gemma3n-e4b increases from 0.048 in English to 0.242 in Thai, while Gender retention for Llama-3.2-3B increases from 0.430 in English to 0.708 in Thai. Both models suppress Gender less in Thai than in English, consistent with the prediction derived from Thai’s explicit gender-encoding mechanisms. However, Gemma’s Gender retention in Thai (0.242) remains dramatically lower than Llama’s (0.708), indicating that language-specific gender-encoding cannot be relied upon as a fairness safeguard when the underlying model has a strong architectural tendency to deprioritize gender information.

8.5. Cross-Lingual Modulation of Content Selection Behavior

The cross-lingual comparison reveals a two-level structure of behavioral variation: at the semantic level, both models show consistent attribute prioritization patterns across languages (Occupation first, model-specific structural floor last); at the surface-encoding level, both models show language-dependent modulation of intermediate-rank attributes (Gender, Name, Traits).

One possible interpretation is that this reflects a processing architecture in which semantic content selection is performed at a language-independent representational level, while surface-level encoding decisions are influenced by the specific lexical and morphosyntactic properties of the output language. This hypothesis is consistent with Schut et al. [17] and Zhao et al. [18]. Under this interpretation, the cross-linguistic stability of Occupation primacy and the model-specific structural floors would be evidence of stable semantic-level commitments, while the cross-linguistic variation in Gender and Trait retention would reflect surface-level differences. It must be emphasized, however, that this account is one possible interpretation and the present experimental design does not allow direct verification of the internal processing mechanism.

The most structurally novel cross-lingual finding concerns gemma3n-e4b’s personality-trait salience in Thai. Gemma consistently prioritizes personality traits above Name in both languages—a cross-architectural contrast with Llama-3.2-3B, which maintains Name in second rank in both languages. What changes between English and Thai for gemma is the magnitude of this effect and, critically, a genuine cross-language reordering: in English, Income outranks both personality traits (0.797 vs. 0.751/0.725); in Thai, both traits substantially outrank Income (0.905/0.865 vs. 0.657), constituting a structural reversal of the Income–Trait ordering between languages. Llama-3.2-3B shows no Income–Trait reordering of comparable magnitude: Name remains in second rank in both languages, and the Income–Trait ordering shifts by less than 0.03 IDS units across languages.

Corpus-linguistic evidence provides direct structural support for this interpretation. Intratat [41] documented that zero anaphora accounts for 43.14% of subject positions in written Thai. Phimsawat [40] classifies Thai as a radical pro-drop language in which both subjects and objects, including personal names, may be systematically dropped when the referent is contextually inferrable. Pathanasin and Aroonmanakun [42] demonstrate through Centering Theory analysis that proper names in Thai discourse are routinely omitted even in written registers when the referent has been established in prior context.

The B50 convergence in Thai—both models require w = 9—demonstrates that majority-attribute coverage can be achieved through multiple structurally distinct strategies: gemma3n-e4b reaches B50 via Occupation + Traits, while Llama-3.2-3B reaches it via Occupation + Name + Gender, with substantial Income contribution (0.639 at w = 9). Practitioners deploying edge SLMs for minimal Thai person-description tasks may therefore observe less cross-model variation than those deploying for richer descriptions.

8.6. Consistency as an Architectural Signature

The 2.51-fold consistency advantage of gemma3n-e4b over Llama-3.2-3B in English, widening to 3.80-fold in Thai, presents a behavioral dimension distinct from both compression efficiency and attribute salience. This distinction matters for deployment: an edge SLM that consistently produces the same attribute coverage across trials is more predictable and auditable than one that produces variable coverage.

The widening of the consistency gap from English to Thai—driven primarily by Llama-3.2-3B’s doubling of within-cell SD in Thai—is consistent with the finding that lower training data coverage of a language increases generation stochasticity. Wang et al. [3] document that multilingual capability in SLMs degrades non-linearly as model size decreases, and Marchisio et al. [33] show that quantization differentially degrades multilingual capabilities.

The gemma3n-e4b’s near-identical consistency in English and Thai (mean within-cell SD of 0.0293 versus 0.0337) is the more striking result. The MatFormer architecture’s nested sub-model design may produce a generation process in which content selection commitments are made earlier and more deterministically in the forward pass. This interpretation is directly supported by the empirical analysis of Kudugunta et al. [25]: in the MatFormer consistency study, sub-models extracted from a MatFormer training achieve 86% prediction consistency with each other, compared with 79% for independently trained models of equivalent size.

8.7. Implications for Responsible Edge Deployment

The findings of this study have several practical implications for the deployment of edge SLMs in person-description generation tasks.

From the budget planning perspective, the B-score framework provides a principled basis for setting minimum word budgets where attribute completeness matters. For gemma3n-e4b in English, a budget of w = 17 is sufficient to achieve mean IDS ≥ 0.80. For Llama-3.2-3B in English, the corresponding budget is w = 21. In Thai, no B80 threshold is achievable for Llama-3.2-3B within the tested range. These results imply that Thai-language person-description applications should either use gemma3n-e4b or accept that Llama-3.2-3B’s Thai outputs will not achieve 80% mean attribute coverage regardless of budget allocation.

From the model selection perspective, the choice between the two models involves a trade-off. gemma3n-e4b provides higher compression efficiency, earlier threshold-crossing, greater consistency, and a lower runtime memory footprint (gemma3n-e4b [8] ~ 2 GB; Llama-3.2-3B [9] ~ 6 GB), making it well-suited for applications where budget is tight. Llama-3.2-3B achieves higher maximum IDS at large budgets in English and retains Name more strongly across all conditions.

From the fairness auditing perspective, the attribute suppression patterns documented in this study constitute preliminary evidence that edge SLMs may produce systematically incomplete descriptions along demographic dimensions even when not operating under extreme compression. The IDS framework and its attribute-level decomposition provide a methodological template for pre-deployment fairness auditing of edge SLMs in any context where attribute completeness is consequential [23,24,46].

9. Conclusions

This paper has investigated the content selection behavior of two representative edge small language models—gemma3n-e4b and Llama-3.2-3B—under systematic word-budget compression when generating person descriptions in English and Thai. Across 9360 observations collected from an 18-profile × 13-budget × 10-trial design in each of four model-language conditions, five principal findings emerge.

First, word budget is a strong and consistent predictor of output information density, with generally increasing IDS–budget relationships observed across all four conditions (Pearson r = 0.663–0.903). This establishes that content selection behavior under budget compression is not random but follows a structured, budget-responsive pattern that is measurable, reproducible, and characterizable through the IDS framework introduced in this study.

Second, gemma3n-e4b achieves each information density threshold exactly four words earlier than Llama-3.2-3B in English, a regularity that holds at B50, B70, and B80 and is partially preserved in Thai. This four-word rule represents a systematic compression efficiency advantage that is consistent across density levels and likely reflects architectural differences in how the MatFormer nested-computation design [8] commits to content selection under constraint relative to the standard transformer architecture of Llama-3.2-3B [9].

Third, Occupation is the invariant content anchor in all four conditions, with retention exceeding 0.969 regardless of model, language, or budget level. This cross-lingual and cross-architectural universality—including near-saturation at the single-word budget limit—indicates that professional role has been learned as the primary identifying dimension of a person description in both models and both languages, consistent with its high discriminative information value [16].

Fourth, each model suppresses a distinct attribute as its structural floor: gemma3n-e4b persistently deprioritizes Gender (4.8% overall retention in English, 24.2% in Thai) while Llama-3.2-3B persistently deprioritizes Education (23.6% in English, 15.2% in Thai). These suppression patterns are cross-linguistically stable within each model, form a double dissociation between the two architectures, and directly determine the natural IDS ceiling in three of the four conditions; for Llama-3.2-3B in English, Education recovery at w = 25 reduces but does not eliminate this constraint. The fairness implications of these persistent suppressions are significant for any deployment context in which Gender or Education is a required or consequential attribute.

Fifth, gemma3n-e4b is 2.5 to 3.8 times more consistent than Llama-3.2-3B across repeated generation trials, with the consistency gap widening in Thai. This difference in within-cell behavioral stability provides evidence that reproducibility of content selection is an architectural property that persists across languages.

Taken together, these findings indicate that content selection behavior under word-budget compression is determined primarily by model architecture and training data, with linguistic context playing a secondary, modulating role. Thai modifies the magnitude but not the direction of every major architectural signature identified in English: Occupation remains primary, Gemma suppresses Gender and Llama suppresses Education, Gemma is more consistent, and the four-word efficiency gap reemerges at higher density thresholds. This pattern of modulation-without-reversal supports the view that the content selection hierarchies of edge SLMs are properties of the models themselves, and that cross-lingual deployment of an edge SLM will preserve rather than eliminate those properties.

The B50/B70/B80 threshold framework introduced in this study provides a practical, interpretable instrument for characterizing minimum budget requirements in deployment planning. The attribute-level IDS decomposition provides a template for pre-deployment fairness auditing that is language-neutral by design. The open dataset of 9360 annotated outputs [21] enables future work to reproduce, extend, and refine the behavioral characterizations reported here.

10. Limitations and Future Work

This section identifies the principal limitations of the current study and outlines directions for future research.

Scope of model comparison. The present study examines two edge SLMs selected to represent distinct points in the current design space of publicly available edge-deployable models. The two-model comparison cannot support strong generalizations about the class of edge SLMs as a whole. Future work should replicate the experimental design across a broader range of edge models, including models in the 1B parameter range (such as Llama-3.2-1B), models based on different base architectures (such as Phi-3 or Qwen2.5), and models with explicitly different multilingual training balances.

Scope of language comparison. The cross-lingual design employs English and Thai as the two language conditions. The inclusion of only one non-English language limits the generalizability of the cross-lingual findings. Extending the design to languages such as Arabic (VSO word order, rich morphology, right-to-left script), Japanese (agglutinative morphology, no word spaces), or Swahili (Bantu morphology) would allow the language-specificity claims to be more precisely calibrated.

Task scope and generalizability. The present study is conducted entirely within the person-description generation task, using a fixed set of seven semantic attributes. While the behavioral principles identified here are likely to have analogs in other structured summarization tasks, the specific attribute hierarchies observed cannot be assumed to transfer without empirical verification. Future work should adapt the IDS framework to clinical and legal domains [22,23] to test cross-domain generalizability.

Profile sample and diversity. The 18 person profiles used in this study represent a limited sample of the total space of person descriptions. Future work should employ a larger profile set—on the order of 100 or more profiles—to achieve more reliable estimates of inter-profile variance.

The IDS metric and attribute granularity. The binary attribute scoring underlying IDS collapses a wide range of expression quality into a single presence/absence judgment. Future work should extend the annotation scheme to include attribute-level accuracy (is the stated value correct?) and attribute prominence (how central or elaborated is the attribute mention?). Additionally, the fixed seven-attribute operationalization may not capture culturally relevant attributes that are important in specific deployment contexts.

Single annotator per language condition. Each language condition was scored by a single native-speaker annotator. It is therefore not possible to quantify how much of the within-condition variance in attribute scores reflects genuine model behavior versus idiosyncratic annotator judgments. Future work should employ a minimum of two independent annotators per language condition, with inter-rater agreement computed using Cohen’s κ or intraclass correlation coefficients.

Generation parameter sensitivity. All outputs were generated with a fixed temperature of 0.5 and a fixed maximum token limit of 150. Future work should conduct sensitivity analyses across a range of temperature settings to determine whether the attribute suppression hierarchies documented here are temperature-independent.

Word budget versus token budget. The experimental design specifies budgets in terms of word count, but the models internally generate tokens. The relationship between word count and token count differs by language: Thai text is tokenized into more tokens per word than English due to its syllabic script and the absence of word boundaries [10]. Future work should report both word-count and token-count metrics.

Reproducibility across hardware and inference backends. All generation results were produced on a single hardware configuration and inference backend (Table 2). Marchisio et al. [33] have shown that quantization can differentially affect multilingual capabilities; future work should assess whether the attribute suppression and consistency patterns observed here are preserved under common deployment quantization schemes (INT8, INT4).

Author Contributions

Conceptualization, C.L. and K.I.; methodology, C.L.; software, K.I.; validation, C.L. and K.I.; formal analysis, K.I.; investigation, C.L.; resources, K.I.; data curation, C.L. and K.I.; writing—original draft preparation, C.L.; writing—review and editing, C.L. and K.I.; visualization, C.L.; supervision, C.L.; project administration, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code, the controlled dataset, and the raw experimental results generated in this study are openly available in the GitHub repository at https://github.com/chacharin/slm-content-selection-behavior (accessed on 25 May 2026), cited as reference [21].

Conflicts of Interest

The authors declare no conflict of interest.

References

Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large language models: A survey. arXiv 2024, arXiv:2402.06196. [Google Scholar] [CrossRef]
Lu, Z.; Li, X.; Cai, D.; Yi, R.; Liu, F.; Zhang, X.; Lane, N.D.; Xu, M. Small language models: Survey, measurements, and insights. arXiv 2024, arXiv:2409.15790. [Google Scholar] [CrossRef]
Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with LLMs, and trustworthiness. arXiv 2024, arXiv:2411.03350. [Google Scholar] [CrossRef]
Zhang, Y.; Jin, H.; Meng, D.; Wang, J.; Tan, J. A comprehensive survey on process-oriented automatic text summarization with exploration of LLM-based methods. arXiv 2024, arXiv:2403.02901. [Google Scholar]
Trienes, J.; Schlötterer, J.; Li, J.J.; Seifert, C. Behavioral analysis of information salience in large language models. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 23428–23454. [Google Scholar]
Ravaut, M.; Sun, A.; Chen, N.F.; Joty, S. On context utilization in summarization with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2764–2781. [Google Scholar]
Lin, J.; Zeldes, A. GUMsley: Evaluating entity salience in summarization for 12 English genres. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), St. Julian’s, Malta; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024. [Google Scholar]
Gemma Team. Gemma 3n Model Card. Google AI for Developers. 2025. Available online: https://ai.google.dev/gemma/docs/gemma-3n/model_card (accessed on 1 April 2026).
Meta AI. Llama 3.2: Revolutionizing edge AI and Vision with Open, Customizable Models. Meta AI Blog. 25 September 2024. Available online: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ (accessed on 12 May 2026).
Wen, Y. UnifiedCut: A simple and efficient neural model for Thai, Burmese and Khmer word segmentation. Appl. Sci. 2024, 14, 11435. [Google Scholar] [CrossRef]
Kim, D.; Lee, S.; Kim, Y.; Rutherford, A.; Park, C. Representing the under-represented: Cultural and core capability benchmarks for developing Thai large language models. arXiv 2024, arXiv:2410.04795. [Google Scholar] [CrossRef]
Zhang, T.; Ladhak, F.; Durmus, E.; Liang, P.; McKeown, K.; Hashimoto, T.B. Benchmarking large language models for news summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 39–57. [Google Scholar] [CrossRef]
Pan, R. Can LLMs generate coherent summaries? Leveraging LLM summarization for Spanish-language news articles. Appl. Sci. 2025, 15, 11834. [Google Scholar] [CrossRef]
Thetbanthad, P.; Sathanarugsawait, B.; Praneetpolgrang, P. Automated redaction of personally identifiable information on drug labels using optical character recognition and large language models for compliance with Thailand’s Personal Data Protection Act. Appl. Sci. 2025, 15, 4923. [Google Scholar] [CrossRef]
Phatthiyaphaibun, W.; Chaovavanich, K.; Polpanumas, C.; Suriyawongkul, A.; Lowphansirikul, L.; Chormai, P.; Limkonchotiwat, P.; Suntorntip, T.; Udomcharoenchaikit, C. PyThaiNLP: Thai natural language processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), Singapore, 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 25–36. [Google Scholar]
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Schut, L.; Gal, Y.; Farquhar, S. Do multilingual LLMs think in English? arXiv 2025, arXiv:2502.15603. [Google Scholar] [CrossRef]
Zhao, Y.; Zhang, W.; Chen, G.; Kawaguchi, K.; Bing, L. How do large language models handle multilingualism? Adv. Neural Inf. Process. Syst. 2024, 37, 15296–15319. [Google Scholar]
Lim, Z.W.; Aji, A.F.; Cohn, T. Language-specific latent process hinders cross-lingual performance. arXiv 2025, arXiv:2505.13141. [Google Scholar]
Nemkova, P.; Adhikari, A.; Pearson, M.; Sadu, V.K.; Albert, M.V. Cross-lingual stability and bias in instruction-tuned language models for humanitarian NLP. arXiv 2025, arXiv:2510.22823. [Google Scholar]
Lertyosbordin, C. SLM Content Selection Behavior. GitHub. 2026. Available online: https://github.com/chacharin/study_slm_content_selection_behavior (accessed on 25 May 2026).
Gero, Z.; Singh, C.; Xie, Y.; Zhang, S.; Subramanian, P.; Vozila, P.; Naumann, T.; Gao, J.; Poon, H. Attribute structuring improves LLM-based evaluation of clinical text summaries. arXiv 2024, arXiv:2403.01002. [Google Scholar] [CrossRef]
Deroy, A.; Ghosh, K.; Ghosh, S. Applicability of large language models and generative models for legal case judgement summarization. Artif. Intell. Law. 2025, 33, 1007–1050. [Google Scholar] [CrossRef]
Seshadri, P.; Chen, H.; Singh, S.; Goldfarb-Tarrant, S. Small changes, large consequences: Analyzing the allocational fairness of LLMs in hiring contexts. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2025), Mumbai, India, December 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 2645–2665. [Google Scholar] [CrossRef]
Devvrit; Kudugunta, S.; Kusupati, A.; Dettmers, T.; Chen, K.; Dhillon, I.; Tsvetkov, Y.; Hajishirzi, H.; Kakade, S.; Farhadi, A.; et al. MatFormer: Nested transformer for elastic inference. Adv. Neural Inf. Process. Syst. 2024, 37, 140535–140564. [Google Scholar]
Xie, C. InfiR: Crafting effective small language models and multimodal small language models in reasoning. arXiv 2025, arXiv:2502.11573. [Google Scholar]
Mansha, I. Resource-efficient fine-tuning of LLaMA-3.2-3B for medical chain-of-thought reasoning. arXiv 2025, arXiv:2510.05003. [Google Scholar]
Xu, B.; Chen, Y.; Wen, Z.; Liu, W.; He, B. Evaluating small language models for news summarization: Implications and factors influencing performance. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 4909–4922. [Google Scholar] [CrossRef]
Pham, T.M.; Nguyen, P.T.; Yoon, S.; Lai, V.D.; Dernoncourt, F.; Bui, T. SlimLM: An efficient small language model for on-device document assistance. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 436–447. [Google Scholar] [CrossRef]
Garg, M.; Raza, S.; Rayana, S.; Liu, X.; Sohn, S. The rise of small language models in healthcare: A comprehensive survey. arXiv 2025, arXiv:2504.17119. [Google Scholar] [CrossRef]
Wang, X.; Dang, T.; Zhang, X.; Kostakos, V.; Witbrock, M.J.; Jia, H. HealthSLM-Bench: Benchmarking small language models for mobile and wearable healthcare monitoring. arXiv 2025, arXiv:2509.07260. [Google Scholar]
Psarra, E.; Stefanidis, K. On the applicability of LLMs and SLMs for privacy-preserving named entity recognition in financial applications. Appl. Sci. 2026, 16, 3332. [Google Scholar] [CrossRef]
Marchisio, K.; Dash, S.; Chen, H.; Aumiller, D.; Üstün, A.; Hooker, S.; Ruder, S. How does quantization affect multilingual LLMs? In Proceedings of the Findings Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 15928–15947. [Google Scholar]
Zhang, L.; Zhou, Y.; Ergen, T.; Logeswaran, L.; Lee, M.; Jurgens, D. Cross-lingual prompt steerability: Towards accurate and robust LLM behavior across languages. arXiv 2025, arXiv:2512.02841. [Google Scholar] [CrossRef]
Zhao, J.; Ding, Y.; Jia, C.; Wang, Y.; Qian, Z. Gender bias in large language models across multiple languages. arXiv 2024, arXiv:2403.00277. [Google Scholar] [CrossRef]
Shi, F.; Suzgun, M.; Freitag, M.; Wang, X.; Srivats, S.; Vosoughi, S.; Chung, H.W.; Tay, Y.; Ruder, S.; Zhou, D.; et al. Language models are multilingual chain-of-thought reasoners. arXiv 2022, arXiv:2210.03057. [Google Scholar]
Cotterell, R.; Mielke, S.J.; Eisner, J.; Roark, B. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Vol. 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 536–541. [Google Scholar]
Goldman, O.; Shaham, U.; Malkin, D.; Eiger, S.; Hassidim, A.; Matias, Y.; Maynez, J.; Gilady, A.M.; Riesa, J.; Rijhwani, S.; et al. ECLeKTic: A novel challenge set for evaluation of cross-lingual knowledge transfer. arXiv 2025, arXiv:2502.21228. [Google Scholar]
Gupta, A.; Joseph, R.; Rai, S. Multilingual LLMs are not multilingual thinkers: Evidence from Hindi analogy evaluation. arXiv 2025, arXiv:2507.13238. [Google Scholar] [CrossRef]
Phimsawat, O.-U. The Syntax of Pro-Drop in Thai. Ph.D. Dissertation, School of English Lit., Lang. & Linguistics, Newcastle University, Newcastle upon Tyne, UK, 2011. [Google Scholar]
Intratat, C. The invisible agent with global meaning: Thai zero anaphor subjects. In SEALS XV: Papers from the 15th Meeting of the Southeast Asian Linguistics Society; Sidwell, P., Ed.; Pacific Linguistics: Canberra, Australia, 2005. [Google Scholar]
Pathanasin, S.; Aroonmanakun, W. A Centering Theory analysis of discrepancies on subject zero anaphor in English to Thai translation. Manusya J. Humanit. 2014, 17, 45–63. [Google Scholar] [CrossRef]
Phakmongkol, P.; Vateekul, P. Enhance text-to-text transfer transformer with generated questions for Thai question answering. Appl. Sci. 2021, 11, 10267. [Google Scholar] [CrossRef]
Gemma Team; Google DeepMind. Gemma 3 technical report. arXiv 2025, arXiv:2503.19786. [Google Scholar] [CrossRef]
Jung, T.; Joe, I. An intelligent docent system with a small large language model (sLLM) based on retrieval-augmented generation (RAG). Appl. Sci. 2025, 15, 9398. [Google Scholar] [CrossRef]
An, J.; Huang, D.; Lin, C.; Tai, M. Measuring gender and racial biases in large language models: Intersectional evidence from automated resume evaluation. PNAS Nexus 2025, 4, pgaf089. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of the five-stage experimental pipeline. Stage 1: 18 person profiles prepared in English and Thai. Stage 2: word-budget prompt instruction. Stage 3: two edge SLMs generate summaries across 13 budget levels and 10 independent trials per cell (9360 outputs total). Stage 4: language-matched annotator scores seven binary attributes and computes IDS. Stage 5: statistical analyses produce attribute retention hierarchy per condition.

Figure 2. An example person profile with color-coded attribute spans. The Thai source sentence (top) and its English translation (bottom) are annotated with the seven binary attributes scored in this study: Name, Occupation, Gender, Trait 1, Trait 2, Education, and Income.

Figure 3. Mean Information Density Score (IDS) as a function of word budget (w) for all four model–language conditions: gemma3n-e4b English (solid teal), Llama-3.2-3B English (solid orange), gemma3n-e4b Thai (dashed teal), and Llama-3.2-3B Thai (dashed blue). Hollow circles mark the budget level at which each condition first crosses the B50, B70, and B80 thresholds. Horizontal dashed lines show the three threshold levels.

Figure 4. Overall attribute retention rates averaged across all 13 word budgets, for each of the seven attributes and all four model–language conditions. Attributes are ordered left to right by descending overall retention. Occupation is the most consistently retained attribute; Gender and Education show the greatest cross-condition variance.

Figure 5. Attribute-level retention rates as a function of word budget (w = 1 to 25) for each of the four model–language conditions: (a) gemma3n-e4b · English, (b) Llama-3.2-3B · English, (c) gemma3n-e4b · Thai, and (d) Llama-3.2-3B · Thai. Each panel shows seven lines corresponding to the seven attributes. Occupation (solid) reaches near-ceiling retention at the lowest budgets in all conditions; Gender is suppressed in the Gemma conditions while Education is suppressed in the Llama conditions, reflecting the model-specific structural floors.

Table 1. Technical specifications of the two edge SLMs.

Specification	gemma3n-e4b	Llama-3.2-3B-Instruct
Developer	Google	Meta
Parameter count	~4B (effective, MatFormer)	3.21B
Architecture	MatFormer + PLE caching	Standard Transformer (GQA)
Runtime memory	~2 GB	~6 GB (fp16)
Instruction tuned	Yes	Yes (SFT + RLHF)
Thai support	Yes [44]	Yes [9]
Release date	June 2025	September 2024
Reference	[8]	[9]

Table 2. Specifications of the hardware and software environment used in this study.

Component	Specification
GPU	NVIDIA GeForce RTX 3050 6 GB Laptop GPU
CPU	13th Gen Intel Core i7-13650HX (14 cores, 2.60 GHz)
RAM	24 GB DDR5 @ 4800 MHz (2 × 12 GB)
Operating System	Windows 10 (Build 26200.8457)
Inference backend	LM Studio 0.4.9 (OpenAI-compatible REST API)
Model serving format	GGUF (quantized, served locally via LM Studio)
gemma3n-e4b checkpoint	google/gemma-3n-E4B-it (HuggingFace Hub, GGUF)
Llama-3.2-3B checkpoint	meta-llama/Llama-3.2-3B-Instruct (HuggingFace Hub, GGUF)

Table 3. Summary statistics for all four model-language conditions. N = 2340 per condition. Pearson r and Spearman ρ report the IDS–budget correlation. Bonferroni α = 0.003846. *** p < 0.001.

Condition	Overall M (SD)	Pearson r	Spearman ρ	Inflection Budget
gemma3n-e4b × EN	0.635 (0.247)	0.878 ***	0.885 ***	w = 9
Llama-3.2-3B × EN	0.588 (0.244)	0.903 ***	0.906 ***	w = 11
gemma3n-e4b × TH	0.646 (0.242)	0.866 ***	0.866 ***	w = 9
Llama-3.2-3B × TH	0.608 (0.213)	0.663 ***	0.670 ***	w = 9

Table 4. Mean IDS per budget level across all four model-language conditions (n = 180 per cell). Asterisked values indicate the first budget meeting the B50 (≥0.50), B70 (≥0.70), and B80 (≥0.80) thresholds within each condition.

Budget (w)	Gemma EN	Llama EN	Gemma TH	Llama TH
1	0.159	0.143	0.143	0.427
3	0.225	0.272	0.354	0.379
5	0.394	0.375	0.448	0.399
7	0.502 *	0.421	0.485	0.465
9	0.614	0.496	0.564 *	0.539 *
11	0.676	0.568 *	0.677	0.590
13	0.731 *	0.637	0.744 *	0.646
15	0.756	0.679	0.737	0.698
17	0.810 *	0.734 *	0.810 *	0.703 *
19	0.839	0.775	0.844	0.739
21	0.852	0.814 *	0.862	0.764
23	0.852	0.856	0.867	0.764
25	0.841	0.878	0.867	0.796

Table 5. Minimum word budget thresholds (B-scores) for each model-language condition. NR = not reached within the tested range (maximum w = 25). Mean IDS at each threshold shown in parentheses.

Threshold	Gemma EN	Llama EN	Gap (EN)	Gemma TH	Llama TH	Gap (TH)
B50 (IDS ≥ 0.50)	w = 7 (0.502)	w = 11 (0.568)	4 words	w = 9 (0.564)	w = 9 (0.539)	0 words
B70 (IDS ≥ 0.70)	w = 13 (0.731)	w = 17 (0.734)	4 words	w = 13 (0.744)	w = 17 (0.703)	4 words
B80 (IDS ≥ 0.80)	w = 17 (0.810)	w = 21 (0.814)	4 words	w = 17 (0.810)	NR (max 0.796)	NR (lower bound: ≥8 words)
B90 (IDS ≥ 0.90)	NR (max 0.852)	NR (max 0.878)	—	NR (max 0.867)	NR (max 0.796)	—

Note: Lower bound computed as max tested budget (w = 25) minus Gemma TH B80 (w = 17). Actual gap is unknown as Llama TH did not reach B80 within the tested range.

Table 6. Overall mean attribute retention rates (all budgets combined, N = 2340 per condition). Lowest-retention attribute in each condition constitutes the structural floor.

Attribute	Gemma EN	Llama EN	Gemma TH	Llama TH
Occupation	0.994	0.980	0.969	0.969
Income	0.797	0.596	0.657	0.584
Trait 2	0.751	0.617	0.865	0.514
Trait 1	0.725	0.486	0.905	0.567
Education	0.644	0.236	0.457	0.152
Name	0.485	0.774	0.430	0.765
Gender	0.048	0.430	0.242	0.708

Table 7. (a) Attribute retention rates at budget levels w = 5 and w = 9 for all four conditions. (b) Attribute retention rates at budget levels w = 13, w = 19, and w = 25 for all four conditions.

(a)
Attribute	G_EN w5		G_TH w5		L_EN w5		L_TH w5		G_EN w9		G_TH w9		L_EN w9		L_TH w9
Name	0.994		1.000		1.000		0.911		1.000		1.000		0.994		0.972
Occupation	0.000		0.061		0.528		0.711		0.267		0.139		0.733		0.744
Gender	0.000		0.106		0.450		0.533		0.000		0.261		0.539		0.728
Education	0.217		0.006		0.044		0.067		0.672		0.206		0.139		0.139
Income	0.533		0.172		0.294		0.233		0.994		0.456		0.406		0.639
Trait 1	0.533		0.961		0.128		0.217		0.689		0.983		0.228		0.339
Trait 2	0.483		0.833		0.178		0.122		0.672		0.906		0.433		0.211
(b)
Attribute	G_EN w13	G_TH w13		L_EN w13	L_TH w13	G_EN w19		G_TH w19	L_EN w19	L_TH w19		G_EN w25	G_TH w25	L_EN w25		L_TH w25
Name	1.000	1.000		0.989	0.967	1.000		1.000	0.972	1.000		0.994	1.000	1.000		1.000
Occupation	0.511	0.417		0.989	0.878	0.939		0.900	1.000	0.844		0.867	0.872	0.994		0.878
Gender	0.061	0.361		0.461	0.806	0.067		0.278	0.467	0.772		0.139	0.356	0.528		0.761
Education	0.833	0.561		0.056	0.133	0.867		0.783	0.300	0.189		0.883	0.867	0.800		0.267
Income	1.000	0.933		0.644	0.678	1.000		0.989	0.922	0.717		1.000	0.983	0.983		0.783
Trait 1	0.806	0.933		0.622	0.594	1.000		0.961	0.783	0.850		1.000	0.989	0.850		0.939
Trait 2	0.911	1.000		0.694	0.467	1.000		1.000	0.978	0.800		1.000	1.000	0.989		0.944

Table 8. Within-cell IDS consistency across all 234 cells per condition (10 trials per cell).

Condition	Mean Within-Cell SD	Max Within-Cell SD	Worst Cell	Ratio to Gemma
gemma3n-e4b × EN	0.0293	0.1421	P2, w = 3	1.00× (reference)
Llama-3.2-3B × EN	0.0735	0.2793	P1, w = 15	2.51×
gemma3n-e4b × TH	0.0337	0.1622	P3, w = 15	1.00× (reference)
Llama-3.2-3B × TH	0.1279	0.2694	P16, w = 13	3.80×

Table 9. Profile-level mean IDS variation across 18 profiles per condition. †Llama EN inter-profile SD estimated from observed range.

Condition	Min M (Profile)	Max M (Profile)	Range	SD Across Profiles
gemma EN	0.567 (P6)	0.723 (P14)	0.156	0.038
Llama EN	0.514 (P7)	0.669 (P14)	0.155	0.038
gemma TH	0.543 (P11)	0.784 (P4)	0.241	0.062
Llama TH	0.467 (P13)	0.708 (P15)	0.241	0.059

Table 10. Consolidated cross-condition summary for all key metrics. NR = not reached within tested range. Consistency ratio = Llama within-cell SD/Gemma within-cell SD within same language.

Metric	Gemma EN	Llama EN	Gemma TH	Llama TH
Overall M (SD)	0.635 (0.247)	0.588 (0.244)	0.646 (0.242)	0.608 (0.213)
Pearson r	0.878	0.903	0.866	0.663
B50	w = 7	w = 11	w = 9	w = 9
B70	w = 13	w = 17	w = 13	w = 17
B80	w = 17	w = 21	w = 17	NR
Inflection budget	w = 9	w = 11	w = 9	w = 9
Floor attribute	Gender (0.048)	Education (0.236)	Gender (0.242)	Education (0.152)
IDS ceiling (obs.)	0.852	0.878	0.867	0.796
Plateau onset	w = 21	w ≥ 25	w = 23	None in range
Mean within-cell SD	0.0293	0.0735	0.0337	0.1279
Consistency ratio	—	2.51×	—	3.80×
Inter-profile SD	0.038	~0.038†	0.062	0.059

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lertyosbordin, C.; Iamruttanawong, K. Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions. Appl. Sci. 2026, 16, 5754. https://doi.org/10.3390/app16125754

AMA Style

Lertyosbordin C, Iamruttanawong K. Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions. Applied Sciences. 2026; 16(12):5754. https://doi.org/10.3390/app16125754

Chicago/Turabian Style

Lertyosbordin, Chacharin, and Krittitee Iamruttanawong. 2026. "Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions" Applied Sciences 16, no. 12: 5754. https://doi.org/10.3390/app16125754

APA Style

Lertyosbordin, C., & Iamruttanawong, K. (2026). Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions. Applied Sciences, 16(12), 5754. https://doi.org/10.3390/app16125754

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Content Selection Behavior of an Edge Small Language Model Under Word-Budget Compression: A Cross-Lingual Study of English and Thai Person Descriptions

Abstract

1. Introduction

2. Research Questions

3. Research Objectives

4. Contributions

5. Literature Review

5.1. Text Summarization, Content Selection, and the Problem of What Gets Left Out

5.2. Small Language Models at the Edge: Capabilities, Constraints, and Behavioral Gaps

5.3. Cross-Lingual Behavior: What Changes When the Language Changes

5.4. Thai Natural Language Processing: Why Thai Is a Meaningful Test Case

5.5. Toward a Behavioral Characterization of Edge SLM Content Selection

6. Methodology

6.1. Experimental Design Overview

6.2. Person Profile Construction

6.2.1. Profile Set

6.2.2. Budget Level Selection

6.3. Model Selection and Technical Specifications

6.4. Text Generation Procedure

6.4.1. Prompt Design

6.4.2. Generation Settings

6.5. Information Density Score: Definition and Operationalization

6.5.1. Conceptual Grounding

6.5.2. Attribute Operationalization

6.5.3. IDS Computation

6.6. Statistical Analysis Plan

6.7. Generation and Annotation Pipeline

6.8. Hardware and Software Environment

7. Results

7.1. IDS–Budget Relationships (RQ1)

7.1.1. Overall Magnitude and Monotonicity

7.1.2. Per-Budget Mean IDS

7.1.3. Steepest Gain Intervals and Inflection Points

7.1.4. Non-Monotone Steps

7.2. Minimum Budget Thresholds: B-Score Analysis (RQ2)

7.3. Attribute Salience Hierarchies (RQ3)

7.3.1. Overall Retention Rates

7.3.2. Occupation as the Invariant Content Anchor

7.3.3. Model-Specific Suppression: Gender (Gemma) and Education (Llama)

7.3.4. Budget-Stratified Attribute Retention

7.4. Within-Trial Consistency and Profile-Level Variation (RQ4)

7.4.1. Within-Cell Consistency

7.4.2. Profile-Level Variation

7.5. Cross-Model and Cross-Lingual Comparison (RQ5)

7.5.1. The English 4-Word Rule

7.5.2. Thai: B50 Convergence and Widening Gap at Higher Thresholds

7.5.3. Attribute Salience: Convergence and Divergence

7.5.4. Consistency: Results Summary

7.5.5. Consolidated Cross-Condition Summary

8. Discussion

8.1. Budget as the Primary Driver of Content Selection Behavior

8.2. Compression Efficiency and the 4-Word Rule

8.3. Occupation Primacy as a Universal Content Anchor

8.4. Attribute Suppression Hierarchies and Their Implications

8.5. Cross-Lingual Modulation of Content Selection Behavior

8.6. Consistency as an Architectural Signature

8.7. Implications for Responsible Edge Deployment

9. Conclusions

10. Limitations and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI