1. Introduction
Generative AI for natural language tasks is currently dominated by autoregressive decoder-only transformers [
1]. These models memorise and compress trillions of training tokens, storing factual associations in their weight matrices. However, memorising factual knowledge requires a large number of parameters, since the model must simultaneously learn basic grammar and syntax (possibly across multiple languages), writing style, multiple formats and actual world knowledge [
2]. Updating such a model's world knowledge, or extending it with entirely new information unknown to the model before, requires full pre-training. This process consumes significant computational resources and demands large amounts of filtered, cleaned data for a successful training run. According to the Chinchilla scaling law, a model must see at least 20 training tokens per trainable parameter to avoid underfitting [
3].
Researchers have shown that factual associations are not encoded across the whole model but are usually localised in clusters inside the feed-forward networks (FFNs) of transformer blocks [
4]. They proposed the term “knowledge neurons” to label the parts of the model responsible for recalling specific information. However, even though factual knowledge is localised, it remains difficult to update and expand without full retraining. Even methods like ROME (Rank-One Model Editing) cannot inject a completely new memory or fact into the model [
5]. This approach locates the specific neurons in the FFN that encode a representation of a certain learned association. It then conducts a targeted update of these weights instead of a full retraining of the whole network. For example, if the model has learned the phrase “Kiev is the capital of Ukraine”, then ROME would be a suitable method to update it to “Kyiv is the capital of Ukraine”. However, if we try to inject a completely new fact such as “Gemini 3 Pro is available starting some date”, ROME can struggle, as there is no preexisting embedding for an entity like “Gemini 3 Pro”. ROME can be scaled to thousands of factual updates via Mass-Editing Memory in a Transformer (MEMIT), but even then it still requires meaningful subject activations to find which weights to edit [
6].
Another method is to apply LoRA (Low-Rank Adaptation) to fine-tune a small portion of the model [
7]. LoRA adds adapter layers alongside selected linear layers so that the base model stays frozen and only the adapters are fine-tuned on new data. This limits the trainable weights to less than 10% of the whole model. However, LoRA does not guarantee knowledge preservation and can cause catastrophic forgetting by overwriting already learnt information. It mostly excels at task and style adaptation rather than factual association injection, as it edits only a fraction of the weights [
8].
Therefore, conditional text generation commonly relies on in-context learning with semantic or keyword-based retrieval of contextual knowledge [
9]. This approach is called retrieval augmented generation (RAG) [
10]. It injects new data into a pre-trained LM to condition it for a specific task or domain by passing new knowledge as input tokens into the decoder context window. While this grounds the generation in source data, the model can still misinterpret the retrieved context, give higher priority to knowledge already learned during training, or incorrectly merge learned facts with injected ones.
However, neither RAG nor fine-tuning can restrict the model to a specific set of facts. In both cases, the model can still access information learned during pretraining (unless catastrophic forgetting has occurred). This is useful for many tasks that require a combination of world and specialised knowledge, but it means the model cannot be fully limited to a single domain, or to a localised personal knowledge base, to avoid bringing in bias from the training data [
11].
Another issue with decoder-only conditional language modelling is the interpretability of the model and the explainability of its predictions [
12]. Predictions cannot be directly traced back to the input context, and there is currently no way to attribute each generated token to the instruction, to knowledge learned during training, or to knowledge passed through in-context learning.
Given these issues, the solution must lie not only in scaling training data and parameter counts or extending in-context learning with retrieval systems, but also in architectural design. The present study applies the encoder–decoder architecture to the task of reliably restricted conditional language modelling. First, it proposes a modular architecture called Knowledge Injected Transformer (KIT), which merges a pretrained encoder and decoder into a single model. The decoder is initially trained to be fluent in English without memorising factual associations, while the encoder is tasked with representing the injected information according to the decoder's request for predicting the next token. This approach should avoid creating a large model where most of the weights are dedicated to memorisation, as the encoder and its hidden states should replace the knowledge neurons of decoder-only generative transformers. Second, it introduces a new synthetic dataset for the question answering task with more than 200 thousand context–question–answer triplets. Third, it compares the trained encoder–decoder model with existing decoder-only and encoder–decoder solutions. The current research continues our earlier experiments on the controllability of generative models and efficient usage of the model context window to reduce the number of tokens passed to it.
2. Current Research Analysis
Currently, the main way to improve model performance is to scale the number of parameters and the amount of training data, as this provides more examples of downstream tasks and specialised domain knowledge. However, increasing the model size raises inference cost and pretraining complexity. Recent advances in generative language modelling include attempts to optimise inference to make models more accessible and to reduce the cost of updating a model with new knowledge [
13]. Labs search for ways to distil models without losing their core capabilities, such as reasoning, fluency, consistency and memorised world knowledge. At the same time, the search continues for efficient methods to inject new knowledge and update already learned facts.
The growth of the number of parameters has to be accompanied by a growth in training data samples. The study on the Chinchilla model family proves the critical importance of balancing model size and the training dataset size to get a fit model [
14]. The authors trained hundreds of transformer decoders over a wide range of parameter counts (from tens of millions to 70 billion) and dataset sizes (from billions to hundreds of billions of tokens). The study derives a scaling law, named after the models, by which an efficient model has to be trained on at least 20 times more tokens than it has parameters to minimise loss and maximise downstream performance. The key insight is that scaling parameters alone limits the effectiveness of large language models (LLMs), as they stay underfit. The study also shows that a smaller model with more efficient pretraining can outperform larger counterparts. Subsequent research shows that this ratio can be increased and that small models can be trained on even more tokens to improve their downstream performance and reduce inference cost by replacing larger decoders with more efficient small ones [
15]. However, the Chinchilla scaling law remains foundational to the optimisation of pretraining runs.
Thus, a large model would require at least 20 times as many tokens as it has parameters to be retrained with an updated knowledge base. Training cost and computational complexity remain a central bottleneck for the adoption and creation of custom local models, which motivates research on more efficient training paradigms. Li et al. propose the LOST (Low-rank and Sparse Pre-Training) framework, which integrates low-rank and sparse parameter structures during pre-training to significantly reduce memory and computational overhead compared with conventional full language modelling training [
16]. LOST demonstrates promising results on a multitude of model sizes from 60 million to 7 billion parameters, while lowering compute and memory demands.
There is also a whole research field on decoder transformers’ interpretability, controllability and explanation of their weight structure. These studies try to find out how exactly such models encode the knowledge, syntax and grammar and whether they require a large size to correctly reproduce a coherent and fluent natural language. They also try to locate factual relations among the weight matrices to identify them and allow further updates and corrections, or at least interpret the behaviour of the model.
He et al. introduce decoding probing [
17]. This is a method for analysing a transformer model's internal representations. It treats the hidden states of the model as pseudo “neural activations” and links learned linguistic features to the layers of the model. Decoding probing uses the Benchmark of Linguistic Minimal Pairs (BLiMP) to decode memorised grammatical constructions from intermediate representations [
18]. The approach trains a lightweight classifier on the hidden states of every block of the decoder transformer to distinguish between grammatical and ungrammatical minimal pairs. A minimal pair is a pair of sentences that differ by a small, controlled change (often a word or structure) designed to isolate a specific linguistic feature. The study used GPT-2 for its experiments and shows that a self-supervised language model captures abstract syntax and grammar in the first third of its blocks. These syntactic representations can spread to later layers as the complexity of the language grows. The research also shows that syntactic relations are easier to extract than morphological and semantic features, and that both embeddings and attention mechanisms contribute to encoding complex grammar. These findings suggest that grammatical structure emerges early and is mostly contained in the first blocks of the model.
Dai et al. introduce the concept of knowledge neurons [
4]. They show that factual associations in pretrained transformers are not sparsely encoded across the whole model but are localised in subsets of neurons in the FFNs of transformer blocks. The authors selectively suppress or activate individual neurons and examine how this directly affects the ability to recall specific facts. In this way, they show that these subsets of neurons store factual knowledge rather than grammar, morphology or syntax. The study also determines the distribution of such knowledge neurons across the network: they are rare in the first blocks of the decoder and are mostly concentrated in the middle and last layers of the model. Attention mechanisms are shown to play a secondary role in storing actual content, as it is primarily done by the FFNs. The authors demonstrate that targeted modifications of individual FFNs can update the model's memorised facts.
ROME (Rank-One Model Editing), proposed by Meng et al., identifies these clusters of knowledge neurons and applies a low-rank modification to the FFN weight matrices corresponding to the fact that has to be changed [
19]. ROME enables a targeted weight modification, which minimally affects unrelated knowledge and allows for updating the model without full retraining.
MEMIT (Mass-Editing Memory in Transformers) builds on this idea and provides an ability to conduct batch updates of the facts [
20]. Both MEMIT and ROME leverage the insight that factual knowledge can be localised to certain neurons and demonstrate a careful way to update them without causing catastrophic forgetting, offering an alternative to a full repetition of the pretraining process. However, both methods can lead to a collapse of the model and catastrophic forgetting in the case of inaccurate weight editing [
5].
The TinyStories research investigates how small language models can generate coherent English text when trained on carefully designed synthetic datasets [
21]. This study focuses on the minimally viable model that can be fluent in English without factual knowledge. The authors train multiple decoders based on the GPT-Neo architecture and tokeniser, with up to 33 million parameters, which is smaller than a base GPT-2 (150 million parameters) [
22]. The primary contribution of this work is a synthetic dataset with more than 2 million records generated by GPT-3.5 and GPT-4 [
23,
24]. The dataset employs a restricted vocabulary of approximately 1500 words. Every record is a story that should be easy enough for a 4-year-old child to understand, while emphasising diverse narrative, grammatical and syntactic constructions. The models trained on this dataset are decoder-only transformers with context windows of up to 512 tokens, trained with a causal language modelling objective. The models are evaluated by GPT-4 on fluency, grammar, creativity and coherence. Empirical findings indicate that even models with around 10 million parameters can generate coherent multi-paragraph text. Increasing the model size improves creativity and complexity, but basic grammar is captured with just 8 million parameters.
Encoder–decoder models offer a different paradigm for approaching the problem of computationally efficient models with an updatable knowledge base. They are especially effective when precise input comprehension, strong input–output alignment, and reasoning over long or structured contexts are critical.
The interest in encoder–decoder models has resurged with the release of the T5Gemma model by Google [
25,
26]. This model aims to combine the strong generative capabilities of decoder-only models like Gemma or Gemini with the power of the bidirectional representations created by a transformer encoder. T5Gemma initialises both the encoder and the decoder from the same pretrained Gemma weights. Architecturally, T5Gemma follows the canonical Transformer encoder–decoder layout: the encoder processes inputs bidirectionally to produce rich representations, and each decoder layer attends both to its own past outputs and to the encoder's outputs via cross-attention, enabling effective conditioning on input context for tasks such as summarisation, translation and reading comprehension. T5Gemma models have demonstrated notable improvements on benchmarks requiring deep semantic understanding and reasoning, such as GSM8K [
27] (math reasoning) and reading comprehension tasks.
The current research landscape shows that the problem of updatable knowledge and memory remains unsolved and crucial to the design of modern LLMs, as it makes their continuous support more expensive and difficult in terms of compute, memory usage and data gathering. Moreover, the size of LLMs grows with every new generation, increasing resource demands and making the problem of continuous support even more concerning. Current studies propose ways to locate the neurons responsible for certain knowledge, and they can even update them accordingly, but they lack the capability to inject entire new sets of facts or to reliably restrict the model to a single domain without knowledge leaking from the training dataset [
The present study introduces a modular encoder–decoder architecture for building an efficient and updatable language model suitable for local deployment. Our key contribution lies in training the decoder as a knowledge-agnostic conditional generation engine that focuses solely on grammatical fluency, while factual knowledge is externalised and injected through the encoder. Experiments show that both the base decoder and its instruction-tuned variant generate grammatically correct but factually irrelevant outputs, as they lack any internal world knowledge or ability to conduct in-context learning and tend to default to storytelling or summarisation patterns. In contrast, the proposed KIT model significantly improves factual correctness, relevance, and completeness while maintaining coherent language generation. These results validate the core idea that knowledge representation and language generation can be effectively decoupled: the encoder integrates factual context, and the decoder focuses on surface form. This architectural separation provides a principled foundation for efficient future knowledge updates, where only the encoder must be retrained without modifying the decoder. At the same time, such a split allows for reduced model size as knowledge neurons can be replaced with an external knowledge base, which gets used during inference to restrict and condition the decoder.
3. Materials and Methods
Section 3 provides a detailed description of the model architecture and a full overview of the planned experiments. We start with the definition of the proposed encoder–decoder model. Then, data gathering, generation and filtering are described. The synthetic dataset generated for this study is made publicly available on the Hugging Face platform (
https://huggingface.co/datasets/turuta/QA-generated, accessed on 20 December 2025). Finally, we provide the methodology for the training and evaluation.
3.1. Model Architecture
We propose an architecture called Knowledge Injected Transformer (KIT), which is depicted in
Figure 1. It is designed to allow context and factual knowledge injection without additional training or filling token slots in the decoder context window. This is achieved by merging two pretrained transformer models: a bidirectional encoder and a causal autoregressive decoder.
N in
Figure 1 denotes the number of decoder blocks, while M denotes the number of encoder blocks. We left only one decoder block without cross-attention in this experiment, as the model had just four blocks. For larger models, the number of blocks without context injection via cross-attention can be extended; we suggest leaving approximately the first 25% of decoder blocks free of cross-attention to keep the initially learned language and grammar fluency intact. The remaining 75% of the decoder blocks receive a cross-attention injection to condition the predicted tokens.
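The 25%/75% split described above can be sketched as a small helper. This is an illustrative sketch only, not the project's actual code; the function name and the `keep_frac` parameter are our own:

```python
def cross_attention_blocks(num_blocks: int, keep_frac: float = 0.25) -> list[int]:
    """Return 1-based indices of decoder blocks that receive cross-attention.

    The first `keep_frac` of the blocks (at least one) are left unchanged to
    preserve the pretrained grammar and fluency; the remaining blocks are
    conditioned on the encoder via cross-attention.
    """
    kept = max(1, round(num_blocks * keep_frac))  # blocks without cross-attention
    return list(range(kept + 1, num_blocks + 1))
```

For the four-block decoder used in this study, the helper yields blocks 2, 3 and 4, matching the configuration described above.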
The encoder is initialised from the ModernBERT model by HuggingFace (New York, NY, USA), so it has 149 million parameters and 22 transformer blocks with a hidden dimension of 768 [
28]. The encoder context window is 8192 tokens. There is no task-specific head for the encoder model, as it is used purely as a context source: its hidden states are passed to the decoder blocks for conditional token generation.
Encoder hidden states are first passed to a projection network, which consists of a linear layer, layer normalisation, and a skip connection between the normalised projected hidden states and the original encoder states. This should adapt the encoder outputs to the distribution learned by the decoder to mitigate the distribution shift between two independently trained models. The projection network can be expressed with the following equation:

$H_{proj} = \mathrm{LayerNorm}(H_{enc} W) + H_{enc}$

where $H_{enc}$ stands for the encoder hidden states, $H_{proj}$ is the projected version of the encoder hidden states and $W$ is the weight of the linear layer used to align the encoder and decoder tensors. The skip connection ensures that gradients can flow effectively back to the encoder, while its hidden states get adjusted to the expected distribution of the decoder.
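The projection described above can be sketched numerically in NumPy. This is an illustrative sketch only: the shapes and the `eps` constant are assumptions, and the real model uses trainable PyTorch layers rather than this plain function:

```python
import numpy as np

def project_encoder_states(h_enc: np.ndarray, w: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Sketch of the KIT projection network: linear map, layer normalisation
    over the hidden dimension, and a skip connection back to the raw states.

    h_enc: (seq_len, d_model) encoder hidden states
    w:     (d_model, d_model) linear-layer weight
    """
    projected = h_enc @ w                                # linear layer
    mu = projected.mean(axis=-1, keepdims=True)
    var = projected.var(axis=-1, keepdims=True)
    normalised = (projected - mu) / np.sqrt(var + eps)   # layer normalisation
    return normalised + h_enc                            # skip connection
```

The skip connection means the output keeps the original encoder signal even when the normalised projection contributes little, which is what lets gradients flow back to the encoder during joint training.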
In KIT, the encoder can be viewed as performing implicit, task-driven compression of factual inputs. Rather than applying an explicit compression algorithm, the Transformer encoder maps structured or semi-structured facts into fixed-dimensional dense representations that act as an information bottleneck. Through attention-based aggregation and dimensionality constraints, the encoder selectively retains information that is relevant for downstream generation while discarding redundant or task-irrelevant details. We refer to the encoder representations as an information bottleneck in the practical sense that all factual input must be mapped into a fixed-dimensional latent space optimised for downstream generation. The encoder does not get any signal from the decoder on the task, so it learns to extract and represent the most noticeable facts from the input knowledge during the joint training stage. In this sense, compression emerges as a consequence of representation learning optimised for conditional generation, rather than as a lossless or symbolic encoding of facts.
The encoder produces dense representations that are not directly interpretable. Factual associations are encoded in a distributed manner across the latent space rather than in human-interpretable structures, which allows the encoder to represent complex knowledge effectively and flexibly, but makes it impossible to directly verify the correctness of such representations. Their correctness is instead ensured indirectly through the training signal during joint optimisation, which in this case is the causal language modelling objective. During training, encoder outputs are optimised to provide sufficient and accurate information for the decoder to generate correct answers, as any omission or distortion of factual content leads to increased generation loss. Because the decoder is pretrained to be knowledge-agnostic, it cannot compensate for incorrect or incomplete encoder representations, thereby forcing factual accuracy to be encoded explicitly. The model is also evaluated with the GPTscore [
29], which is defined in
Section 3.3, where one of the key components is the correctness of the generated answer.
The decoder is based on the GPT-Neo model trained on the TinyStories dataset, with just 33 million parameters and 4 transformer blocks with a hidden dimension of 768 [
21,
30]. As elaborated earlier, the TinyStories paper aimed to create the smallest LM fluent in English, so this 33-million-parameter GPT-Neo should be capable of writing fluent, grammatically correct short stories in English similar to children's tales. However, it was not explicitly trained on corpora containing extensive factual knowledge, and it cannot memorise as much information as multibillion-parameter LLMs. We therefore start with a model that already knows how to write correct and coherent English text and inject cross-attention layers into blocks 2, 3, and 4, leaving block 1 untouched. Block 1 of the TinyStories GPT-Neo model is responsible for basic grammar and understanding of English, so we try to recreate the distribution observed in current LLMs: the first block mostly learns grammar and language, the middle blocks are responsible for factual knowledge, and the last blocks memorise style and formatting [
4]. This way, block 1 remains completely unchanged and still has the following structure: Self-Attention, Layer Normalization and a Feed Forward Network. Blocks 2, 3 and 4 follow a structure similar to BART [
31] with Self-Attention, Layer Normalisation, Cross-Attention, another Layer Normalisation, and a Feed-Forward Network. The cross-attention layers use decoder hidden states as queries and projected encoder hidden states as keys and values, which allows the decoder to query the necessary context from the projected encoder representations. The causal language modelling head remains the same, and its weights are still tied to the token embedding layer. The tokeniser dictionary and embeddings also remain unchanged.
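The BART-style block layout described above can be sketched as a PyTorch module. This is an illustrative sketch with assumed hyperparameters, not the actual implementation, which reuses pretrained GPT-Neo weights:

```python
import torch
import torch.nn as nn

class KITDecoderBlock(nn.Module):
    """Sketch of a KIT decoder block with cross-attention: self-attention,
    layer norm, cross-attention, layer norm, feed-forward network."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, enc: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over previously generated tokens.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), 1)
        h, _ = self.self_attn(x, x, x, attn_mask=mask)
        h = self.ln1(x + h)
        # Cross-attention: decoder states are queries, projected encoder
        # hidden states serve as keys and values.
        c, _ = self.cross_attn(h, enc, enc)
        h = self.ln2(h + c)
        return h + self.ffn(h)
```

The residual connections and layer norm placement here follow a generic post-norm layout; the pretrained GPT-Neo blocks may order these operations differently.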
Such a configuration saves space in the decoder context window for the actual conversation history or task definitions and keeps the decoder adjustable for new knowledge without any retraining, as it can gather context from the encoder dynamically. Also, the context can be changed from token to token to restrict the model to a certain source for every section of the generated content. This increases the controllability and interpretability of the model, as every generated token can be traced to the source, and it can be restricted to a small custom knowledge base without mixing pretraining and injected facts.
The forward method is defined as follows:
1. Encode the context with the encoder network and obtain the encoder hidden states tensor;
2. Project the encoder hidden states into the distribution learned by the decoder;
3. Apply token and positional embeddings to the decoder inputs, which should provide a question or a task definition;
4. Pass the embedded tokens, summed with the positional embeddings, to the first decoder blocks containing only self-attention (these blocks have no cross-attention and cannot query the encoder for any context);
5. Pass the decoder hidden states to the subsequent blocks with cross-attention, which injects the projected encoder hidden states into the generation process and conditions it to use the encoded knowledge;
6. Pass the final decoder hidden states tensor into the language modelling head and generate the probability distribution of the next token in the sequence.
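The steps above can be sketched as a single forward function. Attribute names such as `plain_blocks`, `cross_blocks` and `projection` are hypothetical placeholders for illustration, not the actual module names:

```python
import torch

def kit_forward(model, context_ids: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of the KIT forward pass, mirroring the numbered steps above."""
    # 1. Encode the injected knowledge with the bidirectional encoder.
    enc_states = model.encoder(context_ids)
    # 2. Project encoder states into the decoder's learned distribution.
    enc_states = model.projection(enc_states)
    # 3. Token + positional embeddings for the decoder inputs.
    positions = torch.arange(input_ids.size(1), device=input_ids.device)
    h = model.token_emb(input_ids) + model.pos_emb(positions)
    # 4. Self-attention-only blocks (no access to the encoder).
    for block in model.plain_blocks:
        h = block(h)
    # 5. Cross-attention blocks conditioned on projected encoder states.
    for block in model.cross_blocks:
        h = block(h, enc_states)
    # 6. Language-modelling head -> next-token logits.
    return model.lm_head(h)
```

Note that the context tokens never enter the decoder's own context window; they reach the decoder only as projected encoder hidden states via cross-attention.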
During joint training, the decoder learns to utilise the encoder’s latent representations through standard conditional language modelling. At each token prediction, the decoder attends to the encoder outputs via cross-attention and is optimised to assign the next token probabilities given both previously generated tokens and the encoder-provided context. Errors in factual content or relevance directly increase the generation loss, providing a dense supervision signal that encourages the decoder to attend to and extract the relevant information from the encoder representations. No additional alignment, guiding objectives or sampling restrictions are required, as the causal language modelling loss implicitly trains the decoder to associate specific latent patterns with correct output tokens. The KIT model is trained centrally, and its modular architecture is aimed solely at disentangling grammar and syntax understanding from the factual knowledge, so the concept of split learning would not be applicable to it [
32].
The separation of grammatical competence and factual knowledge in KIT is achieved through a modular training strategy. The decoder is pretrained as a domain-agnostic generative model using data that deliberately excludes factual or domain-specific information, encouraging it to focus exclusively on syntactic structure, linguistic fluency, and general generation dynamics. In contrast, the encoder is trained on a broad range of topics to learn compact representations of factual and contextual knowledge. After integration into the unified KIT architecture, the encoder conditions the decoder by providing knowledge representations, while the decoder is responsible solely for generating a grammatically coherent response conditioned on these representations. As a result, factual associations are externalised from the decoder and supplied at inference time via the encoder, enabling the decoder to remain lightweight and parameter-efficient without sacrificing access to knowledge.
3.2. Training Data
We gathered 498,182 context, question, and answer triplets to train the KIT model. The training dataset parts and distribution can be seen in
Table 1.
Context token counts were measured with the ModernBERT tokeniser, while question and answer token counts were obtained with the TinyStories GPT-Neo tokeniser, because the context is passed only to the encoder, while the question and answer are passed only to the decoder part of the model.
The RepliQA [
34] dataset was created as a validation dataset for LLMs, but we used it to train our model, as it contains almost 90,000 structured records with long and short versions of each answer. We used the long version as the context and the short one as the actual answer the model should generate during training. It consists of five independent parts, each of approximately 17,500 records; we used four of them for training.
SciQ [
36] provides crowd-sourced science exam questions on physics, chemistry and biology. The dataset is structured for multiple-choice question answering, but we used only the question, the correct answer and the supporting paragraph that explains and grounds the answer.
PsiloQA [
35] and SQUAD [
33] are similar in structure, as they both use passages with context and a set of question–answer pairs for every passage. Questions are mostly extractive: the model has to find the part of the context that can be used as the answer.
The validation dataset used during training for validation loss measurement consists of the final part of RepliQA [
34] and validation sets of SciQ [
36] and PsiloQA [
35]. This gives us 22,305 validation samples.
Table 2 provides a more detailed description of the validation dataset.
We want to focus specifically on the newly developed synthetic dataset used to train KIT. It was generated with the gpt-5-nano model (training data cutoff: May 2024) with the reasoning effort setting set to “minimal”. We used the seeding method to create question–context–answer triplets from the SAMSum and TinyStories datasets. Temperature and top-p were both kept high at 1.0 to keep sampling and generation more diverse. Such sampling parameters should make the generation more natural and cover more linguistic styles.
The SAMSum corpus contains 16,000 messenger-like conversations developed manually by linguists [
37]. They imitate multiple styles and cover different topics, so chats can be formal, informal or semi-formal. They contain slang words, typos and emojis.
The TinyStories paper was already mentioned above; here we used the original dataset only as a source of context injection for the KIT model, so it is passed into the encoder part, which has never seen this data before. Thus, even though the decoder was originally pretrained on TinyStories, it has to generate completely new answers based on stories from this dataset, while the encoder has to represent them as the context for those answers. TinyStories contains 2 million small tales and stories, but we used only 10,000 of them for dataset generation due to the time complexity of synthetic data generation.
As was said earlier, we used the seeding method to create these synthetic triplets, which means that original texts from both datasets were passed to the gpt-5-nano model as a source, and then the model was tasked to generate multiple unique questions and answer pairs, where the answer can be derived from the context (original content) [
38]. The model was tasked to generate both general questions about the mentioned events (their reasons, consequences, how and when they happened) and more extractive questions, where the answer would be the name of a person, a location, a specific thing or a timestamp mentioned in the seed content. The model generated up to 10 question–answer pairs for every provided piece of context.
We used the following prompt for synthetic question answering generation: “Role: You are an expert instructional designer and content analyst. Task: Create a concise flashcard-style FAQ based strictly on the provided context. Instructions. Extract Key Facts: Identify the most important concepts, dates, names, or processes within the text. Question Design: Write unique questions that are direct and specific. Avoid vague phrasing. Answer Design: Answers must be brief (ideally one sentence) and should not include “fluff” or introductory phrases like “According to the text...” Strict Grounding: Use only the information provided in the context. If the answer isn’t there, do not invent it. Context: <context>{context}</context>”. We used role prompting and a zero-shot learning approach for this task. Context would be the source text (either a story from TinyStories or a dialog from SAMSum). We marked it with an XML tag to highlight it for the model and separate it from the instruction itself. The output structure is a list of JSON dictionaries with 2 keys: “q” (question) and “a” (answer).
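The JSON output format described above could be parsed as in the following sketch (illustrative only; the helper name and the defensive validity checks are our own assumptions, not part of the actual pipeline):

```python
import json

def parse_qa_pairs(raw: str) -> list[dict]:
    """Parse model output shaped as a JSON list of dictionaries with keys
    "q" (question) and "a" (answer); malformed entries are dropped."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparsable generation: discard the whole sample
    if not isinstance(items, list):
        return []
    return [it for it in items
            if isinstance(it, dict)
            and isinstance(it.get("q"), str)
            and isinstance(it.get("a"), str)]
```

Dropping malformed entries at parse time keeps the later similarity-based filtering stage focused on content quality rather than format errors.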
The next step was the filtering of generated samples. We chose not to filter data by length or token count, as answers were specifically prompted to be short (no small talk, explanation or introduction, only the actual answer). The first stage of the filtering was deduplication of question–answer pairs, which was uncommon (less than 1% of records), as every seeded context piece is a unique dialogue or story. Then, we compared the question and the context, and the answer and the context, by cosine similarity using sentence embeddings obtained from the Gemini Embeddings 001 model by Google DeepMind with the “RETRIEVAL_DOCUMENT” task type and an embedding size of 768 [39]. A sample was kept in the dataset only if both the question–context and answer–context pairs scored cosine similarities above 0.9. This method can only verify that the generated pair is similar to the seeded context. However, it helped to filter hallucinations and deviations from the provided source material without human evaluation or LLM-as-judge verification, which would require more time.
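The similarity filter described above can be sketched as follows (a pure-Python illustration using the 0.9 threshold; in practice the embedding vectors come from the Gemini Embeddings 001 model, not from this snippet):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_sample(q_emb, a_emb, ctx_emb, threshold=0.9):
    """Keep a sample only if BOTH question-context and answer-context
    similarities exceed the threshold, as in the filtering stage above."""
    return (cosine_similarity(q_emb, ctx_emb) > threshold
            and cosine_similarity(a_emb, ctx_emb) > threshold)
```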
As a result, we generated 247,608 valid entries to train a model for the question answering task using a provided context as a source. The dataset is fully open-sourced via the HuggingFace Hub and can be used freely for further research and model training.
Table 3 provides more details on the proposed synthetic dataset.
This gives 525,218 training samples and 22,305 evaluation samples in total, which were used to conduct experiments in the current research.
3.3. Training Methodology
The KIT model uses pretrained encoder and decoder models as a foundation. The GPT-Neo decoder was already pretrained on 476.6 million tokens on the causal language modelling task. Since we used the 33-million-parameter version of the model, it could have been pretrained on an even higher number of tokens to comply with the Chinchilla scaling law: the current ratio of tokens to parameters during pretraining of TinyStories GPT-Neo 33M is 14.4, while the Chinchilla scaling law recommends 20 tokens per parameter. The ModernBERT encoder with 150 million parameters is stated to be trained on 2 trillion tokens of English and code, which gives 13,333 tokens per parameter.
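The token-per-parameter ratios quoted above follow from a quick arithmetic check:

```python
# Tokens-per-parameter ratios quoted in the text.
gpt_neo_ratio = 476.6e6 / 33e6     # TinyStories GPT-Neo 33M pretraining
modernbert_ratio = 2e12 / 150e6    # ModernBERT (150M params, 2T tokens)
chinchilla_target = 20             # recommended tokens per parameter

print(round(gpt_neo_ratio, 1))     # 14.4
print(round(modernbert_ratio))     # 13333
```

Both ratios fall short of the Chinchilla recommendation, which is the basis of the observation that the decoder could have absorbed more pretraining tokens.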
Next, we have to train both models to work as parts of one architecture, where the encoder provides conditions and context for the generative decoder. The context is passed directly to the encoder without any preprocessing or formatting. The question and answer are formatted into the following XML string template: “<question>{question}</question><answer>{answer}[EOS]”. The question is always encapsulated in XML tags, and the generation has to stop once the answer is complete, so no closing tag is needed there. Several training approaches were applied during our research on the KIT model as we tried to find the best one for this task:
Freeze both encoder and decoder for the first part of the training. Unfreeze only the projection network and train the KIT model on 25% of the training data for 1 epoch. Then, unfreeze decoder blocks with cross-attention layer (blocks 2–4) and train the projection network and the decoder blocks 2–4 for another 3 epochs on full data. The aim was to keep English fluency and avoid breaking the model by keeping the first block of decoder and token embeddings tied with the language modelling head frozen. As stated earlier, the first 25% of decoder blocks are left without cross-attention, so in this setup, all blocks without cross-attention should remain frozen. The first step uses a higher learning rate (5 × 10−5 in our case) because the projection network is a set of new layers, which were not trained earlier. Then, we lower the learning rate to 5 × 10−6 for the second step to avoid changing weights significantly and potentially breaking pretrained models.
Freeze the whole model. Unfreeze the last N blocks of the encoder (in our case, the last 5 blocks out of 21 blocks of the ModernBERT) and the projection network, and train them on 25% of the training data for 1 epoch. The idea is to align distributions between the encoder and decoder before the actual training starts. Then, the next step would be to unfreeze all decoder blocks with cross-attention and train the last N encoder blocks, the projection network and decoder blocks with cross-attention layers for 3 epochs on full training data. Token embeddings and first decoder blocks without cross-attention are kept frozen in this setup. The learning rate for the first step should be 5 × 10−5, and for the second step, it should be lower (we used 5 × 10−6).
The third approach is to freeze the encoder and then unfreeze only its last N blocks (5, as in the second approach). The decoder and projection network are fully unfrozen in this setup. The network is trained end-to-end on the full training dataset for 3 epochs. This way, we help the decoder quickly adapt to using encoder hidden states and keep most of the encoder unchanged by training only its last N blocks to align its outputs with the expected distribution of the decoder. The learning rate is 5 × 10−5 for this setup, and there is only one stage.
The fourth approach is the same as the third one, but we add new special tokens to the decoder tokeniser for the tags <question>, </question>, and <answer>. This should reduce the number of tokens the model spends on formatting the content.
The fifth and final method is the same as the third and fourth ones, but we mask the loss for the question part of the text the decoder has to generate. This should help the model learn to generate the actual answer instead of reproducing the question.
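As a side note, the decoder-side template shared by all of these approaches can be sketched as a small formatting helper (the EOS string below is an assumption standing in for the tokeniser's actual end-of-sequence token, and the helper name is ours):

```python
EOS = "<|endoftext|>"  # assumption: GPT-Neo-style EOS marker; use the real tokeniser's token

def format_decoder_text(question: str, answer: str) -> str:
    """Build the decoder training text: the question wrapped in XML tags,
    followed by an opening <answer> tag and the answer itself. No closing
    </answer> tag is needed because generation stops at EOS."""
    return f"<question>{question}</question><answer>{answer}{EOS}"
```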
Methods 1 and 2 were inspired by the training methodology of Vision LMs, where both the Vision Transformer and the generative decoder are first trained separately, then connected via a projection layer and trained together while components are gradually unfrozen. The core concept of these methods is to tune only the new layers or the last blocks of the encoder to reduce the number of modified parameters during backpropagation, which saves memory and computation budget and makes the training faster. Another idea behind this approach is to prevent catastrophic forgetting in the already trained decoder. The model could potentially adapt representations to the task without disturbing the decoder’s generative behaviour.
Training loss gets logged every 100 steps, evaluation happens every 5000 steps and model weights are saved every 4000 steps.
Table 4 describes the software and hardware environment used in this research, along with the training hyperparameters:
3.4. Evaluation Methodology and Metrics
The evaluation during the training is conducted only by measuring the validation loss. However, once the training is complete, we evaluate the proposed KIT model on a subset of the RepliQA benchmark, using one-fifth of the dataset for evaluation. This subset contains 17,950 triplets consisting of context, question, and answer.
We benchmark it against multiple other language models, such as GPT-5-Nano and Gemma-T5. GPT-5-Nano was selected as it represents the smallest publicly available model in the GPT-5 family while retaining a standard decoder-only transformer architecture. This comparison allows us to assess whether a compact, modular encoder–decoder model such as KIT can achieve performance comparable to a multi-billion-parameter decoder-only model. Gemma-T5 was chosen as a strong encoder–decoder baseline, as it is a recent generative transformer that shares architectural principles with KIT, enabling a more direct comparison within the same model class.
The evaluation also involves the original TinyStories 33M decoder, which was used to initialise the KIT model decoder, and an instruction-tuned version of TinyStories 33M. These models get the same set of questions and context pieces as every other model. The aim of this evaluation and comparison is to determine how the KIT architecture and further fine-tuning affect the behaviour of these decoders in terms of answer relevance and language quality. This way, we can determine whether the KIT model retains the grammatical fluency of the original knowledge-agnostic decoder and whether it boosts the accuracy of responses.
The dataset has an ideal version of the answer, but we decided not to use character-matching metrics like ROUGE [40] or BLEU [41], as they cannot evaluate the correctness of the answer if the model rephrases it. Instead, GPTscore was used to compare the model-generated answer, the ideal version, the question and the context [29,42]. This metric allows us to create a process similar to real human evaluation, where multiple aspects like correctness, fluency, style and spelling can be evaluated all at once. The score should not penalise the answer for being different from a human-written version if it is correct in terms of content and language.
We use the gpt-5-mini model with top-p 0.0 as the evaluator. We measure 5 binary features (1 indicates a passed check and 0 a failed one):
Correctness. This evaluation would be passed if the answer is factually correct and the meaning corresponds to the ideal version and context, even if there are minor grammatical or spelling mistakes.
Grammatical and syntactic fluency. The text has to be coherent and correctly structured in order to pass this metric.
Spelling. If there are multiple spelling errors that reduce the readability of the text, the answer is penalised with a 0.
Relevance. The answer has to directly address the question and be grounded in facts from the context.
Completeness. The answer has to cover all the key points mentioned in the context and the human-written version of the answer.
The final score can be calculated in the following way:

Score = (1/N) × Σ_{i=1}^{N} f_i,

where N is the number of features evaluated by the LLM, and f_i is the value of each feature. The score lies in the [0, 1] range, where 0 means none of the features passed, and 1 means all of the checks were passed successfully. For every question, we generate 3 answers and choose the best evaluation result for further aggregation.
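The scoring and best-of-three aggregation described above amount to the following (a minimal sketch; the function names are ours):

```python
def gpt_score(features):
    """Mean of the binary feature checks (correctness, fluency, spelling,
    relevance, completeness), giving a score in [0, 1]."""
    return sum(features) / len(features)

def best_of_three(feature_sets):
    """Three answers are generated per question; keep the best score."""
    return max(gpt_score(f) for f in feature_sets)
```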
4. Results
Section 4 is organised to present all obtained results (both failed and successful ones) during KIT training and evaluation. First, we show and explain which training methods did not bring the expected result and were rejected in the end. Then, we describe the successful training run and, finally, compare the trained KIT model to others on contextual question answering benchmarks.
4.1. Failed Training Runs
First, we want to clarify the choice of the label smoothing value in this section. All reported experiments use label smoothing set to 0.0, meaning the model is required to assign full probability mass to the ground-truth next token and receives no reward for distributing probability to other tokens. We also tried a few runs with smoothing set to 0.1 and 0.05 using the third methodology (training the model end-to-end, with all but the last encoder blocks kept frozen). For both such runs, the training and validation loss values decreased steadily. However, once the training was stopped after 120 thousand steps, the model turned out to be repetitive and could only generate either “<” or “>” characters. It overfitted to the XML structure instead of the actual answer tokens, as these characters appear frequently and are predictable, since all samples use the same template. The optimal solution shifted towards allocating probability mass to high-frequency structural tokens, since this reduces the loss even if the predicted token is actually incorrect.
Figure 2 and Figure 3 show the train and validation loss for the training run with label smoothing set to 0.1.
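To illustrate why non-zero smoothing changes the optimisation target, the label-smoothed target distribution can be sketched as follows (a pure-Python illustration, not the actual training code; some implementations spread the mass over the vocabulary excluding the gold token, but the effect is the same):

```python
def smoothed_targets(gold_index: int, vocab_size: int, epsilon: float):
    """Label-smoothed target distribution: (1 - epsilon) on the gold token,
    with epsilon spread uniformly over the whole vocabulary. With epsilon = 0
    the gold token must receive the full probability mass."""
    base = epsilon / vocab_size
    targets = [base] * vocab_size
    targets[gold_index] += 1.0 - epsilon
    return targets
```

With epsilon > 0, frequent structural tokens like “<” and “>” always carry some target mass, which is one way to see why the optimiser can drift towards them.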
Then, we verified methods 1 and 2 with two-step training, where most of the model is kept frozen during the first stage (either only the projection network is trained, or the projection network together with the last N blocks of the encoder). Both approaches failed: the model was able to capture features from the encoded context, but lost fluency and became repetitive. The best result after two stages was generating a part of the actual correct answer, but it would either be repeated until the max tokens limit was exceeded, or the model could not generate the full answer. Also, both runs kept a high gap between the training and validation loss (above 1.0, decreasing to 0.5 after the second stage). Increasing the number of gradient accumulation steps did not improve this (we tried two and three accumulation steps for these runs at both stages). Loss trajectories and issues were similar for both approaches, and unfreezing the last N encoder blocks during the first stage did not improve the result, so in Figure 4, Figure 5, Figure 6 and Figure 7, we show the loss trajectory for the second methodology only, as the first one followed a similar pattern.
The second stage starts with a significant increase in training loss, followed by a rapid decline to a range similar to the end of the first stage. This coincides with the decoder blocks being unfrozen: the model changes its predictions substantially and has to stabilise before it can optimise further.
Training method 3 turned out to be the only successful run out of the five approaches we tried, so we focus on it in the next subsection. However, methods 4 and 5, which were defined as slight modifications of method 3, did not prove to be successful. Method 4 uses the same end-to-end training as method 3, but replaces the naturally tokenised XML tags <question> and <answer> with new special tokens. However, the training loss plateaued at 1.5 and stopped decreasing steadily. The model lacks pretrained embeddings for these new tokens, leading to poorer conditioning of the decoder and slower optimisation. In contrast, representing the structure using standard subword tokenisation and already learnt token embeddings for the words “question” and “answer” results in more stable training despite a higher token count.
Figure 8 shows the training loss plot for this experiment.
Finally, method 5 applies loss masking to the question portion of the decoder content, so the decoder loss is calculated only over the answer tokens after the <answer> tag. This approach also fails: the training loss decreases steadily, but the model converges to a solution in which it prematurely predicts the EOS token. Answers in the training dataset are mostly short (approximately eight tokens per answer on average), so early termination becomes a low-loss strategy for the masked objective.
Figure 9 shows the training loss plot for this experiment.
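The masking itself can be sketched as follows (a minimal illustration; IGNORE_INDEX follows the common -100 convention of typical cross-entropy implementations, and the helper is ours):

```python
IGNORE_INDEX = -100  # label value skipped by typical cross-entropy implementations

def mask_question_tokens(label_ids, answer_start: int):
    """Replace labels for the question part (everything before the answer
    tokens) with IGNORE_INDEX, so the loss covers only the answer span."""
    return [IGNORE_INDEX] * answer_start + label_ids[answer_start:]
```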
4.2. Successful Training Run Results
As stated earlier, training methodology 3 proved to be the only successful experiment. This approach unfreezes the whole decoder (including token embeddings, the language modelling head and all blocks), the projection network and the last N blocks of the encoder model (five in our case). Then, the model is trained end-to-end on the whole dataset in just one stage, without unfreezing any additional layers later during training.
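The resulting freeze plan can be sketched as follows (a minimal illustration; the component names are ours, while the block count follows the 21-block figure given earlier):

```python
def freeze_plan(num_encoder_blocks: int = 21, unfrozen: int = 5):
    """Map each component to a trainable flag, following training method 3:
    the projection network and the full decoder are trainable, while the
    encoder is trainable only in its last `unfrozen` blocks."""
    plan = {"projection": True, "decoder": True}
    for i in range(num_encoder_blocks):
        # Only the last `unfrozen` blocks (closest to the output) get gradients.
        plan[f"encoder.block.{i}"] = i >= num_encoder_blocks - unfrozen
    return plan
```

In a framework like PyTorch, such a plan would translate into setting `requires_grad` on the corresponding parameter groups before training.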
The training run was scheduled to last for 3 epochs, but it was stopped after 1.5 epochs due to the validation loss reaching a plateau around 1.0, followed by only marginal improvements. We achieved the training loss of 0.7365 and validation loss of 0.9931 by the end of the run after 5.5 h of training on a single Nvidia L4 GPU in the Google Colab environment.
Figure 10 and Figure 11 show the train and evaluation loss curves for this experiment.
The training loss rapidly decreases during the initial phase, followed by a slower, gradual decline. This indicates consistent optimisation without loss oscillations or divergence. The evaluation loss follows a similar trend with no signs of overfitting or sudden degradation during the training. The difference between training and validation loss mostly stays in the interval of 0.3–0.4, so the validation loss follows the training loss steadily.
This approach is the only one to achieve stable optimisation, where the model does not converge to degenerate solutions. The trained model does not become repetitive, and it can clearly repurpose the context from the encoder into an answer to the provided question. This is shown in the next subsection, as we saved the resulting weights and used them for further benchmarking.
The success of this approach can be attributed to the simultaneous adaptation of the encoder output distribution and allowing the decoder to reliably learn how to attend to and utilise encoder outputs. Gradient flow is restricted to the last few blocks of the encoder, enabling task adaptation without disrupting already learned semantic associations and representations in the latent space. Such a joint update of weight matrices between the decoder, the encoder and the projection network leads to more stable convergence, avoiding degenerate behaviour, as the encoder is trained on the decoder’s training signal and co-optimises as a result.
Phased unfreezing with step-by-step task adaptation proposed in methods 1–2 leads to a non-stationary training signal as encoder output distribution or the projection network gets constantly updated, while the decoder stays frozen for the whole first stage of the tuning. This way, the same input would produce different latent space representations from the encoder, which makes previously learned attention patterns of the decoder irrelevant and suboptimal. Encoder latent representations shift without feedback from the decoder. When the decoder gets unfrozen at the next stage, it has to adapt to the latent space, which was not jointly optimised for the conditional generation task.
The observed repetitiveness of the checkpoints produced by the first two training methods can be caused by the mentioned misalignment between the decoder’s attention mechanism and the encoder latent space. Once the decoder is unfrozen, cross-attention receives an unstable or poorly calibrated signal, reducing its ability to condition next-token probabilities on encoder outputs. In such cases, the decoder increasingly relies on autoregressive self-attention, which favours locally high-probability continuations such as repetition of a token or a sequence of tokens.
Joint training enables continuous adaptation between the encoder latent space and the decoder cross-attention mechanism. The conditioning training signal remains consistent across the whole training run. This prevents conditional entropy collapse and avoids degenerative repetition behaviour.
Although methods 4 and 5 were derived from the successful third one by introducing additional constraints, they destabilised the conditional generation training signal. In Method 4, the introduction of special structural tokens increases sequence complexity without improving semantic conditioning, resulting in slower convergence and weaker encoder–decoder alignment. In contrast, Method 5, which applies loss only to the answer tokens, significantly reduces supervision over the full conditional sequence and biases the decoder toward early end-of-sequence (EOS) prediction. This indicates that the success of the third method arises from the stable constant joint optimisation instead of the loss or vocabulary constraints.
4.3. Benchmarking of the Model
GPTscore [13] was calculated for three outputs of every benchmarked model for each validation sample. We benchmarked the trained and saved version of KIT against gpt-5-nano (an instruction-tuned decoder-only large language model), Gemma T5 2B (an encoder–decoder large language model), the original TinyStories 33 million base decoder (which was used to initialise the decoder for the KIT model) and an instruction-tuned TinyStories 33 million decoder. The size of gpt-5-nano is not disclosed, but the Gemma model has 100 times more active parameters than the KIT model after the first forward pass (once only the decoder is active). Even accounting for the encoder and three decoder forward passes due to best-of-three sampling, the number of active parameters during inference remains below 200 M, which is an order of magnitude smaller than Gemma T5 2B. The initial TinyStories models are used to compare KIT-produced results with their originally generated answers in terms of the coherence and quality of the generated text. These models are not expected to produce correct and relevant answers due to a lack of specialised training.
The KIT checkpoint obtained as a result of training method 3 was used for this evaluation. The other four checkpoints were excluded as empirical tests proved them incapable of generating coherent language: the models either stopped the generation too early or fell back to repeating the same patterns. Including them would not provide any meaningful comparisons due to a lack of actual text to evaluate, so we focus on the checkpoint that is able to generate valid text using the injected knowledge.
Table 5 shows the resulting values of the GPTscore for every benchmarked model on 17,950 validation samples from the RepliQA dataset.
The KIT model achieves comparable results in terms of correctness (0.90), but it underperforms larger models on grammar, spelling and completeness. This indicates limitations in stylistic fluency rather than factual retrieval issues. Considering the difference in size between the models and the limited training data and computational resources, these results can be considered successful. The model shows potential for further training as its fluency and the completeness of retrieval can be improved with more data and higher quality of training samples. The same can be said about relevance, as it can be improved with an extended training run and an improved synthetic data generation pipeline.
It is also worth noting that the larger encoder–decoder Gemma T5 2B is comparable in performance to the decoder-only gpt-5-nano.
The original decoder and the instruction-tuned one produce almost identical results, with a slight difference in relevance. Both models generate irrelevant answers, which are mostly correct in terms of grammar and do not provide any real factual knowledge required to answer the question. The lack of accuracy and relevance for the base decoder can be attributed to its lack of instruction fine-tuning, as it keeps generating stories instead of actual answers. The analysis of instruction-tuned model answers shows that it still struggles with in-context learning tasks and either falls back to generating stories like in pretraining or tries to process the task as a summarisation problem and fails to capture correct details. The attempt at summarisation is caused by the high number of such samples in the TinyStories Instruct training dataset.
However, the most important conclusion from these results is that the KIT model increases the correctness, relevance and completeness of answers in comparison to the basic knowledge-agnostic generative decoder and its instruction-tuned version. It is able to generate correct answers without completely losing the ability to write grammatically correct text. The drop in grammar and spelling (from 0.81 to 0.65 and from 0.55 to 0.39) can be explained by the quality of the training data, as most samples had short answers consisting of just a few words instead of a complex sentence. In fact, a comparison in terms of grammar or spelling quality would probably be unfair for these models, as KIT attempts to generate text relevant to the actual context and question, while the TinyStories decoders just generate texts similar to patterns learned in their pretraining. The results produced by KIT and these decoders are too different in terms of content to allow any meaningful comparison of language coherence.
The comparison between the original knowledge-agnostic decoder and the KIT model suggests that the KIT architecture effectively decouples knowledge representation from language generation: the encoder focuses on integrating factual context, while the decoder handles surface grammar and coherence. This modular separation provides a foundation for future knowledge updates or extensions, where new information can be passed to the encoder for conditional generation, allowing the decoder to remain independent and focus solely on producing coherent text.
5. Discussion
The encoder–decoder framework named KIT, which was presented in the current research, produced a 0.66 mean GPTscore over 17,950 validation samples (the best answer out of three was chosen for every sample). It stays competitive with larger models like gpt-5-nano and Gemma T5 2B in mean correctness of the generated answer, despite its significantly smaller size and its limitations in terms of training data and compute (in comparison to the two other models). These results position KIT as a parameter-efficient alternative for knowledge-grounded question answering, where correctness is prioritised over stylistic fluency. It should be scaled further in terms of training data size and parameter count to verify this result and check how scaling laws apply to the KIT architecture.
Inspection of the generated answers shows that the model fails mostly on long answers that require a broad and detailed response. The model excels at citing short facts like names, dates or numeric values such as percentages or differences. However, it can produce less natural and more repetitive responses when it has to give a long explanation as an answer. This behaviour is likely driven by the training data distribution, where the mean answer length is approximately 8 tokens, limiting the model’s exposure to long-form generation. Importantly, this suggests a data-centric rather than architectural limitation, indicating that performance on long answers may improve with longer targets and curriculum-style training. The observed repetitiveness appears to reflect stylistic limitations rather than a pathological degeneration mode, as the repetitions remain semantically consistent and localised to explanatory responses.
Comparison of the jointly trained KIT model to the original knowledge-agnostic generative decoder by GPTscore on the question answering task proves that the model keeps its ability to generate coherent and grammatically correct English text, with a drop in quality caused by the training data format. At the same time, the proposed architecture allows for injecting new knowledge into the decoder dynamically and increases the accuracy of the answers, as the original knowledge-agnostic decoder was not able to produce relevant answers even after instruction tuning and context being available.
Multiple methodological findings emerge from the conducted experiments. The KIT model should not use a similar methodology to Vision LMs as it leads to a significant performance degradation (repetitiveness, lack of fluency and difficulty generating consistent and correct answers). Such methods require multiple stages of training with gradual unfreezing of model layers, which makes the training longer and more expensive without achieving the expected result or gaining any improvements over a simple end-to-end training with one stage. Label smoothing, loss masking or injection of new special tokens do not lead to any improvements, either, as the optimisation either slows down or the optimal solution shifts into early termination or structural generation without capturing the semantic features of the answer.
In practice, this strategy of teaching only new layers and some encoder blocks performed worse than joint fine-tuning, likely because task adaptation required coordinated updates in both the encoder and decoder. The frozen decoder was not able to fully leverage the modified encoder representations, indicating that the joint encoder–decoder interaction is critical for optimal performance. This points to the need to train the proposed architecture end-to-end as a single model, as separate fine-tunings of the encoder or decoder could lead to performance degradation.
One of the advantages of the proposed architecture that has to be highlighted is the computational cost of running the trained KIT model. It remains almost identical to that of the original decoder-only GPT-Neo with 33 million parameters, as the 150-million-parameter ModernBERT encoder has to be called only once per generation to obtain its hidden states. Moreover, the encoder hidden states can be cached for further usage, which eliminates calls to the encoder portion of the model completely. This keeps the number of active parameters under 40 million most of the time, which makes the model usable even for CPU-only inference on low-RAM computers.
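The caching of encoder hidden states mentioned above can be sketched as follows (a minimal illustration; the class and its interface are ours, with the context text hashed to form the cache key):

```python
import hashlib

class EncoderCache:
    """Cache encoder hidden states keyed by a hash of the context text, so
    repeated generations over the same context skip the encoder forward pass."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # callable: context text -> hidden states
        self._store = {}
        self.hits = 0

    def get(self, context: str):
        key = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self.encode_fn(context)
        return self._store[key]
```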
Several limitations of the current research have to be highlighted. First of all, we would like to reproduce the achieved results with the models of different sizes (both smaller and larger ones) to check the scalability of the proposed architecture. Currently, it is limited by available computational resources, so we plan to check how KIT works with larger and smaller decoders in our further research. Second, the dataset was mostly built around triplets with a small context. A total of 80% of samples had context length less than 1000 tokens, so the current version of KIT works best with short context and can heavily hallucinate when the context is large. We plan to expand the synthetic data generation pipeline and train a further version for longer context and answers. Also, we admit that the number of training samples can be significantly increased from ~500 thousand to possibly achieve better results, even with a model of similar size to the first version of KIT. It can be done both by scraping more pre-built datasets and by the already mentioned extension of the synthetic data generation pipeline.
Another significant limitation of the first version of KIT is that it is trained with the same mandatory piece of context conditioning every token of every training sample. We would like to check whether it would be possible to avoid using encoder hidden states at all for some parts of the generation (such as structure, style, possible small talk or introductory phrases) and whether it would be possible to merge the current KIT architecture with search engines to allow a different context to be chosen for every token. This way, the model would be able to cite multiple sources during the generation of one text, and every token could be traced to its origin.
The first version of KIT remains prone to hallucinations and errors in context attribution, particularly in cases involving complex entity names or ambiguous terms with multiple possible interpretations. These failures suggest limitations in context recall and grounding under increased semantic ambiguity. This can be improved by increasing the number of training samples and the parameter count.
Context utilisation could likely be improved through further training on more diverse datasets, including samples with greater variation in answer length, complexity, and question formulation. In particular, increasing the diversity of question styles and retrieval strategies may help the model generalise beyond the relatively homogeneous patterns present in the current training data.
Additionally, the current training setup assumes that the provided context always contains sufficient information to answer the question. As a result, KIT has not been exposed to cases where the correct behaviour is to abstain or indicate that the answer is not present in the context. Investigating training regimes that explicitly model such unanswerable cases remains an important direction for future work. This limitation is primarily attributable to the lack of representative training data rather than an inherent architectural disadvantage.
Extending the pretraining of the decoder (before the end-to-end encoder–decoder training) is another direction for further research on the KIT architecture. The decoder can be trained not only on simple stories that teach it grammatical and syntactic fluency, but also on examples of math, coding, and general reasoning tasks. We want to check whether the knowledge-injected encoder–decoder architecture can solve complex tasks that require both reasoning and external knowledge without memorising the knowledge itself, retrieving it from the encoder instead. Potentially, such an architecture should work for small reasoning models, where the main task of the decoder is to maintain a valid reasoning chain and decompress useful context from the encoder hidden states.
6. Conclusions
This research proposes an encoder–decoder generative transformer architecture named KIT (Knowledge Injected Transformer), trained to solve the question answering task without requiring factual memorisation. This is achieved by merging two pretrained models: a bidirectional transformer encoder and a small autoregressive decoder trained only to be fluent in English. The model remains computationally efficient, as it does not store facts in its weights; its memory can be updated without full retraining, or it can be restricted to a specific, narrow context with no fine-tuning. The model can be efficiently trained end-to-end with the top encoder blocks kept frozen, so that the distribution of the encoder's hidden states is aligned with the decoder without changing its learnt behaviour.
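As a minimal illustration of the merge described above, the following NumPy sketch shows a single cross-attention step in which decoder hidden states query frozen encoder outputs. All dimensions, weight matrices, and function names here are illustrative stand-ins under simplified assumptions, not the actual KIT implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_h, enc_h, W_q, W_k, W_v):
    # Decoder queries attend over (frozen) encoder hidden states,
    # injecting contextual knowledge into each generated token.
    q = dec_h @ W_q          # queries from the decoder
    k = enc_h @ W_k          # keys from the encoder context
    v = enc_h @ W_v          # values from the encoder context
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

d = 16
enc_h = rng.normal(size=(12, d))   # stand-in for frozen encoder outputs (12 context tokens)
dec_h = rng.normal(size=(5, d))    # decoder hidden states for 5 generated tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out = cross_attention(dec_h, enc_h, W_q, W_k, W_v)
print(out.shape)  # (5, 16): one context-enriched vector per decoder token
```

Because the encoder weights stay fixed, only the decoder-side projections need to learn to read the encoder's hidden-state distribution, which is the alignment described above.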
The research benchmarks the novel encoder–decoder model against a larger model of a similar architecture (Gemma T5 2B) and the decoder-only gpt-5-nano. Benchmark results indicate that the KIT model is competitive in terms of fact recall and answer correctness, but it underperforms larger models in the stylistic quality of long answers. The observed issues are data-driven and can be addressed by further expanding the training dataset and increasing the diversity and quality of samples.
The work highlights the relevance of the encoder–decoder architecture for small language models, which have been overshadowed by decoder-only models in the recent research landscape. The study also contributes a new synthetic dataset for the contextual question answering task with almost 250 thousand context–question–answer triplets.
The practical novelty lies in building a small transformer model that can be run locally and gradually updated with new knowledge without any training. The architecture is interpretable, as every generated token can be traced to the encoder context, and inference can be optimised by caching encoder outputs for reuse. This offers an alternative to decoder-only models for local RAG or question answering systems with a low computational budget.
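The caching optimisation mentioned above can be sketched in a few lines. The names `encode` and `dummy_encoder` are hypothetical, used only for illustration, and the sketch assumes encoder outputs depend solely on the context string:

```python
import hashlib

# Hypothetical cache: encoder hidden states for a fixed context are
# computed once and reused across all questions about that context.
_cache = {}

def encode(context, encoder_fn):
    key = hashlib.sha256(context.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = encoder_fn(context)
    return _cache[key]

calls = 0
def dummy_encoder(text):
    # Stand-in for a real encoder forward pass.
    global calls
    calls += 1
    return [len(word) for word in text.split()]

h1 = encode("Kyiv is the capital of Ukraine.", dummy_encoder)
h2 = encode("Kyiv is the capital of Ukraine.", dummy_encoder)
print(calls)  # 1: the second call is served from the cache
```

In a local RAG setting this means the expensive encoder pass runs once per document, while the small decoder handles each new question.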
Future research should revolve around extending the datasets both for the pretraining of the separate components and for the final end-to-end training of the proposed model, as well as extending the architecture to cite multiple pieces of context at once during the generation of one text. Such advances would provide more options for local language model inference and increase the controllability of generative transformers by tracing each generated token to a certain source.
Finally, we would like to scale this model beyond English and train a multilingual version of the modular encoder–decoder knowledge-injected transformer.