A Multi-Teacher Knowledge Distillation Framework with Aggregation Techniques for Lightweight Deep Models

Hamdi, Ahmed; Noura, Hassan N.; Azar, Joseph

doi:10.3390/asi8050146


by Ahmed Hamdi †, Hassan N. Noura *,† and Joseph Azar

Université Marie et Louis Pasteur, CNRS, Institut FEMTO-ST (UMR 6174), F-90000 Belfort, France

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Syst. Innov. 2025, 8(5), 146; https://doi.org/10.3390/asi8050146

Submission received: 2 August 2025 / Revised: 7 September 2025 / Accepted: 24 September 2025 / Published: 30 September 2025


Abstract

Knowledge Distillation (KD) is a machine learning technique in which a compact student model learns to replicate the performance of a larger teacher model by mimicking its output predictions. Multi-Teacher Knowledge Distillation extends this paradigm by aggregating knowledge from multiple teacher models to improve generalization and robustness. However, effectively integrating outputs from diverse teachers, especially in the presence of noise or conflicting predictions, remains a key challenge. In this work, we propose a Multi-Round Parallel Multi-Teacher Distillation (MPMTD) framework that systematically explores and combines multiple aggregation techniques. Specifically, we investigate aggregation at different levels, including loss-based and probability-distribution-based fusion. Our framework applies different strategies across distillation rounds, enabling adaptive and synergistic knowledge transfer. Through extensive experimentation, we analyze the strengths and weaknesses of individual aggregation methods and demonstrate that strategic sequencing across rounds significantly outperforms static approaches. Notably, we introduce the Byzantine-Resilient Probability Distribution aggregation method, applied for the first time in a KD context, which achieves state-of-the-art performance, with an accuracy of 99.29% and an F1-score of 99.27%. We further identify optimal configurations in terms of the number of distillation rounds and the ordering of aggregation strategies, balancing accuracy with computational efficiency. Our contributions include (i) the introduction of advanced aggregation strategies into the KD setting, (ii) a systematic evaluation of their performance, and (iii) practical recommendations for real-world deployment. These findings have significant implications for distributed learning, edge computing, and IoT environments, where efficient and resilient model compression is essential.

Keywords:

knowledge distillation; cross-modal; neural network compression; downsampling; multi-teachers; multi-rounds of distillation

1. Introduction

Deep learning has produced increasingly complex neural networks, often with millions or billions of parameters [1,2]. While these large models achieve state-of-the-art results, their computational and memory demands [3] limit deployment in edge and IoT devices. Neural network compression [4] seeks to address this challenge, reducing model size and computation while retaining performance. Among compression techniques, Knowledge Distillation (KD) [5] is especially effective, transferring knowledge from a large teacher to a compact student, and is widely used in resource-constrained environments [6].

While traditional KD typically involves a single teacher [7], Multi-Teacher KD (MTKD) [8] leverages multiple teachers, offering richer supervision. However, the key challenge lies in aggregation: outputs from diverse teachers may be noisy, conflicting, or adversarial [9,10]. Existing aggregation techniques, such as averaging or median rules [11,12], are often inadequate when teacher reliability varies.

In emerging applications, such as drones [13], robotics [14], and autonomous vehicles [15], compact and robust models are required for real-time decision-making under latency and energy constraints [16]. These scenarios motivate stronger aggregation mechanisms that can adapt across heterogeneous teachers while supporting efficient on-device inference.

To overcome the limitations of static, single-step aggregation, we focus on multi-round aggregation. Instead of aggregating once, teacher knowledge is progressively refined across rounds, reducing bias and improving convergence [17]. Inspired by Federated Learning (FL) techniques [18], we explore robust strategies such as Krum and Byzantine-Resilient Aggregations, repurposed in a centralized KD setting. Crucially, unlike prior Multi-Teacher KD frameworks that rely on static averaging or meta-learning, our approach supports adaptive switching between aggregation types and operates at both probability-distribution and distillation-loss levels. This combination of multi-round, parallel, and dual-level aggregation, systematically applied to KD, represents the core novelty of our Multi-Round Parallel Multi-Teacher Distillation (MPMTD) framework.

1.1. Motivation

Robust aggregation methods in FL, such as Krum [19] and Byzantine-Resilient techniques [20], mitigate unreliable or adversarial inputs in distributed systems. Yet their potential in the KD setting remains underexplored. Existing aggregation strategies often rely on single-step aggregation [21], which may fail to reconcile the diversity in teacher outputs, leading to suboptimal student performance. Multi-round aggregation offers a promising solution by iteratively refining knowledge transfer, progressively filtering out inconsistencies, reducing noise, and improving convergence. By adapting robust aggregation strategies from FL into KD, our work enhances the effectiveness of multi-teacher aggregation and improves student generalization.

We also consider a cross-modal KD setting, where teacher models are trained on RGB images while the student operates on grayscale inputs. This setup reflects practical constraints often found in edge devices with limited sensing or computational capacity. In such cases, multi-round aggregation is particularly valuable, as it enables the student to gradually align with diverse teacher signals despite modality differences, leading to more stable and generalizable outcomes.

1.2. Contributions

This work makes the following key contributions:

A novel iterative Multi-Round Parallel Multi-Teacher KD (MPMTD) framework is proposed to enable adaptive and strategic teacher aggregation, enhancing scalability, robustness, and efficiency.
Robust aggregation techniques from Federated Learning are introduced into the multi-teacher distillation context to address issues such as noise, conflicting predictions, and adversarial teacher behavior.
A comprehensive evaluation of individual and combined aggregation methods is conducted, providing insights into their respective strengths and limitations.
Aggregation is investigated at multiple levels, including both output-level and loss-level fusion, to assess their impact on student model performance.
The optimal number of distillation rounds is identified, showing that most performance gains plateau after 2–5 rounds, thereby informing trade-offs between convergence and computational cost for edge deployment.
The framework is validated under a cross-modal distillation setting (RGB teachers to grayscale student) on a real-world agricultural dataset with fixed splits, demonstrating its practical relevance for low-resource scenarios and real-time applications.

1.3. Organization

The remainder of this paper is structured as follows: Section 2 provides background on lightweight machine learning, neural network compression, KD, and multi-teacher aggregation techniques, with a particular focus on Federated Learning-based aggregation methods. Section 3 introduces key aggregation strategies used in MT-KD, including probability distribution aggregation and distillation loss aggregation, and presents the proposed MPMTD framework. Section 4 details the experimental setup, including dataset selection, model architectures, and evaluation metrics, and presents a comprehensive performance analysis of different aggregation techniques. Section 5 discusses the results, analyzing the strengths and limitations of various aggregation approaches and their impact on student model performance. Finally, Section 6 concludes the study, summarizing key contributions and implications. For clarity, Table 1 provides definitions of key variables used in the formulation of the aggregation strategies.

2. Background and Preliminaries

This section provides an overview of the fundamental concepts and techniques related to lightweight edge machine learning. We discuss key computational constraints, neural network compression methods, KD strategies, and Federated Learning aggregation techniques that are essential for understanding the challenges and solutions in this domain.

2.1. Edge Devices

Edge devices refer to a broad category of systems, including Internet of Things (IoT) devices [22], drones [23], and robots [24], that operate at the periphery of networks with limited computational resources [6]. These devices often face challenges, such as constrained memory, processing power, and energy efficiency, making the deployment of traditional deep learning models impractical. To address these challenges, lightweight machine learning solutions [25] have emerged as a viable approach, focusing on reducing model size, computational requirements, and energy consumption without significantly compromising performance.

IoT devices, often equipped with sensors, are used for real-time data collection and processing in applications like smart cities, industrial automation, and healthcare. Drones, on the other hand, require efficient on-device processing to perform tasks such as object detection and navigation autonomously. Similarly, robots need compact and efficient ML models to manage perception, decision-making, and control in dynamic environments.

Lightweight ML techniques, such as pruning, quantization, and KD, are essential for enabling efficient deployment on edge devices. KD, in particular, allows complex models to transfer knowledge to simpler, resource-efficient student models, making it possible to execute advanced ML tasks directly on edge devices.

This work contributes to enhancing lightweight ML for edge devices by proposing a multi-round, parallel Multi-Teacher KD framework. This approach focuses on efficient knowledge transfer through adaptive aggregation strategies, optimizing model performance while addressing the computational and energy constraints typical of edge devices. The proposed framework enables the effective deployment of ML models on IoT devices, drones, and robots, facilitating real-time tasks such as object recognition, anomaly detection, and autonomous decision-making with improved efficiency.

2.2. Computational Resource Constraints and Deployment Challenges

The increasing complexity of deep learning models [26] presents significant challenges for efficient deployment in real-world environments. As AI applications expand across diverse domains, ensuring computational efficiency, scalability, and sustainability becomes a critical concern.

2.2.1. Computational Resource Constraints

The rapid evolution of AI and deep learning has led to groundbreaking advancements across various fields, including healthcare, natural language processing, and autonomous systems. These achievements have been powered by increasingly complex models with billions of parameters, such as GPT-4 [27], ViTs [28], large CNNs [28], and large-scale ensemble architectures. However, the unprecedented size and complexity of these models pose significant challenges for their deployment, particularly in resource-constrained environments.

Large-scale neural networks demand substantial memory resources [29], both during training and inference. For instance, modern models often require multiple gigabytes of memory to store their parameters and intermediate computations. This makes their deployment on devices with limited memory, such as smartphones, IoT devices, or edge computing platforms, a formidable challenge. The memory overhead also restricts their scalability in distributed systems, where bandwidth limitations exacerbate the problem.

Real-time applications [30], such as autonomous driving [31] or medical diagnostics [32], impose strict latency requirements. Despite their high accuracy, large models often suffer from prolonged inference times due to their depth and the number of parameters. This delay is particularly problematic in time-sensitive scenarios, where even milliseconds of latency can have critical consequences. The energy demands of state-of-the-art models are another major barrier to widespread deployment [33]. Training a single large-scale model can consume hundreds of megawatt-hours of electricity, with significant carbon emissions. During inference, the high energy requirements of such models limit their utility in battery-powered devices and other low-power settings, where efficiency is paramount.

2.2.2. Deployment Challenges

The deployment of AI models in real-world scenarios often requires a balance between performance and resource efficiency. Large models, while highly accurate, face challenges, such as those summarized in Table 2.

2.3. Neural Network Compression

The increasing complexity of deep learning models presents challenges for deployment on resource-constrained devices. Neural network compression techniques aim to reduce model size and computational cost while preserving performance. Key techniques are summarized in Table 3.

One widely used method is pruning [34], which removes less important connections or neurons to reduce model size and computation. Various pruning strategies exist: weight pruning eliminates connections with small weights, neuron/filter pruning removes entire neurons or filters, structured pruning eliminates groups of weights such as entire filters or channels for better hardware acceleration, while unstructured pruning removes individual weights in a more fine-grained manner. Another essential technique is quantization [35], which reduces the precision of weights and activations, significantly lowering memory usage and computational cost. This can be achieved through post-training quantization, where the model is quantized after training, or quantization-aware training, where the model is trained with simulated quantization to maintain better accuracy. Additionally, low-rank factorization [36] is employed to decompose weight matrices into lower-rank matrices, effectively reducing model complexity. An alternative strategy is designing compact network architectures [37] from the outset, ensuring efficiency without requiring post-training modifications. Examples include MobileNet, which utilizes depthwise separable convolutions, and ShuffleNet, which employs channel shuffling to improve efficiency. Lastly, KD [38], which is discussed in detail in the following subsection, transfers knowledge from a large, complex teacher model to a smaller, more efficient student model, enabling significant reductions in model size while preserving performance. These techniques are often combined to achieve even greater compression ratios.

Figure 1 illustrates a visual representation of these compression techniques.

2.4. KD

KD [38] is a neural network compression technique where a smaller, more efficient student model is trained to replicate the behavior of a larger, high-capacity teacher model. This approach enables the deployment of deep learning models in resource-constrained environments by reducing computational requirements while maintaining performance.

The process involves training the student model to match the softened output probabilities of the teacher model. These softened outputs are obtained by applying a temperature parameter T to the softmax function in the teacher model’s final layer, producing a probability distribution over classes that reveals the relative confidence of the teacher in its predictions. The softmax function with temperature T is defined as follows:

P_{i} = \frac{\exp(z_{i} / T)}{\sum_{j} \exp(z_{j} / T)} \qquad (1)

where $z_{i}$ represents the logit (pre-softmax activation) for class i. A higher temperature T produces a softer probability distribution, providing more information about the teacher’s uncertainty and inter-class relationships.
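As a minimal illustration of Equation (1), the following Python sketch computes the temperature-scaled softmax on plain lists. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration; they do not come from the experiments reported in this paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum before exponentiating improves numerical
    # stability and leaves the softmax result unchanged.
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits (hypothetical values).
logits = [4.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
print(sharp)
print(soft)
```

With the higher temperature, probability mass moves from the top class toward the other classes, which is the softening effect described above.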

The student model is trained using a combination of two loss functions: the standard cross-entropy loss with the true labels and the KL divergence loss between the teacher’s and student’s softened output distributions. The total loss L is given by the following:

L = \alpha \, \mathrm{CE}(y_{true}, y_{student}) + (1 - \alpha) \, T^{2} \, \mathrm{KL}(P_{teacher}^{T} \,\|\, P_{student}^{T}) \qquad (2)

where

$α$ is a weighting factor that balances the two loss terms;
$y_{true}$ denotes the true labels;
$y_{student}$ represents the student’s predictions with $T = 1$ ;
$P_{teacher}^{T}$ and $P_{student}^{T}$ are the softened output probabilities of the teacher and student models, respectively;
$T^{2}$ is included to account for the gradients’ scaling effect due to the temperature T.
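The loss in Equation (2) can be sketched for a single sample in dependency-free Python. This is an illustrative sketch under stated assumptions, not the authors’ training code: the helper names (`softmax`, `cross_entropy`, `kl_divergence`, `kd_loss`), the defaults `alpha=0.5` and `temperature=4.0`, and the small `eps` guard against log(0) are all hypothetical choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, as in Equation (1)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_index, probs, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -math.log(probs[true_index] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(student_logits, teacher_logits, true_index,
            alpha=0.5, temperature=4.0):
    """Single-sample distillation loss following Equation (2)."""
    y_student = softmax(student_logits, 1.0)           # hard-label term, T = 1
    p_teacher = softmax(teacher_logits, temperature)   # softened teacher
    p_student = softmax(student_logits, temperature)   # softened student
    hard = cross_entropy(true_index, y_student)
    soft = kl_divergence(p_teacher, p_student)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Hypothetical logits for a three-class problem.
loss = kd_loss([2.0, 0.5, -1.0], [0.0, 0.0, 3.0], true_index=0)
print(loss)
```

Setting `alpha=1.0` reduces the loss to the plain cross-entropy term, while `alpha=0.0` keeps only the temperature-scaled KL term.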

Recent advancements in KD have explored various strategies to enhance the effectiveness of knowledge transfer. For instance, the Diversity-Enhanced KD (DivKD) [39] model introduces an adaptive diversity distillation method, where the student model learns diverse equations by selectively transferring high-quality knowledge from the teacher model. This approach incorporates a diversity prior-enhanced student model to better capture the diversity distribution of equations by utilizing a conditional variational autoencoder.

Another notable development is the Decoupled KD (DKD) [40] framework, which reformulates the traditional KD loss into two components: Target Class KD (TCKD) and Non-Target Class KD (NCKD). This decoupling allows for more efficient and flexible knowledge transfer, leading to improved performance in student models. These advancements demonstrate the ongoing efforts to refine KD techniques, making them more effective for various applications, including deploying lightweight models in resource-limited settings.

2.5. MT-KD Approaches

KD has emerged as a powerful technique for improving the efficiency of lightweight neural networks by transferring knowledge from pre-trained teacher models to student models. While traditional single-teacher KD methods have demonstrated significant improvements in model generalization, they often suffer from limited knowledge diversity. To address this, MT-KD has been proposed, where multiple teacher networks collaboratively guide a student, enhancing robustness, generalization, and representation learning [41].

A key challenge in MT-KD is effectively aggregating knowledge from multiple teachers. Early approaches relied on naive averaging of teacher outputs, which often failed to account for the varying expertise of individual teachers [11,42]. Recent advancements have introduced adaptive weighting mechanisms, which dynamically assign importance weights to teachers based on their relevance to specific instances [43]. This ensures that the student prioritizes knowledge from the most informative teachers, leading to more effective learning.

Another significant development in MT-KD is the integration of multi-level knowledge transfer [44]. While traditional KD methods primarily focus on output-level supervision through soft labels, recent works extend this to intermediate feature representations. By distilling structural knowledge from multiple teachers, students can learn richer feature representations, improving performance in complex tasks such as object recognition and anomaly detection. This hierarchical distillation process [45] enables students to leverage both high-level and intermediate-level knowledge, mimicking the internal learning dynamics of multiple teacher models.

Teacher selection strategies have also been explored to dynamically identify the most relevant teachers based on instance-specific characteristics [46]. Instead of treating all teachers equally, these approaches use latent representations to determine the influence of each teacher on a per-instance basis. This is particularly beneficial in tasks where different teachers specialize in different aspects of data representation, allowing the student to selectively incorporate the most useful knowledge.

MT-KD has found applications beyond standard classification tasks, such as in anomaly detection [47], where the goal is to learn normal data distributions without explicit access to anomalous samples. In this setting, MT-KD leverages multiple pre-trained teachers to provide diverse perspectives on normal data, enabling the student to detect deviations more effectively. Innovations such as autoencoder-based reconstruction mechanisms [48] have further refined the importance weights of different teachers by incorporating reconstruction errors into the distillation process.

Despite these advancements, MT-KD faces several challenges. Computational efficiency remains a primary concern, as training with multiple teachers significantly increases resource requirements. Optimizing the trade-off between knowledge richness and computational cost is an active area of research. Additionally, conflicting teacher knowledge [49] poses a challenge: when different teachers provide contradictory outputs, the student model may struggle to reconcile the differences, leading to suboptimal learning. Addressing this requires more sophisticated aggregation frameworks, such as confidence-weighted averaging or consensus-driven knowledge fusion [50].

Future research directions in MT-KD include the development of efficient teacher selection algorithms, meta-learning approaches for dynamic weight assignment [51], and self-distillation techniques [52] that allow students to refine their knowledge base progressively. These innovations will further enhance the applicability of MT-KD across a wide range of machine learning domains.

2.6. Aggregation Techniques in MT-KD

MT-KD methods can be categorized into two groups: Parallel aggregation and Successive aggregation. The differences between these two approaches are illustrated in Figure 2 and Figure 3.

2.6.1. Parallel Aggregation

In Parallel aggregation, teacher outputs are combined simultaneously using various strategies, such as averaging, confidence-aware weighting [53], and attention-based aggregation. Task-specific probes with adaptive correction enable the student model to align its learning with predefined objectives, thereby improving performance in structured tasks. Confidence-aware methods, such as Confidence-Aware MT-KD (CA-MKD), assign varying importance to teacher predictions based on their reliability, ensuring that more accurate teachers contribute more effectively to the student’s learning. Adaptive temperature-based methods [10] refine the fusion process by dynamically adjusting softmax scaling to balance the diverse influences of multiple teachers. Meta-learning-driven approaches, such as Meta-MT-KD, optimize aggregation by learning how best to combine multiple teachers flexibly and adaptively. While Parallel aggregation methods are computationally efficient and leverage diverse teacher knowledge in a single training phase, they may overlook the sequential relevance of teacher expertise.

As summarized in Table 4, the Parallel aggregation techniques used in this work, some of which are based on Federated Learning (FL) aggregation strategies, follow distinct steps in Pre-Aggregation, Aggregation, and Post-Aggregation to optimize knowledge transfer and robustness.

2.6.2. Successive Aggregation

In Successive aggregation, the student sequentially learns from different teachers over time. This can be structured at the mini-batch level, like switched training, or over multiple epochs. Some methods involve progressive fine-tuning, where each teacher provides specialized knowledge at different stages. Multi-Teacher Progressive Distillation [54] allocates specific training phases to individual teachers, ensuring deeper adaptation. Stage-wise KD techniques [55] guide the student through intermediate feature representations, allowing for gradual refinement of knowledge. Two-stage Multi-Teacher KD [56] approaches leverage large-scale pre-training before fine-tuning with multiple teachers. Successive methods provide structured learning advantages but may introduce longer training times and require careful scheduling to balance contributions from different teachers.

Despite their respective benefits, both approaches have trade-offs. Parallel aggregation efficiently consolidates multiple teacher insights but may struggle with teacher discrepancies. Successive aggregation provides structured learning but requires more time and tuning. Future work aims to integrate these strategies into hybrid methods to optimize Multi-Teacher KD.

2.7. Aggregation Techniques in FL

Federated Learning is a decentralized learning paradigm enabling multiple clients to train a shared global model collaboratively without exchanging raw data. This approach enhances data privacy and security, making it particularly useful in healthcare, finance, and edge computing applications. FL follows an iterative process where clients train local models using their private data and share only model updates with a central server, which then aggregates these updates to improve the global model.

Several research efforts have explored different aspects of FL, including aggregation strategies, optimization techniques, and security enhancements. One of the foundational works in FL is Federated Averaging (FedAvg) [57], which introduced a simple yet effective approach for aggregating local updates by computing a weighted average based on client data sizes. Despite its effectiveness, FedAvg struggles in non-Independent and Identically Distributed (non-IID) settings, where client data distributions vary significantly. To address this issue, several works, such as FedProx [58], introduced regularization terms to mitigate client drift and improve model convergence in heterogeneous environments.
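To make the FedAvg weighting rule concrete, the sketch below averages client parameter vectors in proportion to their local data sizes. The function name `fedavg` and the toy values are hypothetical; practical FL systems apply the same rule to full model parameter tensors rather than short lists.

```python
def fedavg(client_updates, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    if len(client_updates) != len(client_sizes):
        raise ValueError("one size is required per client update")
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(n * update[k] for update, n in zip(client_updates, client_sizes)) / total
        for k in range(dim)
    ]


# Two hypothetical clients; the first holds three times as much data.
global_update = fedavg([[1.0, 0.0], [0.0, 1.0]], client_sizes=[3, 1])
print(global_update)
```

The client with more data pulls the aggregate toward its own update, which is also why FedAvg can drift when client data distributions differ strongly.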

Other studies have focused on optimizing FL for large-scale deployment. FedNova [59] presented a normalized averaging technique to ensure fair contribution from clients with varying computational capabilities, while Scaffold introduced variance reduction techniques to address client-side drift. Additionally, personalized Federated Learning approaches such as Per-FedAvg [60] and FedPer [61] allow clients to maintain personalized components in their models, ensuring that knowledge transfer remains effective even with diverse data distributions.

Security and privacy are critical concerns in FL, prompting research into secure aggregation methods. Techniques such as homomorphic encryption and secure multiparty computation enable model aggregation without exposing individual client updates. Differential privacy-based approaches, such as FedSGD [62] with noise injection, further enhance security by ensuring that individual contributions remain indistinguishable.

Another important direction in FL research is hierarchical Federated Learning, where intermediary nodes aggregate updates before transmitting them to a central server. This architecture reduces communication overhead and enhances scalability, making it suitable for edge computing environments. Blockchain-based decentralized aggregation has also been explored to eliminate reliance on a central server, improving robustness in federated systems.

The relevance of FL to MT-KD lies in its well-established aggregation techniques, which can be adapted to optimize knowledge extraction from multiple teachers. Similar to how FL aggregates model updates from different clients, these techniques can be leveraged to efficiently combine knowledge from various teacher models while ensuring robustness and adaptability. By integrating aggregation strategies from FL, Multi-Teacher KD can better optimize the selection and fusion of knowledge, ultimately improving the student model’s learning process and performance.

Overall, FL continues to evolve with advancements in aggregation strategies, security measures, and optimization techniques. Future research aims to integrate meta-learning, reinforcement learning-based aggregation, and decentralized federated frameworks to further enhance Federated Learning systems’ efficiency, fairness, and privacy.

2.8. Overview of Aggregation Strategies

In this section, we present and describe a comprehensive set of key aggregation techniques used in our MPMTD framework.

2.9. Aggregation Function

Aggregation techniques are designed to aggregate predictions from multiple teacher models into a final ensemble prediction $P_{agg}(x)$ to guide the training of a student model. The aggregation methods are categorized into standard and robust techniques, each addressing specific challenges such as computational efficiency, outlier sensitivity, and adversarial inputs.

Let $T = \{T_{1}, T_{2}, \dots, T_{N}\}$ denote a set of N teacher models. Given an input x, each teacher produces a probability distribution $P_{T_{i}}(x)$. The objective is to aggregate these predictions into a final teacher ensemble prediction $P_{agg}(x)$. A detailed overview of the aggregation techniques used in this work, including their descriptions, computational steps, and mathematical formulations, is presented in Table 5.

For a more detailed explanation of the Krum Aggregation and Byzantine-Resilient Aggregation techniques, including step-by-step algorithmic implementations, please refer to Algorithm 1 for Krum and Algorithm 2 for Byzantine-Resilient Aggregation.

Algorithm 1: Krum Aggregation for Multi-Teacher KD.

Algorithm 2: Byzantine-Resilient Aggregation (Trimmed Mean).

Input: Set of teacher models $T$, input x, trim parameter c (the number of predictions trimmed from each end)

Output: Aggregated prediction $P_{agg}(x)$

1: Compute teacher predictions: $P_{T_{i}}(x)$ for each $T_{i} \in T$;
2: Sort predictions along the teacher axis;
3: Trim the top c and bottom c predictions; let O denote the set of trimmed predictions;
4: Compute the mean of the remaining values: $P_{agg}(x) = \frac{1}{N - 2c} \sum_{i \notin O} P_{T_{i}}(x)$;
5: Return aggregated prediction $P_{agg}(x)$;
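The steps of Algorithm 2 can be sketched in Python as a coordinate-wise trimmed mean over teacher probability vectors. The sketch assumes that c is the number of values trimmed from each end, consistent with the N - 2c normalizer, and that sorting is done separately for each class. The `renormalize` option is an addition of this sketch, not a step of Algorithm 2; it is included because a coordinate-wise trimmed mean does not necessarily sum to 1.

```python
def trimmed_mean_aggregate(teacher_probs, c, renormalize=False):
    """Coordinate-wise trimmed mean of N teacher probability vectors."""
    n = len(teacher_probs)
    if c < 0 or 2 * c >= n:
        raise ValueError("c must satisfy 0 <= c and N - 2c >= 1")
    num_classes = len(teacher_probs[0])
    aggregated = []
    for k in range(num_classes):
        # Sort the N teacher values for class k along the teacher axis.
        column = sorted(p[k] for p in teacher_probs)
        # Drop the c smallest and c largest values, then average the rest.
        kept = column[c:n - c]
        aggregated.append(sum(kept) / (n - 2 * c))
    if renormalize:
        # Optional step (an assumption of this sketch): rescale to sum to 1.
        total = sum(aggregated)
        aggregated = [a / total for a in aggregated]
    return aggregated


# Three broadly consistent teachers and one hypothetical unreliable teacher.
teachers = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.00, 0.00, 1.00],
]
agg = trimmed_mean_aggregate(teachers, c=1)
print(agg)
```

In this toy example the extreme value of the fourth teacher for the last class is discarded, so the aggregate stays close to the consensus of the other three teachers.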

Furthermore, the proposed framework does not explicitly assign reliability-based weights to individual teachers. Instead, reliability is enforced implicitly through the choice of aggregation strategy. For example, Krum excludes predictions that significantly deviate from the consensus, while Byzantine-Resilient Aggregation discards extreme values. This approach enhances robustness against unreliable teachers without the need for prior accuracy evaluation or explicit weighting.
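Since the steps of Algorithm 1 are not reproduced in the text, the following sketch follows the standard Krum selection rule from the FL literature, applied here to teacher probability vectors; it should be read as an assumption about how Krum can be adapted to this setting, not as the authors’ exact procedure. Each teacher is scored by the sum of squared distances to its N - f - 2 nearest peers, and the teacher with the lowest score is selected. The parameter `f` (the assumed number of unreliable teachers) and the function name `krum_select` are hypothetical.

```python
def krum_select(teacher_probs, f):
    """Return (index, prediction) of the teacher chosen by the Krum rule."""
    n = len(teacher_probs)
    neighbours = n - f - 2
    if neighbours < 1:
        raise ValueError("Krum requires N - f - 2 >= 1")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, p in enumerate(teacher_probs):
        # Squared distances from teacher i to every other teacher, ascending.
        dists = sorted(
            squared_distance(p, q)
            for j, q in enumerate(teacher_probs)
            if j != i
        )
        # Krum score: sum over the closest N - f - 2 peers.
        scores.append(sum(dists[:neighbours]))
    best = min(range(n), key=scores.__getitem__)
    return best, teacher_probs[best]


# Four broadly consistent teachers and one hypothetical outlier (index 4).
teachers = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.78, 0.12, 0.10],
    [0.00, 0.00, 1.00],
]
best, chosen = krum_select(teachers, f=1)
print(best, chosen)
```

Because the outlier is far from every other prediction, its score is large and it is never selected, which matches the implicit reliability filtering described above.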

3. Proposed Method

3.1. Aggregation Levels in MT-KD

In this work, we explore two main aggregation levels in MT-KD. The first approach involves aggregating the probability distributions directly from the teacher models, which is referred to as Probability Distribution (PD) Aggregation. The second approach focuses on aggregating the distillation losses of the teachers, termed Distillation Loss (DL) Aggregation. These two methods offer distinct pathways for the student model to learn from multiple teacher models. PD Aggregation allows the student to combine the probabilistic outputs of the teachers, while DL Aggregation enables the student to integrate the losses computed from each teacher’s guidance. Experiments are conducted to evaluate the effectiveness of each aggregation level and to analyze their performance using various aggregation techniques. This investigation aims to determine which approach is more advantageous in enhancing the student model’s learning process.

3.1.1. PD

PD refers to the soft outputs of a neural network, typically produced by applying a softmax activation function to the logits (pre-softmax outputs) of the final layer. These distributions represent the predicted probabilities for each class, summing to 1. In MT-KD, these distributions from multiple teacher models can be aggregated to form a unified target distribution for the student model.

To combine PDs, one common approach is averaging, written as follows:

$\hat{y} = \frac{1}{N} \sum_{i = 1}^{N} T_{i} (x)$ (3)

where $T_{i} (x)$ represents the probability output of the i-th teacher for input x, and N is the number of teachers. This provides the student with a single consensus target, simplifying the learning process. However, it may dilute individual teacher contributions if predictions differ significantly. PDs are computed at the output layer of the teacher models and are used to guide the student in mimicking the combined behavior of the teachers. Figure 4 presents the PD aggregation process from multiple teachers, where the student model is guided by the combined soft predictions.
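As a small illustration of Equation (3), the sketch below converts teacher logits to probabilities with a softmax and averages them class by class. It uses only the Python standard library; the helper names `softmax` and `mean_pd` and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pd(teacher_probs):
    """Equation (3): class-wise average of the N teacher distributions."""
    n = len(teacher_probs)
    num_classes = len(teacher_probs[0])
    return [sum(p[k] for p in teacher_probs) / n for k in range(num_classes)]


# Hypothetical logits from two teachers for a three-class input.
teacher_logits = [[2.0, 1.0, 0.1], [1.5, 1.4, 0.2]]
teacher_probs = [softmax(z) for z in teacher_logits]
y_hat = mean_pd(teacher_probs)
```

Because each teacher output sums to 1, their average does as well, so no renormalization is needed for mean aggregation.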

3.1.2. DL

DL measures the divergence between the student’s predicted probability distribution and the teacher’s. The goal is to align the student’s outputs with the teacher’s, encouraging the student to learn not just the correct predictions but also the relative confidence levels encoded in the teacher’s probabilities. The most common choice for distillation loss is the KL divergence, calculated as follows:

$L_{distill} = KL (S (x) \,\|\, T (x))$ (4)

where $S (x)$ and $T (x)$ are the student’s and teacher’s PDs, respectively.

In the case of MT-KD, DL can be computed for each teacher individually and then combined. This approach allows the student to learn from each teacher separately, preserving the unique contributions of diverse teachers. Figure 5 further demonstrates how DLs from multiple teachers are aggregated to guide the student model’s learning process, reinforcing the concept discussed in this section.
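A minimal sketch of DL-level aggregation follows: one KL divergence is computed per teacher, and the resulting scalar losses are then combined. The argument order follows Equation (4) as written, KL(S ‖ T); many KD implementations use the reverse order, so this choice should be read as an assumption. The function names and the example distributions are hypothetical.

```python
import math
from statistics import mean, median


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_losses(student_probs, teacher_probs):
    """One distillation loss per teacher, following Equation (4)."""
    return [kl_divergence(student_probs, t) for t in teacher_probs]


def aggregate_dl(losses, how="mean"):
    """Combine per-teacher losses into a single training signal."""
    if how == "mean":
        return mean(losses)
    if how == "median":
        return median(losses)
    raise ValueError(f"unknown aggregation method: {how}")


# Hypothetical example: two teachers match the student, one disagrees.
student = [0.5, 0.5]
teachers = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
losses = distillation_losses(student, teachers)
```

In this example the median ignores the single disagreeing teacher, while the mean still reflects it, which is the kind of difference the aggregation methods are meant to expose.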

3.1.3. Comparison of Aggregation Levels

To better understand the differences between the two primary aggregation levels, PD and DL, we provide a side-by-side comparison that outlines their core characteristics and use cases. This comparative overview helps clarify how each level impacts the distillation process, particularly in terms of complexity, performance, and robustness.

Table 6 summarizes these two aggregation levels, their functional descriptions, and their distinguishing characteristics.

In addition to these core characteristics, each aggregation level presents unique strengths and potential drawbacks. These practical trade-offs are outlined in Table 7.

3.2. Proposed Method: MPMTD Framework

The proposed MPMTD framework, as illustrated in Figure 6, enhances student learning through an iterative process in which knowledge is progressively distilled from multiple teacher models over several rounds. It employs two main aggregation strategies, fixed and adaptive, that are applied at specific stages to effectively consolidate knowledge from diverse sources.

At each round t, the student model updates its parameters using an aggregation method at a selected aggregation level, either the DL level or the PD level. These levels are detailed in Section 3.1 and are referenced as needed in this section.

3.3. Fixed Aggregation Strategy

The fixed aggregation strategy employs a single aggregation method and a single aggregation level consistently across all rounds. This means that at every round t, the student model receives knowledge from the teachers using the same mathematical aggregation method (e.g., Mean, Median, Trimmed Mean, Krum, or Byzantine-Resilient) applied at a predefined level, either DL or PD.

$S_{t} = F_{\text{student}} (\text{Agg}_{\text{fixed}} (X_{1}, X_{2}, \ldots, X_{n}))$ (5)

where

$\text{Agg}_{\text{fixed}}$ represents a fixed mathematical aggregation method used for all rounds.
$X_{i}$ is the input at the selected aggregation level, meaning the following:
− If the aggregation level is DL, then $X_{i} = DL_{i}$ (distillation loss from teacher i).
− If the aggregation level is PD, then $X_{i} = Proba_{i}$ (probability distribution from teacher i).

In Equation (5), $F_{\text{student}} (\cdot)$ denotes the student model’s learning function, which adapts its parameters based on the aggregated knowledge signal. The input $X_{i}$ corresponds to either the distillation loss ($DL_{i}$) or the probability distribution ($Proba_{i}$) produced by teacher i, depending on the fixed aggregation level. This formulation covers both PD-level and DL-level fixed strategies, where the aggregation method remains constant across all rounds.

Since both aggregation method and aggregation level are predetermined and do not change over rounds, this strategy ensures stability in knowledge transfer, allowing the student model to converge reliably over multiple rounds.
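To make the role of the fixed aggregator in Equation (5) concrete, the sketch below uses Krum as the aggregation method at the PD level. It follows the usual Krum rule: each candidate is scored by the sum of squared distances to its n − f − 2 nearest neighbors, and the lowest-scoring candidate is selected. Applying the rule to teacher probability vectors, the function name `krum_select`, and the example values are assumptions made for illustration.

```python
def krum_select(vectors, f):
    """Return the index of the vector chosen by the Krum rule.

    vectors: list of n equal-length lists (here, teacher distributions).
    f: assumed number of unreliable teachers to tolerate.
    The original guarantee for Krum requires n > 2f + 2.
    """
    n = len(vectors)
    m = n - f - 2  # number of nearest neighbours used in each score
    if m < 1:
        raise ValueError("need n - f - 2 >= 1")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_idx, best_score = None, float("inf")
    for i in range(n):
        # Squared distances from candidate i to all other candidates.
        dists = sorted(sq_dist(vectors[i], vectors[j]) for j in range(n) if j != i)
        score = sum(dists[:m])  # closest m neighbours only
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx


# Hypothetical teacher outputs: four in rough agreement, one outlier.
teachers = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.72, 0.18, 0.10],
    [0.69, 0.21, 0.10],
    [0.00, 0.00, 1.00],
]
chosen = krum_select(teachers, f=1)
target = teachers[chosen]  # the same rule would be reused in every round
```

Under the fixed strategy, this same selection rule would be applied in every round; only the student's parameters change between rounds.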

3.4. Adaptive Multi-Round Aggregation Strategy

Unlike fixed aggregation, the adaptive multi-round strategy dynamically selects aggregation methods across rounds. This provides flexibility, enabling the student to benefit from different aggregation approaches over the course of training. Instead of maintaining a single configuration, the framework can change the aggregation method, the aggregation level (distillation loss (DL) or probability distribution (PD)), or both from one round to the next.

The following three variations were explored:

Alternating DL and PD aggregation: switching between DL-level aggregation and PD-level aggregation across rounds. For example, DL-based aggregation may be used in round t, followed by PD-based aggregation in round $t + 1$ , and then back to DL.
Iterative aggregation with DL Level: using different aggregation methods (e.g., mean, median, Byzantine, Krum, weighted) in each round while maintaining a consistent aggregation level at DL.
Iterative aggregation with PD Level: similar to the previous approach but consistently using probability distribution–level aggregation (PD) instead of DL.

To ensure mathematical validity, only one type of aggregation is applied per round. If DL aggregation is selected, the training is guided directly by the aggregated loss values. If PD aggregation is selected, the aggregated teacher distribution is first converted into the loss space (using cross-entropy or KL divergence with the student outputs) before training. In this way, the adaptive strategy alternates between DL and PD across rounds but never combines them within the same round. This ensures that all training objectives remain consistent and well-defined in loss space.
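The per-round logic described above can be sketched as a schedule of (level, method) pairs, with exactly one level active in each round. This is an illustrative sketch only: the schedule, the helper names, and the use of KL divergence to map the aggregated PD into loss space are assumptions, and a real training loop would update the student between rounds.

```python
import math
from statistics import mean, median

AGGREGATORS = {"mean": mean, "median": median}


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def aggregate_pd(teacher_probs, how):
    """Class-wise aggregation of teacher distributions, renormalised."""
    agg = [AGGREGATORS[how]([p[k] for p in teacher_probs])
           for k in range(len(teacher_probs[0]))]
    total = sum(agg)
    if total == 0:
        raise ValueError("aggregated distribution is all zeros")
    return [v / total for v in agg]


def round_loss(student_probs, teacher_probs, level, how):
    """Distillation signal for one round; only one level is used."""
    if level == "DL":
        # Aggregate the per-teacher losses directly.
        return AGGREGATORS[how]([kl(student_probs, t) for t in teacher_probs])
    if level == "PD":
        # Aggregate the distributions first, then move to loss space.
        return kl(student_probs, aggregate_pd(teacher_probs, how))
    raise ValueError(f"unknown level: {level}")


# Hypothetical alternating schedule: DL, then PD, then DL again.
schedule = [("DL", "mean"), ("PD", "median"), ("DL", "median")]
student = [0.60, 0.25, 0.15]
teachers = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
per_round = [round_loss(student, teachers, level, how) for level, how in schedule]
```

Because every branch returns a scalar loss, the training objective stays in loss space regardless of which level a round uses.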

4. Experimental Results

In this section, we present the experimental results of different aggregation methods. Each aggregation’s performance is illustrated to provide insights into accuracy trends across rounds. The results highlight the maximum accuracy achieved and the overall stability of the methods.

4.1. Methodology Overview

This study introduces a frugal deep learning framework that leverages KD to transfer knowledge from multiple teacher models to a lightweight student model. The methodology revolves around the selection of an aggregation function and an adaptive training strategy. The aggregation function determines how the knowledge from multiple teacher models is combined before being distilled into the student model, while the adaptive training strategy allows dynamic selection between fixed and adaptive training approaches to optimize the learning process.

Figure 7 illustrates the overall framework of our methodology. The process begins with dataset preprocessing, including grayscale conversion to reduce computational complexity for the student model. Multiple teacher models are trained on RGB images, and their outputs are aggregated according to the selected function. The KD process then transfers the learned representations to the student model, which undergoes either fixed or adaptive training to optimize performance while maintaining efficiency.

4.1.1. Dataset Description

The experiments were conducted using the EDEN Library Dataset [67], a diverse and challenging collection of agricultural images that includes both healthy and diseased plants. This dataset consists of 19 unbalanced classes of high-resolution images (4000 × 3000 pixels), providing detailed visual information essential for classification tasks. The inherent class imbalance and diversity of plant conditions make this dataset a strong benchmark for evaluating the robustness of machine learning models in real-world agricultural settings.

Figure 8 presents representative examples of both healthy and diseased plants from the EDEN Library Dataset, highlighting the dataset’s complexity.

4.1.2. Data Preprocessing

A structured data preprocessing pipeline was implemented to prepare the dataset for training both the student and teacher models. Key preprocessing steps include the following:

Rescaling: Images were resized to align with the model input dimensions.
Normalization: Pixel values were scaled to the [0, 1] range to accelerate convergence and enhance training stability.
Data augmentation: Techniques such as rotation, flipping, and random cropping were applied to enhance generalization.
Grayscale conversion: Images processed for the student model were converted from RGB (224 × 224 × 3) to grayscale (224 × 224 × 1). This step significantly reduces computational complexity while preserving key visual information for classification.
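The normalization and grayscale steps can be illustrated on a tiny image represented as nested lists. The paper does not state which RGB-to-grayscale formula was used; the sketch below assumes the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114). Resizing and augmentation are omitted, and the function name is hypothetical.

```python
def to_grayscale_normalized(rgb_image):
    """Convert an H x W image of (R, G, B) ints in 0..255 to H x W x 1 floats in [0, 1].

    Assumption: BT.601 luma weights; the original pipeline may differ.
    """
    gray = []
    for row in rgb_image:
        gray_row = []
        for r, g, b in row:
            luma = 0.299 * r + 0.587 * g + 0.114 * b  # weighted channel sum
            gray_row.append([luma / 255.0])  # scale to [0, 1], keep one channel
        gray.append(gray_row)
    return gray


# Hypothetical 2 x 2 image: white, black, pure red, pure blue.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
gray = to_grayscale_normalized(image)
```

The output keeps the spatial size but has a single channel, which mirrors the change from 224 × 224 × 3 to 224 × 224 × 1 for the student input.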

4.1.3. Data Splitting

To ensure consistency in evaluation, the test set was kept fixed across all experiments, while the training and validation sets were randomly sampled in each experiment run. The dataset was split as follows:

Training set: 70% of the dataset (randomly sampled in each run).
Validation set: 15% of the dataset (randomly sampled in each run).
Test set: 15% of the dataset (remains unchanged across experiments).

This approach ensures robustness in model assessment while mitigating data sampling biases.
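One way to obtain a fixed test set together with per-run random training and validation sets is to use two separate random seeds, as sketched below. The seeds, the function name `split_indices`, and the absence of class stratification are assumptions; the paper does not describe how the split was implemented.

```python
import random


def split_indices(n, run_seed, test_seed=0, val_frac=0.15, test_frac=0.15):
    """Return (train, val, test) index lists for a dataset of size n.

    The test set depends only on test_seed, so it is identical in every run.
    The train/validation partition depends on run_seed and changes per run.
    """
    indices = list(range(n))
    random.Random(test_seed).shuffle(indices)  # fixed shuffle for the test set
    n_test = round(n * test_frac)
    test, rest = indices[:n_test], indices[n_test:]

    random.Random(run_seed).shuffle(rest)  # run-specific shuffle
    n_val = round(n * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train_a, val_a, test_a = split_indices(100, run_seed=1)
train_b, val_b, test_b = split_indices(100, run_seed=2)
```

Both calls return the same test indices, while the remaining 85% of the data is repartitioned into training and validation sets for each run.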

4.1.4. Student Model Training

The student model, as illustrated in Figure 9, is designed to operate efficiently on grayscale images from the EDEN dataset. Unlike the teacher models that are trained on RGB images, the student model adopts a streamlined architecture, which is a modified and lighter version of MobileNet V1. This adaptation ensures a balance between performance and computational efficiency, making it suitable for scenarios with limited resources. The architecture consists of an initial Conv2D layer followed by multiple MobileNet blocks to progressively extract features. A Global Average Pooling layer is used to reduce spatial dimensions, followed by Dropout layers to mitigate overfitting. The Dense layers apply non-linear transformations, and the final output layer classifies the images based on the number of classes in the dataset. The use of lightweight MobileNet blocks enables the model to retain significant representational power while maintaining efficiency.

4.1.5. Teacher Model Training

The teacher models used for training include MobileNet V1, ResNet50, Xception, and EfficientNetB0. These models were trained on RGB images from the EDEN dataset, achieving high classification accuracy. Their complex architectures allowed them to learn rich feature representations, making them effective knowledge sources for distillation.

During the distillation process, the predictions of these teacher models served as soft targets for the student model. This enabled the student model to leverage the comprehensive feature representations learned by the teachers while maintaining efficiency by operating on grayscale data. The combination of these teacher models ensured a diverse and robust transfer of knowledge to the student model.

4.1.6. Evaluation Metrics

The model’s performance was assessed using standard classification metrics and computational efficiency measures. The primary evaluation criteria include the following:

Accuracy: The proportion of correctly classified samples, computed as follows:

$Accuracy = \frac{T P + T N}{T P + T N + F P + F N}$
F1-score: The harmonic mean of precision and recall, particularly useful for handling imbalanced datasets, is calculated as follows:

$F 1 = 2 \times \frac{P \times R}{P + R}$
Confusion matrix: This provides a granular breakdown of classification performance, capturing True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
Computational efficiency: Computational efficiency is measured by the following:
− Parameter count: The total number of trainable parameters.
− Inference time: The average time required for the model to classify a single image.
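For a multi-class problem such as the 19-class EDEN task, the accuracy and F1 formulas above are typically applied per class from the confusion matrix and then averaged. The sketch below assumes support-weighted averaging, in line with the weighted metrics reported later in the paper; the function name and the example matrix are hypothetical.

```python
def metrics_from_confusion(cm):
    """Accuracy and support-weighted F1 from a K x K confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    weighted_f1 = 0.0
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted i, true class differs
        fn = sum(cm[i]) - tp                       # true i, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weighted_f1 += f1 * sum(cm[i]) / total     # weight by class support
    return accuracy, weighted_f1


# Hypothetical two-class confusion matrix.
acc, f1 = metrics_from_confusion([[8, 2], [1, 9]])
```

Weighting each class's F1 by its support keeps frequent classes from being under-represented in the summary, at the cost of hiding poor performance on rare classes, which matters for an unbalanced dataset.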

4.1.7. Experimental Setup

All experiments were implemented in TensorFlow 2.x (Keras) and executed on an NVIDIA A100-PCIE-40GB GPU. Optimization used Adam with a learning rate of $1 \times 10^{-4}$. For Multi-Teacher KD, the objective is the weighted sum of student cross-entropy and the mean KL divergence between the student and each teacher’s softened outputs. The distillation weight $\alpha$ and temperature $T$ were tuned empirically, and the best configuration was adopted in the reported results. Training was conducted for 300 epochs per distillation round.
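The objective described above can be sketched for a single sample as follows. Several details are assumptions, since the text does not fix them: α is taken to weight the cross-entropy term and (1 − α) the distillation term, the KL direction follows Equation (4), and no temperature-squared scaling is applied. The sketch uses plain Python rather than TensorFlow, and all names and values are hypothetical.

```python
import math
from statistics import mean


def softmax_t(logits, temperature=1.0):
    """Softmax with temperature; higher values give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def kd_objective(student_logits, teacher_logits_list, label, alpha, temperature):
    """alpha * CE(student, label) + (1 - alpha) * mean_i KL(S_T || T_i,T)."""
    # Hard-label cross-entropy on the unsoftened student output.
    ce = -math.log(softmax_t(student_logits)[label] + 1e-12)
    # Mean KL divergence against each teacher's softened output.
    s_soft = softmax_t(student_logits, temperature)
    kls = [kl(s_soft, softmax_t(t, temperature)) for t in teacher_logits_list]
    return alpha * ce + (1 - alpha) * mean(kls)


# Hypothetical logits for one three-class sample and two teachers.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [[3.0, 0.5, 0.2], [2.5, 1.2, 0.1]]
loss = kd_objective(student_logits, teacher_logits, label=0, alpha=0.5, temperature=4.0)
```

Setting α = 1 reduces the objective to ordinary supervised training, while α = 0 trains the student on the teachers' softened outputs alone.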

4.2. KD Process

The student model was trained using a response-based cross-modal KD approach, where knowledge is transferred from high-capacity RGB-trained teacher models to a grayscale student model. This technique enables the student to achieve competitive performance while significantly reducing computational complexity.

4.3. Baseline Performance

This section presents the performance of both the teacher models and the student model. The teacher models, trained on full-resolution ($224 \times 224$) RGB images, serve as performance benchmarks for guiding the student during distillation.

Table 8 summarizes the accuracy, F1-score, recall, precision, parameter count, and parameter ratio of the teacher models. The parameter ratio is calculated as follows:

$\text{Parameter Ratio} = \frac{\text{Parameter Count of Model}}{\text{Parameter Count of Student}}$ (6)

All teacher models achieve accuracy above 99%, with MobileNet V1 being the most efficient (3.2 M parameters, 99.82% accuracy). ResNet50 and Xception, though more complex, maintain strong recall and F1-scores. EfficientNetB0 balances model complexity and accuracy, reaching 99.61% with 5.3 M parameters.

To evaluate the student model’s baseline performance and its improvement through single-teacher KD, we compare accuracy, F1-score, recall, and precision with and without KD. As shown in Table 9, the student model without distillation achieves 94.31% accuracy and an F1-score of 94.06% at 224 × 224 resolution, with recall and precision closely aligned. When distilled from a single MobileNet V1 teacher, the student improves to 96.85% accuracy and a 96.66% F1-score.

These results confirm that even single-teacher KD enhances the student’s generalization ability, narrowing the performance gap between the lightweight student and the high-capacity teacher.

4.4. Individual Aggregation Analysis

This section evaluates the individual performance of the aggregation methods employed in the MT-KD framework. The focus is on analyzing accuracy trends over multiple training rounds, highlighting each method’s strengths and limitations. The highest accuracy achieved by each approach is marked with a dashed line in the corresponding visualizations. Among all evaluated techniques, Byzantine-Resilient PD achieved the best overall performance, with an accuracy of 99.29% and an F1-score of 99.27%.

Performance Metrics Overview

To evaluate the effectiveness of different aggregation techniques, we conducted a quantitative performance comparison across key classification metrics. Table 10 presents a comparative analysis of various aggregation techniques, detailing accuracy, recall, precision, and F1-score across the initial and best-performing rounds. Notably, the best round is marked with an asterisk (*), providing insights into how each method evolves during training.

Compared to the single-teacher KD baseline (Table 9), which reached 96.85% accuracy, all MT-KD methods in Table 10 achieved higher peak accuracies, demonstrating that leveraging multiple teachers through robust aggregation strategies outperforms single-teacher distillation in guiding student learning.

4.5. Accuracy Difference Between Best Teacher and Student over Training Rounds

This subsection analyzes the accuracy difference between the best-performing teacher model and the aggregated student model across different aggregation techniques. The goal is to evaluate how quickly each method minimizes this difference, indicating the student’s learning progress.

4.5.1. Performance Analysis

The following observations summarize the accuracy difference trends over multiple training rounds:

Byzantine-Resilient aggregation achieves the fastest convergence, reducing the accuracy difference to 0.53 percentage points at Round 5, indicating strong robustness and efficient knowledge transfer.
Mean aggregation steadily reduces the gap, reaching its minimum difference of 1.06 percentage points at Round 6, showing a balanced learning curve.
Weighted aggregation also shows improvement, but it takes longer, with the smallest accuracy difference being 1.95 percentage points at Round 7.
Krum aggregation demonstrates slower convergence, with more fluctuations, and a minimum difference of 2.48 percentage points at Round 6.

Overall, these trends suggest that more robust aggregation strategies, particularly Byzantine-Resilient methods, lead to faster convergence and better student performance.

4.5.2. Visualization of Accuracy Difference Trends

Figure 10 illustrates how the accuracy difference between the best teacher and the aggregated student model evolves over training rounds for various aggregation methods.

4.5.3. Optimal Number of Rounds

The results highlight that while all aggregation methods help the student model approach the best teacher’s accuracy, their efficiency varies. Byzantine-Resilient aggregation is the most effective, rapidly reducing the accuracy gap. Mean and Weighted aggregation perform well but take more rounds to achieve optimal convergence, whereas Krum aggregation shows more fluctuation, indicating sensitivity to training conditions.

To provide a structured reference, Table 11 presents the optimal and recommended number of rounds for each aggregation method, based on the round where the accuracy difference from the best teacher reaches its minimum.

These insights emphasize the importance of choosing robust aggregation techniques and tuning training rounds effectively to optimize knowledge transfer and student learning in Multi-Teacher KD frameworks.
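The selection rule used for the optimal number of rounds, namely the round at which the gap to the best teacher is smallest, can be written as a short helper. The accuracies in the example are placeholder values chosen for illustration only and are not results from the paper; the function name is hypothetical.

```python
def best_round(best_teacher_acc, student_acc_per_round):
    """Return (round_number, gap) for the round with the smallest teacher-student gap.

    Rounds are numbered from 1; ties resolve to the earliest round.
    """
    gaps = [best_teacher_acc - acc for acc in student_acc_per_round]
    idx = min(range(len(gaps)), key=gaps.__getitem__)
    return idx + 1, gaps[idx]


# Placeholder accuracies (NOT results from the paper).
round_number, gap = best_round(0.99, [0.90, 0.95, 0.97, 0.96])
```

Resolving ties in favor of the earliest round reflects the trade-off between accuracy and computational cost, since each extra round adds a full training pass.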

4.6. Evaluation of PD vs. DL Aggregation

This section examines the comparative performance of probability-based aggregation (PD) and distillation loss-based aggregation (DL) across different Multi-Teacher KD methods. The objective is to determine which aggregation strategy is optimal for different aggregation techniques.

To provide a structured overview, Table 12 presents the best-performing aggregation level (PD or DL aggregation) for each aggregation method.

4.6.1. Aggregation Methods Where Loss-Based Aggregation Outperforms Probability-Based Aggregation

Figure 11 illustrates the aggregation methods where DL-based (loss-based) aggregation surpasses probability-based aggregation. These methods benefit from optimizing knowledge transfer at the loss function level, resulting in higher stability and improved final accuracy.

4.6.2. Aggregation Methods Where Probability-Based Aggregation Outperforms Loss-Based Aggregation

Figure 12 presents the aggregation methods, where probability-based aggregation achieves superior accuracy compared to loss-based aggregation. These results suggest that, for these methods, directly aggregating teacher probability outputs leads to improved convergence and generalization.

4.6.3. Interpretation of Results

The results from this evaluation provide deeper insight into how different aggregation strategies impact the learning process in Multi-Teacher KD. Unlike the previous section, which focused on individual method performance, this part highlights when and why probability-based or loss-based aggregation performs better.

Loss-based aggregation leads to more stable learning in methods like Krum, Weighted, and Median aggregation. These approaches seem to benefit from refining student training signals at the loss level, allowing the model to learn in a smoother, more structured way.
Probability-based aggregation tends to generalize better for methods like Byzantine Resilient and Mean aggregation. In these cases, relying on the direct probability outputs of multiple teachers provides a more effective guiding signal for the student model, reducing the need for complex loss-based adjustments.
No single strategy is universally best as the effectiveness of an aggregation technique depends on the context. This suggests that a hybrid approach that dynamically switches between probability and loss-based aggregation could be an even better strategy, depending on the training conditions and the presence of adversarial influences.

These insights reinforce the importance of choosing an aggregation method that aligns with the nature of the teacher models and the characteristics of the student network. The next section will explore how these strategies influence the overall stability and convergence of the distillation process.

4.7. Evaluation of PD-Level and DL-Level Aggregation Strategies

To assess the effectiveness of different aggregation strategies, we analyze the accuracy and F1-score gaps between the student models and the best teacher models over multiple training rounds. The best teacher models were selected based on the highest accuracy and F1-score from Table 8. Specifically, we compared the student models against MobileNet V1 for accuracy, as it had the highest accuracy (99.82%), and ResNet50 for the F1-score, as it demonstrated the best F1-score (99.65%).

4.7.1. Performance of PD-Level Aggregation

PD-level aggregation combines the soft probability outputs of multiple teachers to guide the student’s learning. It offers a consensus target but may be affected by conflicting teacher predictions.

We assess its impact by analyzing accuracy and F1-score across training rounds.

Figure 13 and Figure 14 illustrate the accuracy and F1-score differences when applying PD-level aggregation techniques. This evaluation captures how closely the student models approximate the performance of the best teachers over multiple training rounds.

PD-level aggregation exhibits varying levels of effectiveness across different methods. Byzantine-Resilient and Weighted aggregations consistently minimize the accuracy and F1-score gaps, demonstrating strong resilience against adversarial conditions. In contrast, Krum-based aggregation methods exhibit greater fluctuations, which suggests sensitivity to noisy updates. The Median aggregation approach stabilizes the performance gap over training rounds but does not always yield the lowest discrepancy from the teacher models.

4.7.2. Performance of DL-Level Aggregation

DL-Level aggregation focuses on combining the individual distillation losses between each teacher and the student, rather than merging their predictions directly. This approach emphasizes optimization signals, offering better control and robustness, especially when teacher outputs are diverse or conflicting.

We evaluate its effectiveness by analyzing the accuracy and F1-score differences between the student and the best teacher over multiple training rounds.

Figure 15 and Figure 16 present the accuracy and F1-score differences using DL-level aggregation strategies. These methods generally show a more stable and controlled reduction in the teacher–student performance gap compared to PD-level aggregation.

Among the different techniques, Krum and Median aggregations demonstrate their effectiveness in reducing both accuracy and F1-score differences, reinforcing their potential for convergence in distillation-based learning settings. The Byzantine-Resilient strategy maintains a steady reduction in the teacher–student gap, highlighting its robustness in handling diverse teacher outputs. Additionally, Weighted aggregation consistently performs well, further supporting its adaptability in multi-teacher learning frameworks.

Overall, DL-level aggregation strategies exhibit greater stability in reducing the discrepancy with the best teacher models, making them promising approaches for improving student model performance in resource-constrained environments.

4.8. Multi-Round Aggregation with Varying Techniques per Round

As introduced in Section 3.4, this experiment evaluates the impact of applying different aggregation techniques across multiple rounds of training. Unlike traditional approaches that use a single aggregation method throughout training, this strategy changes the aggregation method at each round to assess its influence on convergence, stability, and final accuracy.

To analyze the effect of varying aggregation techniques, we conduct two separate experiments: one where aggregation techniques change across both levels (mixed-level aggregation) and another where aggregation remains at the same level throughout training (same-level aggregation).

4.8.1. Mixed-Level Aggregation Across Rounds

In this experiment, we explore the effects of dynamically switching both the aggregation technique and the aggregation level across training rounds. Unlike traditional KD approaches that use a fixed aggregation strategy throughout training, this setup changes the aggregation method from round to round. For instance, one round may employ Byzantine-Resilient PD aggregation, while another may use Krum-based DL aggregation. This approach aims to assess whether strategically adapting aggregation techniques enhances knowledge transfer and improves student model performance.

Table 13 summarizes the performance metrics of each round, highlighting the accuracy, weighted precision, weighted recall, and weighted F1-score achieved using different aggregation methods.

The results show that the Byzantine-Resilient Probability aggregation achieves one of the highest accuracies (98.23%), demonstrating its robustness against noisy teacher models. Additionally, Krum-based DL aggregation further enhances performance, reaching an accuracy of 98.76%. The final experiment using Mean-based DL aggregation leads to the best overall accuracy of 98.90%, indicating that some aggregation techniques may generalize better in later training rounds.

This study highlights the benefits of employing a multi-round, adaptive aggregation strategy, as different methods contribute uniquely to performance improvements over time. The next section examines the case where aggregation is kept at the same level across all training rounds.

4.8.2. Same-Level Aggregation Across Rounds

In contrast to the mixed-level approach, this experiment maintains a consistent aggregation level throughout training but evaluates two separate cases:

DL Aggregation: Aggregation is consistently applied at the distillation loss-level across all rounds. This experiment examines whether DL-based aggregation leads to smoother convergence and better generalization.
Probability Distribution Aggregation: Aggregation is performed at the probability distribution level in every round. This experiment assesses whether directly averaging teacher probability outputs results in superior model performance.

The results from these two cases are presented in Table 14 and Table 15. These tables summarize the accuracy, recall, precision, and F1-score across multiple rounds.

Table 14 presents the performance results for DL-based aggregation. The results indicate that DL aggregation provides a consistent performance increase across training rounds, showing stability and gradual improvement.

Table 15 presents the results for probability-based aggregation, showing gradual improvements across training rounds. The results indicate that directly aggregating teacher probability outputs leads to competitive performance gains.

Both experiments demonstrate that maintaining a fixed aggregation level contributes to steady model improvements. The results provide insights into how different aggregation methods influence student model performance in a stable training setting.

5. Discussion

The experiments carried out in this study evaluated the performance of several aggregation techniques within the MPMTD framework. The findings show that certain approaches, such as Byzantine-Resilient and Weighted aggregations, demonstrated strong stability and better generalization across tasks. By contrast, Krum-based aggregation, while robust in noisy environments, produced more variable outcomes.

A more fine-grained comparison between probability distribution–based aggregation and distillation loss–based aggregation revealed an interesting pattern. For methods such as Krum and Weighted, distillation loss–based aggregation provided consistent improvements. In contrast, probability-based aggregation aligned more naturally with approaches like Byzantine-Resilient and Mean, producing stronger results in those cases. These observations suggest that no single aggregation paradigm dominates universally; rather, the effectiveness of each depends on the interaction between aggregation type and the underlying method.

Another noteworthy result emerged from the use of adaptive multi-round aggregation. Allowing the student model to shift aggregation strategies across successive rounds led to improved convergence. This dynamic adaptation enabled the student to absorb complementary strengths from different techniques while reducing exposure to their weaknesses, resulting in a more balanced learning trajectory.

5.1. Advantages and Contributions

Overall, the proposed MPMTD approach contributes in several meaningful ways. First, it introduces progressive learning through multi-round distillation, where iterative rounds combined with varied aggregation strategies improve the student model’s ability to assimilate knowledge. Second, the framework shows resilience in the presence of noisy or unreliable teachers: techniques such as Byzantine-Resilient aggregation effectively suppress conflicting signals and highlight reliable ones. A further strength lies in its flexibility, as the framework accommodates both probability-based and distillation loss-based aggregation, offering adaptability for different problem contexts. Finally, the empirical results confirm that strategies like weighted distillation loss aggregation and Byzantine-Resilient aggregation lead to notable improvements in accuracy, robustness, and generalization.

5.2. Limitations and Challenges

Despite these benefits, several challenges remain. The computational cost of conducting multi-round distillation with diverse aggregation strategies is non-trivial and may hinder deployment in resource-constrained environments such as edge devices or IoT systems. Another limitation arises from the dependence on teacher selection: if the teacher models lack diversity or contain biases, the student is at risk of inheriting those weaknesses. Moreover, determining the appropriate number of distillation rounds is still unresolved; too few rounds limit knowledge transfer, while too many increase the risk of overfitting. Finally, although some strategies emerged as stronger overall performers, no universal “best” aggregation method was observed; the optimal choice appears to depend on the characteristics of both the teachers and the data.

5.3. Real-World Applications

Beyond methodological insights, the results have direct relevance for real-world deployment. For Internet of Things (IoT) devices, the robustness of Byzantine-Resilient aggregation helps mitigate unreliable or noisy sensor nodes, thereby improving the stability of applications such as smart healthcare monitoring and industrial automation. In unmanned aerial vehicles (UAVs), the adaptive multi-round strategy is particularly beneficial under strict resource and latency constraints, as it allows lightweight yet accurate models to be deployed onboard for tasks such as real-time defect detection, navigation, and environmental monitoring. Similarly, in robotics, where perception and decision-making must remain reliable in dynamic and uncertain environments, the generalization gains observed with Weighted and distillation loss–based aggregation ensure more consistent performance. These connections illustrate that the MPMTD framework is not only an academic contribution but also a practical enabler of efficient, reliable AI deployment across IoT, UAV, and robotic systems.

5.4. Future Research Directions

These findings suggest several promising directions for future work. First, to fully establish the generalizability and practical impact of this approach, validation across a wider range of benchmark datasets and rigorous performance benchmarking on physical resource-constrained edge devices will be essential. Furthermore, ablation studies are needed to quantify the individual contribution of key components, such as the number of teachers, the cross-modal setting, and the choice of aggregation level. On the methodological front, promising directions include the development of self-adaptive mechanisms that allow the framework to choose aggregation strategies dynamically in response to the student’s performance and the application of meta-learning techniques to optimize distillation schedules, thereby reducing reliance on manual experimentation. Insights from Federated Learning, such as momentum-based variants of FedAvg, may also help improve resistance to adversarial or unreliable teachers. Extending the framework into distributed and decentralized environments, particularly cloud-edge ecosystems, could further enhance scalability. Finally, a deeper investigation into the role of teacher diversity, both in terms of architecture and domain knowledge, would provide valuable guidance on how best to configure multi-teacher distillation systems.

6. Conclusions

The MPMTD framework advances multi-teacher KD by applying dynamic and robust aggregation strategies across successive distillation rounds. Experimental results indicate that multi-round adaptive aggregation, combined with techniques such as Byzantine-Resilient and Weighted aggregation, improves convergence speed and model accuracy. Notably, to the best of our knowledge, this work is the first to apply Byzantine-Resilient aggregation in a KD setting, where it achieved the best overall student performance, with an accuracy of 99.29% and an F1-score of 99.27%. However, challenges remain, particularly in automatically optimizing distillation parameters and reducing computational overhead.

Future research should focus on self-adaptive aggregation mechanisms and federated-inspired learning techniques to further enhance robustness and efficiency. In conclusion, this study paves the way for more flexible and high-performance KD strategies, with promising applications in edge AI, IoT, and distributed learning environments.

Author Contributions

Conceptualization, A.H. and H.N.N.; methodology, A.H. and H.N.N.; investigation, A.H. and H.N.N.; writing—original draft preparation, A.H. and H.N.N.; writing—review and editing, A.H. and H.N.N.; visualization, A.H. and H.N.N.; supervision, H.N.N. and J.A.; project administration, H.N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by funds from the Bourgogne-Franche-Comté Region (AAP Région 2022 dispositif ANER-IA Limentaire project) and by the EIPHI Graduate School (contract “ANR-17-EURE-0002”).

Data Availability Statement

No new data were created or analyzed in this study. Data are contained within the article.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5, OpenAI) for the purpose of improving the clarity of language. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ML	Machine Learning
KD	Knowledge Distillation
CNN	Convolutional Neural Network
IoT	Internet of Things
MT-KD	Multi-Teacher Knowledge Distillation
MPMTD	Multi-Round Parallel Multi-Teacher Distillation
CA-MKD	Confidence-Aware Multi-Teacher KD
DL	Distillation Loss
PD	Probability Distribution
PL	Probability Level
KL	Kullback–Leibler
RGB	Red, Green, Blue (Color Channels)
DDKD	Direct Data Knowledge Distillation
SKD	Self-Knowledge Distillation
TS	Teacher–Student
ViTs	Vision Transformers

References

1. Sharifani, K.; Amini, M. Machine learning and deep learning: A review of methods and applications. World Inf. Technol. Eng. J. 2023, 10, 3897–3904.
2. Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv. 2023, 55, 1–37.
3. Capra, M.; Peloso, R.; Masera, G.; Ruo Roch, M.; Martina, M. Edge computing: A survey on the hardware requirements in the internet of things world. Future Internet 2019, 11, 100.
4. Kim, H.; Khan, M.U.K.; Kyung, C.M. Efficient neural network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12569–12577.
5. Phuong, M.; Lampert, C. Towards understanding knowledge distillation. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5142–5151.
6. Chen, C.; Zhang, P.; Zhang, H.; Dai, J.; Yi, Y.; Zhang, H.; Zhang, Y. Deep learning on computational-resource-limited platforms: A survey. Mob. Inf. Syst. 2020, 2020, 8454327.
7. Liang, X.; Wu, L.; Li, J.; Qin, T.; Zhang, M.; Liu, T.Y. Multi-teacher distillation with single model for neural machine translation. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 992–1002.
8. Ye, X.; Jiang, R.; Tian, X.; Zhang, R.; Chen, Y. Knowledge Distillation via Multi-Teacher Feature Ensemble. IEEE Signal Process. Lett. 2024, 31, 566–570.
9. Zhu, E.Y.; Zhao, C.; Yang, H.; Li, J.; Wu, Y.; Ding, R. A Comprehensive Review of Knowledge Distillation-Methods, Applications, and Future Directions. Int. J. Innov. Res. Comput. Sci. Technol. 2024, 12, 106–112.
10. Long, J.; Yin, Z.; Han, Y.; Huang, W. MKDAT: Multi-Level Knowledge Distillation with Adaptive Temperature for Distantly Supervised Relation Extraction. Information 2024, 15, 382.
11. Jiang, Y.; Feng, C.; Zhang, F.; Bull, D. MTKD: Multi-teacher knowledge distillation for image super-resolution. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 364–382.
12. Zhang, T.; Liu, Y. MTUW-GAN: A Multi-Teacher Knowledge Distillation Generative Adversarial Network for Underwater Image Enhancement. Appl. Sci. 2024, 14, 529.
13. Krichen, M.; Abdalzaher, M.S.; Shaaban, M.; Aburukba, R. Lightweight AI for Drones: A Survey. In Proceedings of the 2025 7th International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE), Moscow, Russia, 8–10 April 2025; pp. 1–6.
14. Li, W.; Hu, D.; Wang, D. The Design of a Lightweight AI Robot for Logistics Handling. In Proceedings of the 2024 10th International Conference on Mechanical and Electronics Engineering (ICMEE), Xi’an, China, 27–29 December 2024; pp. 195–200.
15. Chen, L.; Ding, Q.; Zou, Q.; Chen, Z.; Li, L. DenseLightNet: A light-weight vehicle detection network for autonomous driving. IEEE Trans. Ind. Electron. 2020, 67, 10600–10609.
16. Sipola, T.; Alatalo, J.; Kokkonen, T.; Rantonen, M. Artificial intelligence in the IoT era: A review of edge AI hardware and software. In Proceedings of the 2022 31st Conference of Open Innovations Association (FRUCT), Helsinki, Finland, 27–29 April 2022; pp. 320–331.
17. Asal, B.; Can, A.B. Ensemble-Based Knowledge Distillation for Video Anomaly Detection. Appl. Sci. 2024, 14, 1032.
18. Nanayakkara, S.I.; Pokhrel, S.R.; Li, G. Understanding global aggregation and optimization of federated learning. Future Gener. Comput. Syst. 2024, 159, 114–133.
19. Khraisat, A.; Alazab, A.; Singh, S.; Jan, T.; Gomez, A., Jr. Survey on federated learning for intrusion detection system: Concept, architectures, aggregation strategies, challenges, and future directions. ACM Comput. Surv. 2024, 57, 1–38.
20. Ni, L.; Gong, X.; Li, J.; Tang, Y.; Luan, Z.; Zhang, J. rfedfw: Secure and trustable aggregation scheme for byzantine-robust federated learning in internet of things. Inf. Sci. 2024, 653, 119784.
21. Cheng, X.; Zhang, Z.; Weng, W.; Yu, W.; Zhou, J. DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy. Mathematics 2024, 12, 1672.
22. Mu, X.; Antwi-Afari, M.F. The applications of Internet of Things (IoT) in industrial management: A science mapping review. Int. J. Prod. Res. 2024, 62, 1928–1952.
23. Hassanalian, M.; Abdelkefi, A. Classifications, applications, and design challenges of drones: A review. Prog. Aerosp. Sci. 2017, 91, 99–131.
24. Reddy, N.V.; Reddy, A.; Pranavadithya, S.; Kumar, J.J. A critical review on agricultural robots. Int. J. Mech. Eng. Technol. 2016, 7, 183–188.
25. Hoffpauir, K.; Simmons, J.; Schmidt, N.; Pittala, R.; Briggs, I.; Makani, S.; Jararweh, Y. A survey on edge intelligence and lightweight machine learning support for future applications and services. ACM J. Data Inf. Qual. 2023, 15, 1–30.
26. Hu, X.; Chu, L.; Pei, J.; Liu, W.; Bian, J. Model complexity of deep learning: A survey. Knowl. Inf. Syst. 2021, 63, 2585–2619.
27. Waisberg, E.; Ong, J.; Masalkhi, M.; Kamran, S.A.; Zaman, N.; Sarker, P.; Lee, A.G.; Tavakkoli, A. GPT-4: A new era of artificial intelligence in medicine. Ir. J. Med. Sci. (1971-) 2023, 192, 3197–3200.
28. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128.
29. Gao, Y.; Liu, Y.; Zhang, H.; Li, Z.; Zhu, Y.; Lin, H.; Yang, M. Estimating GPU memory consumption of deep learning models. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, 8–13 November 2020; pp. 1342–1352.
30. Sree, S.R.; Vyshnavi, S.; Jayapandian, N. Real-world application of machine learning and deep learning. In Proceedings of the 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 27–29 November 2019; pp. 1069–1073.
31. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
32. Liu, Z. Fermatean fuzzy similarity measures based on Tanimoto and Sørensen coefficients with applications to pattern classification, medical diagnosis and clustering analysis. Eng. Appl. Artif. Intell. 2024, 132, 107878.
33. Kaleem, S.; Sohail, A.; Babar, M.; Ahmad, A.; Tariq, M.U. A hybrid model for energy-efficient Green Internet of Things enabled intelligent transportation systems using federated learning. Internet Things 2024, 25, 101038.
34. Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10558–10578.
35. Lang, J.; Guo, Z.; Huang, S. A comprehensive study on quantization techniques for large language models. In Proceedings of the 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC), Xiamen, China, 27–29 December 2024; pp. 224–231.
36. Cai, G.; Li, J.; Liu, X.; Chen, Z.; Zhang, H. Learning and compressing: Low-rank matrix factorization for deep neural network compression. Appl. Sci. 2023, 13, 2704.
37. Zhou, Y.; Chen, S.; Wang, Y.; Huan, W. Review of research on lightweight convolutional neural networks. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1713–1720.
38. Gou, J.; Yu, B.; Maybank, S.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
39. Zhang, Y.; Zhou, G.; Xie, Z.; Ma, J.; Huang, J.X. A diversity-enhanced knowledge distillation model for practical math word problem solving. Inf. Process. Manag. 2025, 62, 104059.
40. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962.
41. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
42. Zhang, H.; Chen, D.; Wang, C. Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning. arXiv 2023, arXiv:2306.06634.
43. Loussaief, E.B.; Rashwan, H.A.; Ayad, M.; Khalid, A.; Puig, D. Adaptive weighted multi-teacher distillation for efficient medical imaging segmentation with limited data. Knowl.-Based Syst. 2025, 315, 113196.
44. Xu, L.; Wang, Z.; Bai, L.; Ji, S.; Ai, B.; Wang, X.; Philip, S.Y. Multi-Level Knowledge Distillation with Positional Encoding Enhancement. Pattern Recognit. 2025, 163, 111458.
45. Yang, C.; An, Z.; Cai, L.; Xu, Y. Hierarchical self-supervised augmented knowledge distillation. arXiv 2021, arXiv:2107.13715.
46. Li, W.; Wang, J.; Ren, T.; Li, F.; Zhang, J.; Wu, Z. Learning accurate, speedy, lightweight CNNs via instance-specific multi-teacher knowledge distillation for distracted driver posture identification. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17922–17935.
47. Ma, Y.; Jiang, X.; Guan, N.; Yi, W. Anomaly detection based on multi-teacher knowledge distillation. J. Syst. Archit. 2023, 138, 102861.
48. Bai, Y.; Wang, Z.; Xiao, J.; Wei, C.; Wang, H.; Yuille, A.L.; Zhou, Y.; Xie, C. Masked autoencoders enable efficient knowledge distillers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24256–24265.
49. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep Mutual Learning. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018.
50. Liu, Y.; Cao, J.; Li, B.; Yuan, C.; Hu, W.; Li, Y.; Duan, Y. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7096–7104.
51. Liu, Y.; Zhang, K.; Hou, C.; Wang, J.; Ji, R. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing 2020, 415, 106–113.
52. Zhang, L.; Bao, C.; Ma, K. Self-distillation: Towards efficient and compact neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4388–4403.
53. Zhang, H.; Chen, D.; Wang, C. Confidence-aware multi-teacher knowledge distillation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 4498–4502.
54. Cao, S.; Li, M.; Hays, J.; Ramanan, D.; Wang, Y.X.; Gui, L. Learning lightweight object detectors via multi-teacher progressive distillation. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 3577–3598.
55. Mukherjee, S.; Awadallah, A. XtremeDistil: Multi-stage distillation for massive multilingual models. arXiv 2020, arXiv:2004.05686.
56. Lee, Y.; Wu, W. QMKD: A Two-Stage Approach to Enhance Multi-Teacher Knowledge Distillation. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7.
57. Sun, T.; Li, D.; Wang, B. Decentralized federated averaging. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4289–4301.
58. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
59. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. A novel framework for the analysis and design of heterogeneous federated learning. IEEE Trans. Signal Process. 2021, 69, 5234–5249.
60. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. Adv. Neural Inf. Process. Syst. 2020, 33, 3557–3568.
61. Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated learning with personalization layers. arXiv 2019, arXiv:1912.00818.
62. Wei, W.; Liu, L.; Wu, Y.; Su, G.; Iyengar, A. Gradient-leakage resilient federated learning. In Proceedings of the 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), Washington, DC, USA, 7–10 July 2021; pp. 797–807.
63. De Carvalho, M. Mean, what do you Mean? Am. Stat. 2016, 70, 270–274.
64. Dor, D.; Zwick, U. Selecting the median. SIAM J. Comput. 1999, 28, 1722–1758.
65. Taheri, R.; Arabikhan, F.; Gegov, A.; Akbari, N. Robust aggregation function in federated learning. In Proceedings of the International Conference on Information and Knowledge Systems, Portsmouth, UK, 22–23 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 168–175.
66. So, J.; Güler, B.; Avestimehr, A.S. Byzantine-resilient secure federated learning. IEEE J. Sel. Areas Commun. 2020, 39, 2168–2181.
67. Mylonas, N.; Malounas, I.; Mouseti, S.; Vali, E.; Espejo-Garcia, B.; Fountas, S. Eden library: A long-term database for storing agricultural multi-sensor datasets from uav and proximal platforms. Smart Agric. Technol. 2022, 2, 100028.

Figure 1. Taxonomy of lightweight machine learning techniques, including neural network compression and KD.

Figure 2. Illustration of the Successive aggregation approach. The student learns sequentially from multiple teachers in different steps, progressively distilling knowledge.

Figure 3. Illustration of the Parallel aggregation approach. The student aggregates knowledge from multiple teachers simultaneously through different aggregation mechanisms.

Figure 4. Aggregation at the PD Level. This figure illustrates how multiple teacher models’ probability distributions are combined to guide the student model’s learning process.

Figure 5. Aggregation at the DL Level. This figure demonstrates how distillation losses from multiple teachers are aggregated to guide the student model’s learning process.

Figure 6. Illustration of the proposed framework, showing MPMTD. Knowledge from multiple teachers is iteratively transferred to the student model using both fixed and adaptive aggregation strategies.

Figure 7. Methodology Overview: The proposed framework leverages multiple teacher models and a KD process to train a frugal student model. The configuration includes the selection of an aggregation function and an adaptive training strategy to optimize the model’s learning process.

Figure 8. Sample images from the EDEN Library Dataset, illustrating variations in plant health conditions: (a) Broccoli; (b) Chinese cabbage; (c) Mandarin tree; (d) Zucchini.

Figure 9. Student model architecture.

Figure 10. Accuracy difference trends for different aggregation methods over training rounds. The decreasing gap indicates improved student learning. The dashed red line marks the minimum accuracy difference achieved.

Figure 11. Comparison of accuracy trends for aggregation methods where loss-based aggregation outperforms probability-based aggregation. The DL-based versions demonstrate higher stability and better final accuracy.

Figure 12. Comparison of accuracy trends for aggregation methods where probability-based aggregation outperforms loss-based aggregation. These methods exhibit better peak accuracy when aggregating teacher outputs at the probability distribution level.

Figure 13. PL-Level Aggregation: Accuracy difference from the best teacher over training rounds for different methods.

Figure 14. PL-Level Aggregation: F1-score difference from the best teacher over training rounds for different methods.

Figure 15. DL-Level Aggregation: Accuracy difference from the best teacher over training rounds for different methods.

Figure 16. DL-Level Aggregation: F1-score difference from the best teacher over training rounds for different methods.

Table 1. Table of variables.

Variable	Description
$T$	Set of N teacher models.
$T_{i}$	The i-th teacher model.
$P_{T_{i}} (x)$	Probability distribution produced by teacher $T_{i}$ for input x.
$P_{agg} (x)$	Aggregated probability distribution from all teachers for input x.
N	Number of teacher models.
$w_{i}$	Weight assigned to teacher $T_{i}$ in Weighted Mean Aggregation.
$d (T_{i}, T_{j})$	Pairwise Euclidean distance between predictions of teachers $T_{i}$ and $T_{j}$.
$S (T_{i})$	Sum of distances from teacher $T_{i}$ to its nearest $N - f - 2$ teachers.
f	Number of faulty or adversarial teachers.
$T^{*}$	Selected teacher with the smallest sum of distances in Krum Aggregation.
c	Number of extreme teacher predictions to trim in Byzantine-Resilient aggregation.
$O$	Set of the highest and lowest c teacher predictions in Byzantine-Resilient Aggregation.
$\alpha$	Weighting factor balancing cross-entropy loss and KL divergence loss in KD.
$y_{true}$	True labels for the input data.
$y_{student}$	Predictions of the student model with temperature $T = 1$.
$P_{teacher}^{T}$	Softened output probabilities of the teacher model with temperature T.
$P_{student}^{T}$	Softened output probabilities of the student model with temperature T.
CE	Cross-entropy loss function.
KL	Kullback–Leibler divergence loss function.
$\lambda_{t}$	Adaptive coefficient balancing aggregation methods at distillation loss (DL) and probability distribution (Proba) levels in adaptive aggregation.
$Agg_{fixed}$	Fixed mathematical aggregation method used in fixed aggregation strategy.
$X_{i}$	Input at the selected aggregation level (either distillation loss $DL_{i}$ or probability distribution $Proba_{i}$).
$S_{t}$	Student model parameters at round t.
$Agg_{t}$	Aggregation method applied at round t in adaptive aggregation.
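As an illustration of how the symbols in Table 1 fit together, the following is a minimal NumPy sketch of a single-teacher KD loss. It assumes the common formulation from Hinton et al. [41], in which $\alpha$ weights the cross-entropy term, $(1 - \alpha)$ weights the KL term, and the KL term is scaled by $T^{2}$; the exact weighting used in this work may differ. The function names `softmax` and `kd_loss` and the default values of `alpha` and `temperature` are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, y_true, alpha=0.5, temperature=4.0, eps=1e-12):
    """alpha * CE(y_true, y_student) + (1 - alpha) * T^2 * KL(P_teacher^T || P_student^T).

    Assumption: y_true holds integer class indices, one per sample.
    """
    y_student = softmax(student_logits, 1.0)            # student predictions at T = 1
    p_student = softmax(student_logits, temperature)    # softened student output
    p_teacher = softmax(teacher_logits, temperature)    # softened teacher output
    n = len(y_true)
    ce = -np.mean(np.log(y_student[np.arange(n), y_true] + eps))
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1))
    return alpha * ce + (1.0 - alpha) * temperature**2 * kl

# Toy batch: two samples, three classes.
student = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
teacher = np.array([[3.0, 0.2, -2.0], [0.0, 2.0, 0.1]])
labels = np.array([0, 1])
print(kd_loss(student, teacher, labels))
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains.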

Table 2. Challenges in lightweight edge machine learning.

Issue	Description
Scalability	The difficulty of deploying models across distributed or federated systems, where bandwidth and computational resources are unevenly distributed.
Hardware Constraints	The limited computational capabilities of edge devices and embedded systems.
Cost-Efficiency	The financial and environmental cost of running large-scale models in production.

Table 3. Summary of neural network compression techniques.

Technique	Description	Advantages	Disadvantages
Pruning	Removing less important connections/neurons	Reduces model size and computational cost	Can require retraining; unstructured pruning can be difficult to accelerate on hardware
Quantization	Reducing precision of weights/activations	Reduces memory footprint and can improve inference speed	Can lead to accuracy loss if not performed carefully
Low-Rank Factorization	Decomposing weight matrices	Reduces the number of parameters	Can be computationally expensive; may not be suitable for all architectures
Compact Architectures	Designing efficient architectures	Reduces model size and computational cost from the outset	Requires careful architectural design; may not achieve the same accuracy as larger models
KD	Transferring knowledge from a large teacher to a smaller student	Can improve student performance beyond training from scratch	Requires training a large teacher model

Table 4. Steps in Pre-Aggregation, Aggregation, and Post-Aggregation for different techniques.

Aggregation Technique	Pre-Aggregation	Aggregation	Post-Aggregation
Mean	Collect all teacher outputs	Compute average of values	Store and analyze trends
Median	Sort data	Select middle value	Reduce impact of outliers
Weighted Mean	Assign weights based on importance	Compute weighted sum	Adjust based on reliability scores
Krum	Compute pairwise distances between predictions	Select prediction closest to majority	Improve robustness against adversarial inputs
Byzantine-Resilient	Sort and identify extreme values	Remove top and bottom outliers, then compute mean	Ensure robustness against corrupted teachers

Table 5. Summary of aggregation techniques: description, steps, and mathematical formulation.

Aggregation Technique	Description	Steps	Mathematical Formulation
Mean [63]	Computes the average of teacher predictions. This method is computationally efficient but sensitive to outliers.	Collect predictions from all teachers $T_{1}, T_{2}, \dots, T_{N}$. Compute the arithmetic mean of the predictions.	$P_{agg} (x) = \frac{1}{N} \sum_{i = 1}^{N} P_{T_{i}} (x)$
Median [64]	Uses the median of teacher predictions to reduce sensitivity to outliers. This method is robust to extreme values.	Collect predictions from all teachers $T_{1}, T_{2}, \dots, T_{N}$. Sort the predictions and select the middle value.	$P_{agg} (x) = \operatorname{median} \{ P_{T_{1}} (x), \dots, P_{T_{N}} (x) \}$
Weighted Mean [51]	Assigns weights to teachers based on their reliability or accuracy. This method prioritizes more reliable teachers.	Assign weights $w_{i}$ to each teacher $T_{i}$, where $\sum_{i = 1}^{N} w_{i} = 1$. Compute the weighted sum of the predictions.	$P_{agg} (x) = \sum_{i = 1}^{N} w_{i} P_{T_{i}} (x)$
Krum [65]	Selects the teacher whose prediction is closest to the majority. This method is robust to adversarial or unreliable teachers.	Compute pairwise Euclidean distances between teacher predictions. For each teacher, sum the distances to the nearest $N - f - 2$ teachers. Select the teacher with the smallest sum of distances.	$T^{*} = \arg\min_{T_{i}} S (T_{i}), \quad S (T_{i}) = \sum_{T_{j} \in N_{i}} d (T_{i}, T_{j})$
Byzantine-Resilient [66]	Removes a fraction of extreme teacher predictions before computing the mean. This method is robust to adversarial or unreliable teachers.	Sort teacher predictions along the teacher axis. Remove the top c and bottom c predictions. Compute the mean of the remaining predictions.	$P_{agg} (x) = \frac{1}{N - 2 c} \sum_{i \notin O} P_{T_{i}} (x)$
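The five formulations in Table 5 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments: the function names are hypothetical, teacher outputs are assumed to be stacked in an array `P` whose first axis indexes the N teachers, trimming and the median are applied coordinate-wise, and the final renormalisation of the median and trimmed mean (so that the result sums to one) is an added assumption not stated in the table.

```python
import numpy as np

def _renormalise(p):
    # Assumption: rescale so the aggregated vector is again a probability distribution.
    return p / p.sum(axis=-1, keepdims=True)

def mean_agg(P):
    """Arithmetic mean over the teacher axis."""
    return P.mean(axis=0)

def median_agg(P):
    """Coordinate-wise median over the teacher axis."""
    return _renormalise(np.median(P, axis=0))

def weighted_mean_agg(P, w):
    """Weighted sum with weights normalised to sum to one."""
    w = np.asarray(w, dtype=float)
    return np.tensordot(w / w.sum(), P, axes=1)

def krum_agg(P, f):
    """Return the prediction of the teacher with the smallest summed
    Euclidean distance to its N - f - 2 nearest peers."""
    n = P.shape[0]
    k = n - f - 2
    if k < 1:
        raise ValueError("Krum needs N - f - 2 >= 1")
    flat = P.reshape(n, -1)
    d = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)  # (N, N) distances
    nearest = np.sort(d, axis=1)[:, 1:k + 1]  # column 0 is the zero self-distance
    return P[np.argmin(nearest.sum(axis=1))]

def byzantine_resilient_agg(P, c):
    """Drop the c largest and c smallest values per coordinate, then average."""
    n = P.shape[0]
    if 2 * c >= n:
        raise ValueError("need N > 2c")
    trimmed = np.sort(P, axis=0)[c:n - c]
    return _renormalise(trimmed.mean(axis=0))

# Four teachers, three classes; the last teacher is a deliberate outlier.
P = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.8, 0.1, 0.1],
              [0.0, 0.1, 0.9]])
print("mean     ", mean_agg(P))
print("median   ", median_agg(P))
print("weighted ", weighted_mean_agg(P, [0.3, 0.3, 0.3, 0.1]))
print("krum     ", krum_agg(P, f=1))
print("byzantine", byzantine_resilient_agg(P, c=1))
```

In this toy example the plain mean is pulled toward the outlier's preferred class, whereas Krum and the trimmed mean stay close to the three agreeing teachers.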

Table 6. Comparison of aggregation levels in MT-KD.

Aggregation Level	Description	Characteristics
PD	Combines soft predictions from teachers into a single target distribution for the student	Simplifies the learning process but may dilute teacher-specific insights. Suitable for consistent teacher outputs.
DL	Computes and combines the distillation loss for each teacher independently	Preserves teacher diversity and allows flexibility but can be computationally intensive. Effective for heterogeneous teacher outputs.

Table 7. Advantages and limitations of aggregation levels.

Aggregation Level	Advantages	Limitations
PD	Simplifies the training process. Provides a unified target for the student. Suitable for consistent teacher outputs.	May dilute unique teacher insights. Assumes homogeneity in teacher predictions.
DL	Preserves individual teacher contributions. More effective for diverse or heterogeneous teacher outputs. Allows flexibility in handling teacher diversity.	Computationally expensive. Requires careful optimization to balance multiple loss terms.
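The distinction between the two aggregation levels in Tables 6 and 7 can be made concrete with a short sketch. It assumes the distillation loss is a KL divergence between teacher and student distributions and that teacher outputs are stacked as an array of shape (N, batch, classes); the function names are hypothetical and `aggregate` stands for any reduction that accepts an `axis` argument, such as `np.mean` or `np.median`.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) averaged over the batch; p and q have shape (batch, classes)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def pd_level_loss(teacher_probs, student_probs, aggregate=np.mean):
    """PD level: fuse the teacher distributions first, then compute one loss."""
    p_agg = aggregate(teacher_probs, axis=0)             # (batch, classes)
    p_agg = p_agg / p_agg.sum(axis=-1, keepdims=True)    # keep a valid distribution
    return kl_div(p_agg, student_probs)

def dl_level_loss(teacher_probs, student_probs, aggregate=np.mean):
    """DL level: compute one loss per teacher, then fuse the N scalar losses."""
    losses = np.array([kl_div(p_t, student_probs) for p_t in teacher_probs])
    return float(aggregate(losses, axis=0))

# Two teachers that disagree strongly on a single sample.
teachers = np.array([[[0.7, 0.2, 0.1]],
                     [[0.1, 0.2, 0.7]]])
student = np.array([[0.4, 0.2, 0.4]])
print("PD-level loss:", pd_level_loss(teachers, student))
print("DL-level loss:", dl_level_loss(teachers, student))
```

Here the averaged target coincides with the student's output, so the PD-level loss is essentially zero even though each teacher individually disagrees with the student; the DL-level loss remains clearly positive. This is one way to read the remark that PD-level fusion "may dilute teacher-specific insights". The DL level evaluates N losses per step instead of one, which is consistent with its higher computational cost.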

Table 8. Baseline performance of teacher models.

Teacher Model	Accuracy (%)	F1-Score (%)	Recall (%)	Precision (%)	Parameter Count	Parameter Ratio (to Student)
MobileNet V1	99.82	98.2	98.5	98.4	3,220,000	9.99
ResNet50	99.64	99.65	99.7	99.6	23,590,000	73.2
Xception	99.46	99.36	99.4	99.3	22,910,480	71.1
EfficientNetB0	99.61	98.6	98.9	98.7	5,300,000	16.45
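As a quick consistency check on Table 8, dividing each teacher's parameter count by its reported ratio should give roughly the same student size in every row. The short script below performs that arithmetic on the values transcribed from the table; the resulting student size is an implied estimate, not a figure stated in this section.

```python
# (parameter count, parameter ratio to student), transcribed from Table 8.
teachers = {
    "MobileNet V1": (3_220_000, 9.99),
    "ResNet50": (23_590_000, 73.2),
    "Xception": (22_910_480, 71.1),
    "EfficientNetB0": (5_300_000, 16.45),
}

# Implied student size = teacher parameters / reported ratio.
implied = {name: params / ratio for name, (params, ratio) in teachers.items()}

for name, size in implied.items():
    print(f"{name}: implied student size ~ {size:,.0f} parameters")
```

All four rows point to a student of roughly 322 thousand parameters, so the reported ratios are mutually consistent.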

Table 9. Comparison of student performance Without KD and With Single-Teacher KD (MobileNetV1) at 224 × 224 resolution.

Setting	Accuracy (%)	F1-Score (%)	Recall (%)	Precision (%)
Without KD	94.31	94.06	94.10	94.00
With KD (one Teacher)	96.85	96.66	96.70	96.62

Table 10. Performance metrics for different aggregation methods.

Aggregation/Teacher	Round Number	Accuracy (%)	Recall (%)	Precision (%)	F1-Score (%)
Best Teacher	-	99.82	98.5	98.4	98.2
Krum-Based PD	1	95.92	95.32	95.18	95.77
Krum-Based PD	13 *	97.52	97.12	97.24	97.38
Krum-Based DL	1	97.16	96.78	96.85	97.13
Krum-Based DL	6 *	98.76	98.45	98.62	98.73
Byzantine-Resilient DL	1	98.40	98.11	98.22	98.32
Byzantine-Resilient DL	2 *	98.58	98.47	98.51	98.50
Byzantine-Resilient PD	1	97.87	97.64	97.72	97.79
Byzantine-Resilient PD	5 *	99.29	99.18	99.21	99.27
Mean-Based PD	1	95.92	95.56	95.71	95.89
Mean-Based PD	6 *	98.76	98.63	98.67	98.68
Mean-Based DL	1	95.21	94.85	94.81	94.66
Mean-Based DL	10 *	98.40	98.31	98.33	98.29
Median-Based PD	1	95.92	95.50	95.63	95.60
Median-Based PD	10 *	98.40	98.28	98.32	98.31
Median-Based DL	1	94.33	94.07	94.12	93.90
Median-Based DL	11 *	98.58	98.44	98.48	98.34
M-Weighted PD	1	95.04	94.82	94.91	94.71
M-Weighted PD	9 *	98.05	97.86	97.92	97.99
M-Weighted DL	1	97.87	97.69	97.76	97.79
M-Weighted DL	15 *	98.76	98.71	98.74	98.75

Note: The first row reports the best individual teacher. The best result among all aggregation methods is Byzantine-Resilient PD at round 5 (99.29% accuracy, 99.27% F1-score). * Round number where the method achieved its best performance.

Table 11. Recommended training rounds for each aggregation method.

Aggregation Method	Best Round
Krum Aggregation (PD)	6
Krum Aggregation (DL)	6
Byzantine Resilient Aggregation (PD)	5
Byzantine Resilient Aggregation (DL)	2
Mean Aggregation (PD)	6
Mean Aggregation (DL)	3
Median Aggregation (PD)	5
Median Aggregation (DL)	5
Weighted Aggregation (PD)	7
Weighted Aggregation (DL)	6

Table 12. Best aggregation level (PD vs. DL) for each method.

Aggregation Method	Best Aggregation Level (PD or DL)
Krum Aggregation	DL Aggregation
Byzantine-Resilient Aggregation	PD Aggregation
Mean Aggregation	PD Aggregation
Median Aggregation	DL Aggregation
Weighted Aggregation	DL Aggregation

Table 13. Performance metrics for mixed-level aggregation across rounds.

Round	Aggregation Method	Accuracy (%)	Weighted Precision (%)	Weighted Recall (%)	Weighted F1-Score (%)
1	Byzantine-Resilient DL Aggregation	95.57	95.24	95.57	95.20
2	Byzantine-Resilient DL Aggregation	96.63	97.26	96.63	96.53
3	Byzantine-Resilient Probability Aggregation	96.99	97.46	96.99	96.96
4	Byzantine-Resilient Probability Aggregation	98.23	98.32	98.23	98.22
5	Krum-Based DL Aggregation	98.58	98.62	98.58	98.56
6	Krum-Based DL Aggregation	98.76	98.85	98.76	98.74
7	Mean-Based DL Aggregation	98.90	98.92	98.90	98.89
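The mixed-level schedule in Table 13 can be expressed as data and driven by a small loop, which makes it easy to try other orderings. The sketch below is illustrative only: `run_schedule`, `fake_train_round`, and `replay_accuracy` are hypothetical names, the "student" is a plain round counter standing in for the parameters $S_{t}$, and the evaluation function simply replays the accuracies reported in Table 13 rather than training a model.

```python
from typing import Callable, Dict, List, Tuple

# (aggregation method, aggregation level) applied at each round, from Table 13.
TABLE_13_SCHEDULE: List[Tuple[str, str]] = [
    ("byzantine_resilient", "DL"),
    ("byzantine_resilient", "DL"),
    ("byzantine_resilient", "PD"),
    ("byzantine_resilient", "PD"),
    ("krum", "DL"),
    ("krum", "DL"),
    ("mean", "DL"),
]

# Accuracy (%) reported in Table 13 for rounds 1 to 7.
TABLE_13_ACCURACY = [95.57, 96.63, 96.99, 98.23, 98.58, 98.76, 98.90]

def run_schedule(student, schedule, train_round: Callable, evaluate: Callable):
    """Apply the round-t aggregation method at round t and track the best round."""
    history: List[Dict] = []
    for t, (method, level) in enumerate(schedule, start=1):
        student = train_round(student, method, level)  # S_{t-1} -> S_t
        history.append({"round": t, "method": method, "level": level,
                        "accuracy": evaluate(student)})
    best = max(history, key=lambda r: r["accuracy"])
    return student, history, best

def fake_train_round(student, method, level):
    # Stand-in for one distillation round: only counts completed rounds.
    return student + 1

def replay_accuracy(student):
    # Stand-in for evaluation: looks up the value reported in Table 13.
    return TABLE_13_ACCURACY[student - 1]

final_student, history, best = run_schedule(
    0, TABLE_13_SCHEDULE, fake_train_round, replay_accuracy)
print(best)
```

Replacing the two stand-in functions with a real training step and a validation pass would turn the same loop into a simple search over aggregation orderings and round counts.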

Table 14. Performance metrics for DL aggregation across training rounds.

Round	Aggregation	Accuracy (%)	Recall (%)	Precision (%)	F1-Score (%)
1	Krum DL	95.70	95.55	95.65	95.60
2	Byzantine-Resilient DL	97.00	96.85	96.90	96.87
3	Mean DL	97.40	97.20	97.30	97.25
4	Weighted DL	97.80	97.65	97.70	97.68
5	Byzantine-Resilient DL	98.00	97.85	97.90	97.87
6	Median DL	98.20	98.05	98.10	98.07
7	Weighted DL	98.35	98.20	98.25	98.23

Table 15. Performance metrics for probability-based aggregation across training rounds.

Round	Aggregation	Accuracy (%)	Recall (%)	Precision (%)	F1-Score (%)
1	Krum Probability	95.56	95.40	95.50	95.45
2	Byzantine-Resilient Probability	96.89	96.70	96.80	96.75
3	Mean Probability	97.23	97.00	97.10	97.05
4	Weighted Probability	97.64	97.50	97.55	97.52
5	Byzantine-Resilient Probability	97.85	97.70	97.75	97.72
6	Median Probability	98.10	98.00	98.05	98.02
7	Weighted Probability	98.25	98.15	98.20	98.18

© 2025 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
