Article

MECO: Mixture-of-Expert Codebooks for Multiple Dense Prediction Tasks

Division of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
* Author to whom correspondence should be addressed.
Sensors 2025, 25(17), 5387; https://doi.org/10.3390/s25175387
Submission received: 14 July 2025 / Revised: 21 August 2025 / Accepted: 26 August 2025 / Published: 1 September 2025

Abstract

Autonomous systems operating in embedded environments require robust scene understanding under computational constraints. Multi-task learning offers a compact alternative to deploying multiple task-specific models by jointly solving dense prediction tasks. However, recent MTL models often suffer from entangled shared feature representations and significant computational overhead. To address these limitations, we propose Mixture-of-Expert Codebooks (MECO), a novel multi-task learning framework that leverages vector quantization to design Mixture-of-Experts with lightweight codebooks. MECO disentangles task-generic and task-specific representations and enables efficient learning across multiple dense prediction tasks such as semantic segmentation and monocular depth estimation. The proposed multi-task learning model is trained end-to-end using a composite loss that combines task-specific objectives and vector quantization losses. We evaluate MECO on a real-world driving dataset collected in challenging embedded scenarios. MECO achieves a +0.4% mIoU improvement in semantic segmentation and maintains comparable depth estimation accuracy to the baseline, while reducing model parameters and FLOPs by 18.33% and 28.83%, respectively. These results demonstrate the potential of vector quantization-based Mixture-of-Experts modeling for efficient and scalable multi-task learning in embedded environments.

1. Introduction

Modern autonomous systems such as unmanned vehicles and mobile robots require comprehensive scene understanding in dynamic environments with limited computational resources [1,2]. To achieve robust understanding, recent autonomous systems adopt vision-based deep learning models for dense prediction tasks, including semantic segmentation [3] and monocular depth estimation [4]. The development of large-scale foundation models [5,6] has significantly improved the performance of vision-based models, while also increasing computational requirements. However, most mobility platforms are equipped with limited computing resources due to power and manufacturing cost constraints, making it challenging to deploy large-scale models directly. In particular, embedded environments such as rural or mountainous regions cannot rely on real-time cloud services due to unstable communication infrastructure [7]. This constraint conflicts with the need for multiple task-specific models and large-scale architectures to enable comprehensive scene understanding [8].
Multi-task learning (MTL) aims to jointly learn multiple tasks from a single input image by leveraging diverse supervision to enhance scene understanding [9]. Pixel-level prediction, often referred to as dense prediction, includes downstream tasks such as segmentation, depth estimation, optical flow estimation, and surface normal prediction. These tasks typically adopt encoder–decoder architectures to match the spatial resolution between the input image and the visual output. To reduce computational redundancy, hard parameter sharing-based MTL approaches employ a unified architecture that shares a single encoder across tasks while using task-specific decoders [10]. However, the shared encoder often produces entangled feature representations—also referred to as negative transfer—where task-specific and task-agnostic information are mixed, hindering task-wise optimization [11]. Existing MTL architectures can be broadly categorized into encoder-focused [12,13] and decoder-focused [14,15] designs, both of which struggle to fully disentangle shared representations [16]. In contrast, MoE-based architectures [17,18] enhance latent disentanglement by explicitly partitioning representations into task- or domain-specific subspaces, enabling each expert to specialize and improving generalization to dense prediction tasks [19].
Among various approaches in MTL, Mixture-of-Experts-based methods offer a promising solution for improving computational efficiency and disentangling feature representations by enhancing model sparsity. Mixture-of-Experts (MoE) [20] arranges multiple expert networks in parallel and selectively activates the most relevant expert routes for a given input feature map. Recent MTL models [18,21] have leveraged task-specific routers and multiple shared expert networks to efficiently disentangle task features from encoder representations. While these methods have advanced feature disentanglement and task-specific routing, they still rely on computationally heavy expert networks that limit their applicability in resource-constrained environments. These limitations highlight the need for a lightweight alternative, such as vector quantization-based experts. Moreover, deploying a large-scale MTL model on low-power embedded devices requires multiple rounds of compression techniques such as knowledge distillation [22], pruning [23], and quantization [24]. To perform progressive compression while retaining the prior knowledge of the initial large-scale model, fine-tuning at each stage is essential. Multiple fine-tuning steps incur substantial time and computational overhead, making it crucial to propagate both task-generic and task-specific representations of the initial model.
In this paper, we propose Mixture-of-Expert Codebooks (MECO), a multi-task learning framework that leverages quantized codebooks as a Mixture-of-Experts for multiple dense prediction tasks. Vector quantization (VQ) [25] transforms the continuous latent space of a model into a finite set of discrete vectors (codebook), enabling compact representations and efficient memory access. In MECO, each expert codebook operates as a subset of the latent space, disentangling and routing the encoder representation into task-specific features. Since the transformation of task-specific features is performed via Euclidean distance-based codebook lookup, it significantly reduces memory overhead. However, as Euclidean distance in high-dimensional spaces may suffer from the curse of dimensionality, we jointly learn the codebook and latent features to preserve its discriminative power in the embedding space. In the context of model compression, fixed expert codebooks learned from a large-scale encoder can be transferred to fine-tuning steps, facilitating rapid convergence in compressed models.
We introduce a method for training expert codebooks from a large-scale MTL model and evaluate its effectiveness on real-world driving datasets collected in embedded environments. Our main contributions are summarized as follows.
  • We propose an end-to-end multi-task learning framework that learns task-generic and task-specific codebooks via vector quantization, disentangling shared representations into a discrete latent space for task-specific features.
  • We introduce a Mixture-of-Expert Codebooks design that substantially improves computational efficiency by replacing multiple expert modules in MoE with lightweight codebooks.
  • We present a real-world driving dataset for embedded environments, enabling evaluation of the proposed method’s efficiency gains and maintained task performance over prior approaches.
The rest of this paper is organized as follows. Section 2 presents related work. Section 3 explains the proposed MTL model architecture and loss functions for multiple dense prediction tasks. Section 4 and Section 5 present experimental results and conclusions.

2. Related Work

2.1. Multi-Task Learning

Multi-task learning trains a unified model to solve multiple related tasks simultaneously by sharing intermediate representations, improving generalization and reducing redundancy. Recent studies have explored the use of intermediate modules between the encoder and task decoders to better disentangle shared representations into task-specific components. Inverted Pyramid multi-task Transformer (InvPT) [26] uses parallel preliminary task decoders to generate task-specific features, followed by a multi-task UP-Transformer block that fuses them in a coarse-to-fine manner, enhancing cross-task interaction but increasing computational and memory overhead. TaskExpert [21] utilizes MoE to decompose the backbone representation and incorporates a multi-task feature memory as an additional expert to extract long-range task features from hierarchical encoder layers. However, as the number of experts increases, computational overhead becomes significant. To address this limitation, Mixture of Low-rank Experts (MLoRE) [18] replaces full experts with low-rank adaptation (LoRA) modules, preserving much of MoE’s disentanglement ability while improving efficiency. Nevertheless, although MoE provides strong disentanglement of shared representations, applying low-rank experts across multiple backbone layers still incurs non-trivial computational cost in resource-constrained settings.

2.2. Mixture-of-Experts

Mixture-of-Experts models ensemble the outputs of multiple specialized subnetworks via a routing mechanism, enabling sparse and adaptive computation. A notable study [20] proposed a top-k based MoE approach that significantly increases model capacity relative to computational cost, contributing to advancements in natural language processing. In detail, the router network transforms the semantic information of the input features into a weight vector and executes computations only on the top-k weighted experts. In the field of MTL, Ma et al. proposed Multi-gate Mixture-of-Experts (MMoE) [17] to reduce task interference and aggregate task-specific features through task-wise router networks. Recently, WEMoE [27] has been proposed as a layer-wise MoE framework that transforms modules from single-task models into experts and uses a learned router for input-dependent weight aggregation. However, it is limited by the requirement for trained single-task models, which restricts its applicability in scenarios where such models are unavailable or costly to obtain. Inspired by prior MoE approaches, the proposed MECO module employs a top-k and task-wise routing strategy from a shared encoder representation to handle multiple dense prediction tasks. To significantly reduce computational cost and compress disentangled representations from experts, we integrate VQ into the MoE framework.

2.3. Vector Quantization

Vector quantization [28] is a technique that approximates continuous high-dimensional vectors with a finite set of vectors, known as a codebook. In deep neural networks, VQ discretizes continuous feature representations by mapping them to the nearest codebook entries to facilitate compact representation learning. Oord et al. proposed the Vector Quantized-Variational Autoencoder (VQ-VAE) [25], which learns discrete latent representations that enable high-quality image reconstruction and extend efficiently to generative modeling. However, VQ can suffer from quantization error, which reduces representation quality, and from codebook collapse, where only a subset of codewords is utilized. To mitigate these issues, Oord et al. proposed a loss function based on stop-gradient and the straight-through estimator [29]. It incorporates a commitment loss to align codewords with encoder outputs and diversity regularization to encourage uniform codebook usage. Moreover, VQ compresses features into indices, reducing memory usage and enabling fast codebook access during inference. In addition, VQ-Prompt [30], which has achieved high performance in the field of class-incremental continual learning, represents task-specific prompts as discrete codes via vector quantization and optimizes them end-to-end. Inspired by prior work, the proposed MECO leverages multiple discretized codebooks to disentangle representations into highly relevant quantized vectors and trains the entire framework in an end-to-end manner.

3. Method

This section presents the architecture of the proposed MTL model and the MECO module and is organized as follows. Section 3.1 summarizes the baseline MoE frameworks for MTL. Section 3.2 illustrates the overall architecture of the proposed MTL model. Section 3.3 and Section 3.4 elaborate on the MECO module and task-specific vector quantization, and Section 3.5 defines the loss functions that jointly optimize the proposed model and its codebooks. The notations and subscripts used in this section are summarized in Table 1.

3.1. Preliminaries: Mixture-of-Experts

Before introducing our proposed MECO-based MTL model, we first review the standard MoE architecture [20] and its variants used in MTL. Let $x \in \mathbb{R}^{M \times D}$ be the input feature representation, where $M$ and $D$ denote the number of tokens and the latent dimension, respectively. An MoE model consists of a router network $f_r(\cdot)$ and a set of $N$ expert networks, denoted as $\mathcal{E} = \{E_1, \ldots, E_N\}$. The router network computes a weight vector $w \in \mathbb{R}^{N}$, assigning higher values to the most relevant experts based on the semantic features of $x$. Each expert is implemented as a fully connected layer that independently extracts features from the input $x$. The outputs of the experts are aggregated through a weighted summation as a linear combination with the router weights $w$ and can be formulated as follows:
$$U = f_{\mathrm{moe}}(x) = \sum_{n=1}^{N} w_n \cdot E_n(x), \qquad f_r(x) = w, \quad (1)$$
where $U$ denotes the final output of the MoE $f_{\mathrm{moe}}(\cdot)$, with $w_n$ representing the scalar weight for expert $E_n$.
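For concreteness, a minimal PyTorch-style sketch of the dense MoE computation in Equation (1) is given below; the module layout, tensor shapes, and the choice of linear experts are illustrative assumptions rather than a cited implementation.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """Minimal dense MoE: every expert is evaluated and mixed by router weights (Eq. (1))."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)              # f_r(.)
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_experts)    # E_1 ... E_N
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (M, D) tokens; w: (M, N) per-token router weights
        w = torch.softmax(self.router(x), dim=-1)
        expert_out = torch.stack([E(x) for E in self.experts], dim=-1)  # (M, D, N)
        return (expert_out * w.unsqueeze(1)).sum(dim=-1)                # weighted sum over experts
```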
To extend MoE to MTL, the Multi-gate Mixture-of-Experts framework [17] was proposed to extract task-specific features from shared experts. MMoE modifies the standard MoE by introducing task-specific routers $f_{r,t}(\cdot)$ for each task $t \in \{1, \ldots, T\}$, allowing the model to learn distinct routing strategies for different tasks. Subsequently, the Mixture of Low-rank Experts [18] was proposed, which applies LoRA to experts in order to reduce their memory overhead. In MLoRE, the input embeddings are first permuted into spatial representations $x \in \mathbb{R}^{D \times H' \times W'}$ and then passed through task-specific feature extractors $f_{e,t}(\cdot)$ implemented as a $1 \times 1$ convolution. Each expert is composed of a low-rank bottleneck structure with rank $r$ to reduce the number of trainable parameters and computational cost and is expressed as follows:
$$E_n(x) = B_n(A_n(x)), \quad (2)$$
where $A_n \in \mathbb{R}^{3 \times 3 \times D \times r}$ and $B_n \in \mathbb{R}^{1 \times 1 \times r \times D}$ denote $3 \times 3$ and $1 \times 1$ convolutions, respectively, and $r \ll D$. In this case, the multiply-accumulate operations (MACs) for each expert are computed as $H' \times W' \times r \times D \times (3^2 + 1)$. The MLoRE module $f_{\mathrm{mlore}}(\cdot)$ produces the final output for task $t$, defined as follows:
$$U_t = f_{\mathrm{mlore}}(x) = \sum_{n=1}^{N} w_{t,n} \cdot E_n(f_{e,t}(x)), \qquad f_{r,t}(x) = w_t. \quad (3)$$
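A low-rank expert of Equation (2) can be sketched as a 3 × 3 convolution down to rank $r$ followed by a 1 × 1 convolution back to $D$; the padding and bias-free settings below are assumptions for illustration, not details stated in the text.

```python
import torch
import torch.nn as nn

class LowRankExpert(nn.Module):
    """Low-rank expert E_n(x) = B_n(A_n(x)) as in Eq. (2)."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.A = nn.Conv2d(dim, rank, kernel_size=3, padding=1, bias=False)  # 3x3: D -> r
        self.B = nn.Conv2d(rank, dim, kernel_size=1, bias=False)             # 1x1: r -> D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H', W'); per-pixel cost ~ r * D * (3^2 + 1) MACs
        return self.B(self.A(x))
```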

3.2. The MTL Model Architecture

The proposed MTL model is based on the MMoE framework and incorporates the MECO module to disentangle latent representations, as illustrated in Figure 1. The overall architecture consists of three major components: a backbone network, latent representation disentanglement, and task-specific heads. First, we adopt a Vision Transformer (ViT) [31] backbone since its ability to model long-range dependencies facilitates learning shared representations across tasks [32,33]. The ViT backbone with $L$ layers takes an RGB image $I \in \mathbb{R}^{3 \times H \times W}$ as input, where $H$ and $W$ denote the height and width, respectively. Second, the latent representation disentanglement stage processes multi-scale features extracted from each ViT layer $l \in \{1, \ldots, L\}$ and consists of the MECO module $f_{\mathrm{meco}}(\cdot)$, task-specific vector quantization $f_{\mathrm{specific}}(\cdot)$, and a task-generic route $f_{\mathrm{generic}}(\cdot)$. Finally, each task-specific head $f_{h,t}(\cdot)$ generates a dense prediction $Y_{\mathrm{task}}$, which can be formulated as follows:
$$Y_{\mathrm{task}} = f_{h,t}\!\left( \sum_{l=1}^{L} \big( f_{\mathrm{meco}}(F_l(I)) + f_{\mathrm{specific}}(F_l(I)) + f_{\mathrm{generic}}(F_l(I)) \big) \right), \quad (4)$$
where $F_l(I)$ denotes the intermediate representation of the $l$-th backbone layer.
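The layer-wise aggregation of Equation (4) can be organized as in the sketch below; the callables (backbone layers, `meco`, `specific_vq`, `generic_route`, task heads) are placeholders for the modules described in the remainder of this section, and their interfaces are assumptions.

```python
def mtl_forward(image, backbone_layers, meco, specific_vq, generic_route, heads):
    """Sketch of Eq. (4): fuse the three routes at every backbone layer, accumulate
    the fused features over layers, and decode each task with its own head."""
    fused = {t: 0.0 for t in heads}           # per-task running sum over layers
    x = image
    for layer in backbone_layers:             # l = 1 ... L
        x = layer(x)                          # F_l(I)
        for t in heads:
            fused[t] = fused[t] + meco(x, t) + specific_vq(x, t) + generic_route(x)
    return {t: head(fused[t]) for t, head in heads.items()}   # Y_task = f_{h,t}(...)
```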
Latent representation disentanglement aims to decompose the intermediate representation $F_l(I) = x_l \in \mathbb{R}^{D \times H' \times W'}$ through three distinct routes. Note that $t \in \{1, \ldots, T\}$ and $n \in \{1, \ldots, N\}$ denote the task and expert indices, and $s$ and $g$ indicate task-specific and task-generic components, respectively. For clarity, we omit the backbone layer index $l$ in what follows. The feature extractors $\{f_{e,t}^{s}(\cdot)\}_{t=1}^{T}$ and $f_{e}^{g}(\cdot)$, implemented as $1 \times 1$ convolutions, compute the $T$ task-specific representations $z_t^s(x)$ and a task-generic representation $z^g(x)$, respectively. Generally, the representation $z(x)$ is defined as
$$z(x) = f_e(x). \quad (5)$$
The first and second routes, $f_{\mathrm{meco}}(\cdot)$ and $f_{\mathrm{specific}}(\cdot)$, take $z_t^s(x)$ as input and map it into quantized latent spaces using codebooks that represent task-generic and task-specific semantics. The third route, $f_{\mathrm{generic}}(\cdot)$, directly passes $z^g(x)$ as a residual connection, preserving generic semantic information. This residual path also mitigates potential information loss from VQ by providing an uncompressed representation to the task-specific head. The disentangled representations from the three routes are fused into a task-specific feature $U_t^s$, computed as follows:
$$U_t^s = \sum_{n=1}^{N} w_{t,n} \cdot \mathrm{Quantize}_n(z_t^s(x)) + \mathrm{Quantize}_t(z_t^s(x)) + z^g(x), \qquad f_{r,t}^{s}(z_t^s(x)) = w_t, \quad (6)$$
where $\mathrm{Quantize}_n(\cdot)$ and $\mathrm{Quantize}_t(\cdot)$ represent the VQ functions using the $n$-th expert codebook and the task-specific codebook, respectively. We applied batch normalization after the MECO term and after the summation to ensure training stability and scale consistency.
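A sketch of the per-layer fusion in Equation (6) follows; the exact placement of the two batch-normalization layers (one after the MECO term, one after the summation) reflects our reading of the text, and the `meco` and `task_vq` submodules stand in for the components of Sections 3.3 and 3.4.

```python
import torch
import torch.nn as nn

class TaskRouteFusion(nn.Module):
    """Sketch of Eq. (6): MECO route + task-codebook route + generic residual route."""
    def __init__(self, dim: int, meco: nn.Module, task_vq: nn.Module):
        super().__init__()
        self.meco, self.task_vq = meco, task_vq
        self.bn_meco = nn.BatchNorm2d(dim)   # normalizes the MECO term
        self.bn_out = nn.BatchNorm2d(dim)    # normalizes the fused task feature

    def forward(self, z_task: torch.Tensor, z_generic: torch.Tensor) -> torch.Tensor:
        # z_task = z_t^s(x), z_generic = z^g(x); both (B, D, H', W')
        u = self.bn_meco(self.meco(z_task)) + self.task_vq(z_task) + z_generic
        return self.bn_out(u)                # U_t^s
```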

3.3. Mixture-of-Expert Codebooks

The MECO module is designed to extract task-specific features from $N$ task-generic expert codebooks. The detailed architecture is illustrated in Figure 2. MECO follows a top-$k$ MoE structure, consisting of task-specific routers $\{f_{r,t}^{s}(\cdot)\}_{t=1}^{T}$ and a set of expert codebooks $\mathcal{E} = \{E_n^g\}_{n=1}^{N}$. The task-specific router takes the task-specific representation $z_t^s(x)$ as input and produces a weight vector $w_t \in \mathbb{R}^{N}$. For each selected expert index $n$, VQ is performed on $z_t^s(x)$ using the corresponding expert codebook $E_n^g$. Each expert codebook defines an independently distributed discrete latent space that is shared across tasks, promoting the disentanglement of latent representations. Finally, the task-specific feature $Q_t^s$ is computed as a weighted summation of the quantized outputs from the $k$ selected expert codebooks, using the weights $w_t$:
$$Q_t^s = \sum_{n \in \mathcal{K}} w_{t,n} \cdot \mathrm{Quantize}_n(z_t^s(x)), \quad (7)$$
where $\mathcal{K}$ denotes the set of top-$k$ selected indices. Note that this is expressed differently from the first term in Equation (6), as the dropped experts are masked out from the $N$ experts.
A task-specific router processes the task-specific representation $z_t^s(x) \in \mathbb{R}^{D \times H' \times W'}$ to compute the top-$k$ expert selections along with their corresponding router weights $w_t$. Spatial attention highlights important regions in the representation, which is then restructured into a condensed form for expert selection. This is followed by a $1 \times 1$ convolution, flattening, and spatial average pooling to produce task logits $\omega_t \in \mathbb{R}^{N}$ for each expert codebook. The top-$k$ function retains the $k$ largest logits and masks the dropped logits with $-\infty$, which can be written as follows:
$$\tilde{\omega}_{t,n} = \begin{cases} \omega_{t,n}, & \text{if } n \in \mathcal{K}, \\ -\infty, & \text{otherwise}, \end{cases} \quad (8)$$
where $\tilde{\omega}_t$ denotes the masked logit vector for task $t$, which is subsequently normalized by softmax to produce a probability distribution:
$$w_{t,n} = \frac{\exp(\tilde{\omega}_{t,n})}{\sum_{m \in \mathcal{K}} \exp(\tilde{\omega}_{t,m})}. \quad (9)$$
The router activates only the selected top-k experts during the forward pass, thereby reducing the computation of the dropped routes and promoting sparsity. This conditional execution mechanism ensures efficiency by selectively activating task-relevant experts, leading to reduced resource consumption in both training and inference.
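The task-specific router of Equations (8) and (9) can be sketched as below; the spatial-attention block is simplified to a sigmoid-gated 1 × 1 convolution, which is an assumption, while the top-$k$ masking with $-\infty$ and the softmax follow the equations.

```python
import torch
import torch.nn as nn

class TaskRouter(nn.Module):
    """Sketch of the task-specific top-k router (Eqs. (8)-(9))."""
    def __init__(self, dim: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.attn = nn.Sequential(nn.Conv2d(dim, 1, 1), nn.Sigmoid())  # simplified spatial attention
        self.proj = nn.Conv2d(dim, num_experts, 1)                     # 1x1 conv to expert logits

    def forward(self, z: torch.Tensor):
        # z: (B, D, H', W') -> task logits omega_t: (B, N) via spatial average pooling
        logits = self.proj(z * self.attn(z)).flatten(2).mean(dim=-1)
        topk_idx = logits.topk(self.k, dim=-1).indices
        masked = torch.full_like(logits, float("-inf"))
        masked = masked.scatter(-1, topk_idx, logits.gather(-1, topk_idx))  # Eq. (8)
        weights = torch.softmax(masked, dim=-1)                             # Eq. (9); dropped experts get zero weight
        return weights, topk_idx
```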
Expert codebooks quantize the representations by mapping them to the closest entries in the $k$ codebooks selected by the task-specific router. Given the task-specific representation $z_t^s(x) \in \mathbb{R}^{D \times H' \times W'}$, we express it as a set of latent vectors $\{z_m\}_{m=1}^{M}$ for simplicity, where $z_m \in \mathbb{R}^{D}$ and $M = H' \times W'$. Each expert codebook $E_n^g \in \mathbb{R}^{K \times D}$ contains $K$ codewords $\{e_i\}_{i=1}^{K}$, with $e_i \in \mathbb{R}^{D}$ representing discretized latent vectors. For the selected expert codebook, VQ is performed by a nearest-neighbor lookup between $z_m$ and $\{e_i\}_{i=1}^{K}$, which can be formulated as follows:
$$z_m^q = e_i, \quad \text{where } i = \arg\min_j \lVert z_m - e_j \rVert_2, \quad (10)$$
where the superscript $q$ in $z_m^q$ denotes the quantized version of the latent vector $z_m$, and the index $j$ refers to the codewords used for the nearest-neighbor search. This hard assignment can be represented as the following one-hot posterior categorical distribution:
$$q(z_m = e_i \mid x) = \begin{cases} 1, & \text{if } i = \arg\min_j \lVert z_m - e_j \rVert_2, \\ 0, & \text{otherwise}, \end{cases} \quad (11)$$
where $q(z_m = e_i \mid x)$ denotes the deterministic probability that the latent vector $z_m$, extracted from input $x$, is hard-assigned to the codeword $e_i$. Using the quantized results $z_m^q$ from all $M$ positions, we define the VQ function for the $n$-th expert codebook, $\mathrm{Quantize}_n(\cdot)$, as follows:
$$z_{t,n}^{q}(x) = \mathrm{Quantize}_n(z_t^s(x)) = \mathrm{Concat}\big( \{ z_{t,n,m}^{q} \}_{m=1}^{M} \big), \quad (12)$$
where $z_{t,n}^{q}(x) \in \mathbb{R}^{D \times H' \times W'}$ denotes the quantized representation for task $t$ and expert $n$, and $\mathrm{Concat}(\cdot)$ refers to the operation that aggregates the quantized latent vectors according to their spatial positions. The MACs for the codebook lookup are computed as $H' \times W' \times K \times D$, based on the dot product between $z$ and $e$. Accordingly, VQ has a lower computational cost than LoRA when $K < 10r$. MLoRE employs expert ranks ranging from 16 to 240 in steps of 16. In our experiments with $K = 256$, VQ reduces per-expert computation by 20% to 89% in all cases except when $r = 16$.
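The nearest-neighbor lookup of Equations (10)-(12), together with the straight-through gradient copy described in Section 3.5, can be sketched as follows; the codebook initialization range, tensor layout, and the `torch.cdist` distance call are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ExpertCodebook(nn.Module):
    """Sketch of Quantize_n(.): hard assignment of each spatial vector to its nearest codeword.
    Lookup cost per position is ~K*D MACs versus ~10*r*D for a rank-r LoRA expert, so with
    K = 256 it is cheaper whenever r > 25 (e.g., r = 240 gives roughly 89% fewer MACs)."""
    def __init__(self, num_codewords: int = 256, dim: int = 256):
        super().__init__()
        self.codebook = nn.Parameter(torch.empty(num_codewords, dim).uniform_(-3.0, 3.0))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, D, H', W') -> flatten to (B*H'*W', D) latent vectors z_m
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)
        idx = torch.cdist(flat, self.codebook).argmin(dim=-1)         # Eq. (10): nearest codeword index
        z_q = self.codebook[idx].reshape(B, H, W, D).permute(0, 3, 1, 2)
        # straight-through estimator: forward returns z_q, gradients flow back to z (Section 3.5)
        return z + (z_q - z).detach()
```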

3.4. Task-Specific Vector Quantization

Task-specific vector quantization is designed to learn a task-specialized discrete latent space represented by $E_t^s \in \mathbb{R}^{K \times D}$. Following the VQ mechanism of the expert codebook, each task codebook quantizes the task-specific representation $z_t^s(x)$ and follows the same formulation as in Equations (10) and (11). The VQ function for the $t$-th task codebook, $\mathrm{Quantize}_t(\cdot)$, is formulated as follows:
$$\mathrm{Quantize}_t(z_t^s(x)) = \mathrm{Concat}\big( \{ z_{t,m}^{q} \}_{m=1}^{M} \big). \quad (13)$$
To complement the features necessary for solving each task, task codebooks construct a distinct quantized latent space that is separated from the task-generic MECO to facilitate the disentanglement of task-specific representations.

3.5. Loss Function

To jointly optimize task performance and discrete representation learning in an end-to-end manner, we define a composite loss function consisting of VQ losses and dense prediction task-specific losses. Since the $\arg\min$ operation used in VQ is non-differentiable, the loss $\mathcal{L}_{\mathrm{VQ}}$ is decomposed into two terms: a dictionary loss and a commitment loss. Both losses are defined based on the L2 norm between the task-specific representation $z_m$ and the quantized representation $e_i$. The dictionary loss updates the codebook by stopping gradients from $z_m$, whereas the commitment loss updates the feature extractor $f_e(\cdot)$ by stopping gradients from $e_i$, as follows:
$$\mathcal{L}_{\mathrm{VQ}} = \underbrace{\lVert \mathrm{sg}[z_m] - e_i \rVert_2^2}_{\text{Dictionary loss}} + \underbrace{\beta \lVert z_m - \mathrm{sg}[e_i] \rVert_2^2}_{\text{Commitment loss}}, \quad (14)$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, and $\beta$ is a weighting factor that balances the commitment loss. The dictionary loss encourages each expert and task codebook to learn task-generic and task-specific latent spaces, respectively. The commitment loss encourages the task-specific representation to stay close to its assigned codewords, promoting consistent codeword usage and preventing the under-utilization that can lead to codebook collapse. During backpropagation, the straight-through estimator is adopted to address the non-differentiability of VQ. It approximates gradients by directly copying them from the quantized vector $e_i$ to the extractor output $z_m$. By jointly optimizing these two VQ losses, the codebook and feature extractor align the learned embedding space, thereby mitigating the risk of the curse of dimensionality.
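Equation (14) translates directly into a few lines using stop-gradient (`detach`); the mean-squared form below averages over spatial positions, which is a common VQ-VAE convention and an assumption rather than a detail stated in the text.

```python
import torch
import torch.nn.functional as F

def vq_loss(z: torch.Tensor, z_q: torch.Tensor, beta: float = 0.25) -> torch.Tensor:
    """Sketch of Eq. (14): the dictionary term updates only the codewords (z is detached),
    and the commitment term updates only the feature extractor (the codeword is detached)."""
    dictionary = F.mse_loss(z_q, z.detach())        # ||sg[z_m] - e_i||^2
    commitment = F.mse_loss(z, z_q.detach())        # ||z_m - sg[e_i]||^2
    return dictionary + beta * commitment
```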
The total loss is composed of the expert codebook VQ loss ($\mathcal{L}_{\mathrm{VQ}}^{g}$), the task codebook VQ loss ($\mathcal{L}_{\mathrm{VQ}}^{s}$), and the dense prediction task losses. It is defined as follows:
$$\mathcal{L}_{\mathrm{total}} = \sum_{t \in \mathrm{task}} \left( \sum_{n \in \mathcal{K}} \mathcal{L}_{\mathrm{VQ}}^{g} + \mathcal{L}_{\mathrm{VQ}}^{s} + \lambda_t \mathcal{L}_t \right), \quad (15)$$
where both $\mathcal{L}_{\mathrm{VQ}}^{g}$ and $\mathcal{L}_{\mathrm{VQ}}^{s}$ are the sum of $\mathcal{L}_{\mathrm{VQ}}$ over the $M = H' \times W'$ spatial locations, $\mathcal{L}_{\mathrm{VQ}}^{g}$ is computed over the set of expert codebooks $\mathcal{K}$ selected for task $t$, and $\lambda_t$ denotes the weighting factor for the corresponding task loss.
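A minimal sketch of how the composite loss in Equation (15) could be accumulated per task is given below; the dictionary-based containers keyed by task are illustrative assumptions.

```python
def total_loss(vq_expert_losses, vq_task_losses, task_losses, task_weights):
    """Sketch of Eq. (15): per task, sum the expert-codebook VQ losses over the selected
    experts in K, add the task-codebook VQ loss, and add the weighted task loss."""
    loss = 0.0
    for t, l_task in task_losses.items():
        loss = loss + sum(vq_expert_losses[t])       # sum over n in K of L_VQ^g
        loss = loss + vq_task_losses[t]              # L_VQ^s
        loss = loss + task_weights[t] * l_task       # lambda_t * L_t
    return loss
```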

4. Results

4.1. Real-World Driving Dataset

We collected a real-world driving dataset for the development of visual perception and collision avoidance systems in embedded environments, as illustrated in Figure 3. The data was acquired from driving scenarios within three different mountainous golf courses. These golf courses pose unique challenges, with narrow roads and nearby obstacles such as trees and fences. The driving scenes also feature slopes, uneven surfaces, and sand or water hazards that are uncommon in typical road scenes. These hazard regions are challenging due to their irregular appearance and semantic ambiguity. Furthermore, the vehicle is equipped with a compact embedded system, necessitating lightweight models capable of real-time image processing. As a result, there is a demand for efficient deep learning models that can perform dense prediction tasks, such as monocular depth estimation and semantic segmentation, under real-world constraints. The objective of this research is to develop a large-scale teacher model as an intermediate step toward a lightweight MTL framework optimized for deployment on embedded platforms.
The data collection platform consists of a multi-sensor system including an Intel RealSense D435i RGB-D camera (Intel, Santa Clara, CA, USA), an Ouster OS1 light detection and ranging (LiDAR) sensor (Ouster, San Francisco, CA, USA), GPS, and an NVIDIA Jetson AGX Orin (NVIDIA, Santa Clara, CA, USA). All sensors are synchronized via the robot operating system (ROS), and front-view driving data is recorded at 10 Hz. To ensure environmental diversity, the data was collected across various seasons and weather conditions, including scenarios with preceding vehicles and multiple pedestrians. The camera and LiDAR were installed to capture RGB images at a resolution of 1280 × 720 and corresponding 3D point clouds. Sparse ground-truth depth maps were obtained using a camera–LiDAR calibration [34]. Semantic segmentation includes five classes: background, fairway, road, hazard, and obstacle. Hazards refer to regions such as lakes and sand traps, while obstacles include humans, vehicles, trees, and signposts.
Annotation of the semantic segmentation task involves two main challenges. First, there is the rarity of critical hazard objects; classes such as person and hazard appear infrequently in driving scenes. Second, the pixel-level annotation cost is substantial, as it requires manual refinement of meaningful regions for each image. To address these limitations in constructing the segmentation dataset, we curated the training and test sets according to the following protocol: (1) maintained a clear separation of driving sequences between the training and test sets; (2) included data with varying weather conditions and pedestrian scenarios; (3) incorporated hazard classes, such as bunkers and lakes, which are rare in the driving scenes; (4) selected samples maintaining a uniform distribution of distances to preceding vehicles. Initial semantic labels were generated using the Segment Anything Model (SAM) [5], followed by manual refinement for accuracy. Finally, the training and test sets consist of 355 and 96 samples, respectively, each comprising an image, a ground-truth depth map, and a semantic label.

4.2. Experiment Environments

Experiments were conducted on a workstation equipped with an Intel Core i9-10940X CPU, 64 GB DDR4 RAM, and dual NVIDIA GeForce RTX 3090 Ti GPUs. The proposed MTL model was implemented using Python 3.7 and PyTorch 1.10. To reduce computational costs and enable further model compression, both input RGB images and ground-truth annotations were resized to a resolution of 640 × 352 during training and evaluation. We used the Adam optimizer with a learning rate of $1 \times 10^{-5}$ and a weight decay of $1 \times 10^{-6}$. The batch size was set to 2 for all experiments. Due to the small dataset size and batch size, we set the number of training epochs to 500 and report the quantitative results of the model that achieved the best average performance across tasks. The ViT backbone used in the proposed model was initialized with weights pretrained on the ImageNet-1K dataset for 20K training steps. The codebooks were initialized using a uniform random distribution in the range $\mathcal{U}(-3, 3)$, which ensures each embedding vector starts with equal variance and avoids bias toward any particular direction in the embedding space. As a baseline, we adopted Mixture of Low-rank Experts, which was trained under the same experimental settings.

4.3. Hyperparameters and Evaluation Metrics

The proposed model incorporates several hyperparameters related to the MoE framework and VQ. The number of experts and the top-$k$ selection parameter were set to 15 and 9, respectively, consistent with the MLoRE baseline. These settings are maintained in our main experiments, and their detailed effects are discussed in the ablation study (Section 4.6). The number of latent vectors in each expert codebook was set to $K = 256$. The commitment loss weighting factor $\beta = 0.25$ was adopted following common VQ-VAE practice, balancing the commitment and dictionary losses while ensuring stable codebook updates. For semantic segmentation, we report the mean intersection over union (mIoU) as the primary evaluation metric, along with class-wise IoU. For monocular depth estimation, we adopt four commonly used error metrics: root mean squared error (RMSE), RMSE log, absolute relative error (Abs Rel), and squared relative error (Sq Rel). To assess model efficiency, we report the number of parameters (#Params), floating point operations (FLOPs), and frames per second (FPS) as computational cost indicators.
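For reference, the sketch below gives standard definitions of the reported metrics; it is not the authors' evaluation script, and details such as the valid-pixel mask and the handling of absent classes are assumptions.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """RMSE, RMSE log, Abs Rel, and Sq Rel over valid (non-zero) ground-truth pixels."""
    mask = gt > eps
    pred, gt = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred + eps) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)
    return rmse, rmse_log, abs_rel, sq_rel

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 5) -> float:
    """Mean intersection over union across the five semantic classes present in the labels."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```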

4.4. Experimental Results on Real-World Dataset

To evaluate the effectiveness of the proposed MTL model on a real-world driving dataset, we conducted a quantitative analysis. Table 2 presents the test results of both single-task and MTL models on the dense prediction tasks of depth estimation and semantic segmentation. In general, MoE-based MTL models outperform their single-task counterparts and demonstrate the advantage of producing multiple task outputs with a unified model. The Mamba-based MTL models generally exhibit lower performance, showing inferior results in both depth estimation and semantic segmentation compared to the MoE-based MTL models. This suggests that the MoE-based approach is more robust to negative transfer by deriving task-specific interpretations of the latent representation from experts. Compared to the baseline MLoRE, the proposed MECO achieves a 0.4% higher mIoU in semantic segmentation. Additionally, MECO shows improved performance across all semantic classes except for the background, while maintaining depth estimation performance close to that of the baseline. These results suggest that the codebooks disentangle representations, thereby enhancing classification performance through distinct latent features.

4.5. Computational Complexity

Table 3 compares the computational complexity of the baseline model and the proposed MTL model. By replacing expert modules with codebook-based VQ, MECO achieves significant efficiency gains over the LoRA-based baseline. Specifically, MECO reduces the number of parameters and FLOPs by 18.33% and 28.83%, respectively, compared to MLoRE. The large reduction in FLOPs is attributed to the nature of VQ, where the representation is obtained through an arg min lookup operation rather than dense matrix multiplication, as in LoRA. Moreover, real-time inference evaluation under our experimental setup shows a practical improvement of 3.59 FPS. These results demonstrate that VQ enables substantial model compression with minimal compromise in performance.

4.6. Ablation Study

We conduct an ablation study to investigate the effect of the model architecture and hyperparameters on performance. Table 4 reports the quantitative performance and computational complexity with respect to the number of expert codebooks $N$ in MECO. The number of activated experts was set to 60% of $N$ using top-$k$ routing, following the baseline configuration to ensure a fair comparison. As the number of experts increases, we observe consistent improvements in both segmentation performance and depth estimation accuracy. Notably, the FLOPs remain nearly constant, indicating that VQ does not significantly contribute to the actual computational load during inference. Since the codebook vectors are learnable parameters, the total number of parameters increases proportionally with the number of experts $N$.
Table 5 presents the performance of the proposed model on dense prediction tasks under different routing strategies and codebook update mechanisms. The proposed MECO model aims at a lightweight design through vector quantization, and the routing strategy and codebook update method were selected based on empirical experimental results. Soft routing refers to a probabilistic mixture of expert outputs, where all experts are partially activated with weights derived from a softmax distribution. Experimental results show that the performance gap in semantic segmentation between the two routing strategies is merely 0.025%, whereas the difference in depth estimation reaches approximately 8%, demonstrating the superiority of top-k routing for this task. In terms of model efficiency, top-k routing is also more appropriate than soft routing, as it activates only a subset of experts rather than all of them. In soft routing, all experts receive non-zero weights, so even task-irrelevant experts contribute to the output. These low-weight but irrelevant contributions can introduce noise into the aggregated features and lead to degraded performance, especially in tasks such as depth estimation where feature precision is critical.
On the other hand, the EMA-based method updates the codebook vectors using an exponential moving average (EMA) of encoder outputs, rather than optimizing them as learnable parameters. Since EMA does not require backpropagation through the codebook, the total number of parameters in MECO is reduced to 389.406 M. However, EMA does not contribute to improving inference speed, and its performance across all tasks is inferior to that of the dictionary loss-based approach. The performance gap arises from EMA not directly optimizing encoder–codeword alignment, leading to slower adaptation and less task-specific specialization.

4.7. Qualitative Results

Figure 4 presents qualitative comparisons between the proposed MTL model and the baseline model across a set of challenging cases. Each column shows the input RGB image, the semantic segmentation mask, and the predicted depth map, where task predictions are shown for both the baseline and the proposed MTL model. We visualized cases involving semi-transparent regions, an extremely rare object (a truck), distant hazards, and occlusion caused by raindrops. Regions of interest are highlighted using white bounding boxes to emphasize critical areas for comparison. In Case 1, MECO demonstrates superior segmentation of semi-transparent obstacle regions, accurately capturing their boundaries and shapes. In Case 2, MECO produces more consistent and spatially coherent depth estimates in visually ambiguous landscapes. In Case 3, both the segmentation mask and depth map of a truck show greater consistency in MECO’s output compared to the baseline, suggesting better extraction of task-generic representations. Cases 4 and 5 illustrate rainy driving conditions with image blur, where MECO demonstrates robustness by maintaining high prediction quality under low visibility.

5. Conclusions

In this paper, we presented Mixture-of-Expert Codebooks, a novel multi-task learning framework that leverages vector quantization for efficient dense prediction in embedded environments. MECO introduces expert codebooks as discrete latent spaces, which effectively disentangle task-specific and task-generic features with minimal computational overhead. By replacing traditional expert networks with quantized codebooks, the proposed model significantly reduces computational costs while maintaining competitive performance across tasks. Our experiments on a real-world driving dataset demonstrate that MECO achieves improved segmentation accuracy and depth estimation performance close to that of the baseline model. Additionally, MECO shows strong robustness under challenging conditions such as low visibility and ambiguous landscapes. However, MECO’s generalizability may be limited by the dataset size, modest performance gains, and lack of external benchmarks. We plan to address these issues through greater dataset diversity, improved robustness, and external evaluations. We also aim to explore a wider range of dense prediction tasks, such as instance segmentation and surface normal estimation. Furthermore, we will investigate deployable model compression techniques—including student model distillation and quantization-aware training—leveraging expert codebooks pretrained on large-scale MTL models.

Author Contributions

Conceptualization, G.H.; methodology, G.H.; software, G.H.; validation, G.H.; writing—original draft preparation, G.H. and S.J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2024-00439292).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, S.; Liu, L.; Tang, J.; Yu, B.; Wang, Y.; Shi, W. Edge computing for autonomous driving: Opportunities and challenges. Proc. IEEE 2019, 107, 1697–1716. [Google Scholar] [CrossRef]
  2. Zhang, F.S.; Ge, D.Y.; Song, J.; Xiang, W.J. Outdoor scene understanding of mobile robot via multi-sensor information fusion. J. Ind. Inf. Integr. 2022, 30, 100392. [Google Scholar] [CrossRef]
  3. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  4. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
  5. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  6. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10371–10381. [Google Scholar]
  7. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
  8. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.; Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7310–7311. [Google Scholar]
  9. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609. [Google Scholar] [CrossRef]
  10. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098. [Google Scholar] [CrossRef]
  11. Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Dai, D.; Gool, L.V. Revisiting Multi-Task Learning in the Deep Learning Era. arXiv 2020, arXiv:2004.13379. [Google Scholar]
  12. Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar]
  13. Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003. [Google Scholar]
  14. Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 675–684. [Google Scholar]
  15. Vandenhende, S.; Georgoulis, S.; Van Gool, L. Mti-net: Multi-scale task interaction networks for multi-task learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 527–543. [Google Scholar]
  16. Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633. [Google Scholar] [CrossRef]
  17. Ma, J.; Zhao, Z.; Yi, X.; Chen, J.; Hong, L.; Chi, E.H. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1930–1939. [Google Scholar]
  18. Yang, Y.; Jiang, P.T.; Hou, Q.; Zhang, H.; Chen, J.; Li, B. Multi-task dense prediction via mixture of low-rank experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27927–27937. [Google Scholar]
  19. Kim, I.; Lee, J.; Kim, D. Learning mixture of domain-specific experts via disentangled factors for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 1148–1156. [Google Scholar]
  20. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  21. Ye, H.; Xu, D. Taskexpert: Dynamically assembling multi-task representations with memorial mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21828–21837. [Google Scholar]
  22. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  23. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143. [Google Scholar]
  24. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160. [Google Scholar]
  25. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6309–6318. [Google Scholar]
  26. Ye, H.; Xu, D. Inverted pyramid multi-task transformer for dense scene understanding. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 514–530. [Google Scholar]
  27. Tang, A.; Shen, L.; Luo, Y.; Yin, N.; Zhang, L.; Tao, D. Merging multi-task models via weight-ensembling mixture of experts. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 47778–47799. [Google Scholar]
  28. Gray, R. Vector quantization. IEEE ASSP Mag. 1984, 1, 4–29. [Google Scholar] [CrossRef]
  29. Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432. [Google Scholar] [CrossRef]
  30. Jiao, L.; Lai, Q.; Li, Y.; Xu, Q. Vector quantization prompting for continual learning. Adv. Neural Inf. Process. Syst. 2024, 37, 34056–34076. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  32. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  33. Fnu, N.; Bansal, A. Understanding the architecture of vision transformer and its variants: A review. In Proceedings of the 2024 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR), Muscat, Oman, 14–15 May 2024; pp. 1–6. [Google Scholar]
  34. Tsai, D.; Worrall, S.; Shan, M.; Lohr, A.; Nebot, E. Optimising the selection of samples for robust lidar camera calibration. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2631–2638. [Google Scholar]
  35. Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326. [Google Scholar]
  36. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  37. Lin, B.; Jiang, W.; Chen, P.; Zhang, Y.; Liu, S.; Chen, Y.C. MTMamba: Enhancing multi-task dense scene understanding by mamba-based decoders. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 314–330. [Google Scholar]
  38. Lin, B.; Jiang, W.; Chen, P.; Liu, S.; Chen, Y.C. MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders. IEEE Trans. Pattern Anal. Mach. Intell. 2025, Early Access. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the proposed MTL model.
Figure 2. Architecture of the proposed MECO module.
Figure 3. Overview of a real-world driving dataset collected in mountainous golf courses, illustrating diverse conditions (top), GPS trajectories (bottom left), and camera–LiDAR calibration (bottom right).
Figure 4. Visualization of the semantic segmentation and depth estimation results.
Table 1. List of notations used in this paper.

| Notation | Description |
| I | Input RGB image |
| Y | Dense prediction output |
| H, W | Height and width of the input image |
| H′, W′ | Height and width of the feature map |
| M | Number of tokens (M = H′ × W′) |
| l | ViT layer index |
| L | Number of ViT layers |
| D | Feature dimension |
| T | Number of tasks |
| N | Number of experts |
| t | Task index |
| n | Expert index |
| s | Task-specific component |
| g | Task-generic component |
| q | Quantized vector |
| m | Spatial position index (VQ) |
| z(x) | Feature representation |
| z | Latent vector |
| K | Number of codewords |
| e_i | i-th codeword |
| E | Codebook |
| 𝒦 | Set of selected expert indices |
| k | Number of top-k selection |
| w | Router weight vector |
| U | Aggregated features |
| Q | Aggregated VQ features |
| Quantize_n(·) | VQ function of expert codebook |
| Quantize_t(·) | VQ function of task codebook |
| f | Functions of modules |
| e | Extractor |
| r | Task-specific router |
| h | Task-specific head |
| ℒ | Loss function |
Table 2. Depth estimation and semantic segmentation performance on the real-world driving dataset. ↓ and ↑ denote whether a lower or higher value is better for each task, respectively.

| Task | Method | RMSE ↓ | RMSE Log ↓ | Abs Rel ↓ | Sq Rel ↓ | Background ↑ | Fairway ↑ | Hazard ↑ | Road ↑ | Obstacle ↑ | mIoU (%) ↑ |
| Single | BTS [35] | 2.758 | 0.195 | 0.129 | 0.541 | – | – | – | – | – | – |
| Single | UNetFormer [36] | – | – | – | – | 82.37 | 93.86 | 65.79 | 95.83 | 52.61 | 78.09 |
| Multiple | MTMamba [37] | 2.185 | 0.181 | 0.115 | 0.453 | 82.99 | 93.12 | 50.66 | 94.87 | 54.21 | 75.17 |
| Multiple | MTMamba++ [38] | 2.850 | 0.210 | 0.129 | 0.579 | 84.03 | 93.93 | 56.31 | 95.56 | 55.46 | 77.06 |
| Multiple | MLoRE [18] | 2.198 | 0.157 | 0.101 | 0.358 | 85.65 | 94.16 | 61.81 | 96.50 | 59.99 | 79.63 |
| Multiple | MECO (ours) | 2.224 | 0.159 | 0.102 | 0.367 | 85.62 | 94.23 | 62.90 | 96.71 | 60.68 | 80.03 |
Table 3. Comparison of computational cost.

| Method | FLOPs (T) | #Params (M) | FPS |
| MLoRE [18] | 1.453 | 509.58 | 7.61 |
| MECO (ours) | 1.034 | 416.14 | 11.20 |
Table 4. The effect of the number of expert codebooks.

| N | k | Abs Rel | mIoU (%) | FLOPs (T) | #Params (M) |
| 5 | 3 | 0.110 | 79.57 | 1.034 | 400.35 |
| 10 | 6 | 0.108 | 79.60 | 1.034 | 408.25 |
| 15 | 9 | 0.102 | 80.03 | 1.034 | 416.14 |
| 20 | 12 | 0.099 | 79.68 | 1.034 | 422.46 |
Table 5. The effect of the routing strategy and codebook update method.

| Routing | Codebook Update | Abs Rel | mIoU (%) |
| Soft | EMA | 0.112 | 79.17 |
| Top-k | EMA | 0.103 | 79.15 |
| Top-k | Dictionary loss | 0.102 | 80.03 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
