1. Introduction
The articulatory system consists of the temporomandibular joint (TMJ), intra-articular discs, jaw muscles, and occlusion. The TMJ, which connects the upper and lower jaws [
1], is a paired and symmetrical joint, whose coordinated functionality on both sides of the mandible is essential for balanced jaw movement. The superior compartment separates the glenoid fossa of the temporal bone from the disc, while the inferior compartment separates the disc from the mandibular condyle (see
Figure 1). This bilateral and symmetric arrangement enables speech, chewing, and both verbal and emotional expression [
2]. Temporomandibular disorders (TMD) are conditions that affect the temporomandibular joint and associated structures. Even though the TMJs are structurally symmetrical, pathology often leads to functional asymmetry, producing alterations in disc positioning, joint space, and condylar mobility. These imbalances may arise from any stimulus impacting a component of the articulatory system, potentially influencing the entire system. The most common disorders include disc alterations, joint pain, joint dysfunction, and degenerative joint disease [
3]. TMDs are among the most prevalent pathologies, with a higher incidence in the Americas [
4], and primarily affect women [
5]. The pain associated with TMD is often comparable in intensity and persistence to cervical pain, back pain, and headaches, significantly impacting the quality of life of patients. Consequently, the observation and assessment of the TMJ have become essential in the orofacial field.
One approach for evaluating the TMJ involves measuring the joint space width through imaging techniques, with particular attention to detecting reductions. In this approach, it is essential to highlight components such as the mandibular condyle, the joint space, and the glenoid fossa of the temporal bone. Imaging techniques employed in the examination of the TMJ include computed tomography (CT), cone beam computed tomography (CBCT), and magnetic resonance imaging (MRI), with the latter being the most frequently utilized for evaluating intra-articular processes [
7]. These techniques yield static images, which present a significant limitation in evaluating the dynamic movements of the TMJ, especially during real-time assessments. To address this limitation, ultrasonography offers an alternative approach, as it is both readily accessible and cost-effective for evaluating the TMJ [
2]. Despite the advancements offered by ultrasonography, evaluation of the TMJ continues to rely on manual measurements, which are inherently prone to error and sometimes time-consuming. Consequently, there is a growing interest in exploring alternative approaches to automate these processes and improve diagnostic precision in TMJ assessments. Artificial intelligence has proven to be an effective tool for optimizing this process. In biomedical imaging, deep learning has made significant progress, allowing models to handle tasks such as classification, detection, tracking, and segmentation [
8].
Several studies have applied deep learning techniques to the segmentation of TMJ components. Most of these works focus on MRI images [
9], leveraging their high resolution and ability to visualize soft tissues. For instance, Ito et al. [
10] developed 3DiscNet for the automated detection and segmentation of the TMJ disc and compared it with two other architectures, U-Net and SegNet. The highest Dice coefficients achieved by 3DiscNet and SegNet were 0.70 and 0.74, respectively. Similarly, Kin et al. [
11] proposed a deep learning-based algorithm to predict TMJ disc perforation, employing a multilayer perceptron and comparing it with a Random Forest model. In their work, the multilayer perceptron achieved the highest performance, with an Area Under the Curve (AUC) of 0.94. Additionally, Li et al. [
12] utilized convolutional neural networks to delineate the mandibular condyle, articular eminence, and TMJ disc. A Dice coefficient of approximately 0.7 was obtained for the articular disc, while values greater than 0.9 were achieved for the mandibular condyle. Beyond MRI, some studies have explored CBCT images for TMJ analysis. Mao et al. [
13] developed an automated system for diagnosing degenerative TMJ disease using a YOLOv10-based algorithm. Choi et al. [
14] evaluated multiple architectures (Res18, Res50, Res101, VGG16, VGG19, and GoogleNet) to diagnose joint disease, with GoogleNet yielding an F1-score of 0.72. However, these studies do not involve ultrasound imaging of the TMJ, and research utilizing ultrasound in this context remains scarce. Currently, the only study reported is that of Lasek et al. [
4], upon which the current work is based.
Lasek et al. proposed and validated an artificial intelligence-driven approach for the automatic and consistent measurement of TMJ space width using ultrasound imaging. Their methodology encompassed the evaluation of seven deep learning architectures: Attention U-Net, U-Net++, DeepLabv3, SegResNet, SegResNet with a Variational Autoencoder, Residual U-Net, and V-Net. The goal was to segment three key TMJ structures: the mandibular condyle, the joint space, and the glenoid fossa of the temporal bone. Among the models assessed, Residual U-Net exhibited the highest performance, reaching a Dice coefficient of 0.75. These findings underscore the complexity of accurately segmenting TMJ components in ultrasound images and reflect the persistent challenges associated with this task [
4], and the need for improved models capable of enhancing segmentation accuracy.
Recent studies have explored the challenge of detecting weak or asymmetric features in medical images using attention and feature fusion mechanisms. For instance, Rehman et al. [
15] proposed a hybrid Vision Transformer (ViT) and VGG-16 framework to detect architectural distortions in mammograms, effectively addressing texture heterogeneity and subtle structural asymmetries. Similarly, Pan et al. [
16] introduced YOLO-TARC, a YOLOv10 variant with token attention and residual convolution, achieving superior detection of small voids in dental X-ray images. These approaches highlight the relevance of attention-based and adaptive kernel mechanisms for enhancing the representation of subtle anatomical patterns, an aspect that is also crucial for accurate TMJ ultrasound segmentation.
Motivated by the scarce exploration of TMJ segmentation using ultrasound imaging, this work introduces a novel DenseUNet architecture. The model integrates asymmetric convolutional kernels, optimized via an iterated local search (ILS) metaheuristic, to better capture the directional texture variations characteristic of ultrasound data. Unlike approaches that rely on fixed symmetric kernels, our method introduces an automatic optimization strategy that adapts kernel shapes to ultrasound data, allowing the model to capture directional and elongated texture variations more effectively. Additionally, Squeeze-and-Excitation blocks are incorporated to enhance feature recalibration and improve representational capacity. The proposed approach achieved a Dice coefficient of 0.78 and an average processing time of 0.16 s per image, outperforming twelve architectures and demonstrating its capacity to advance the automatic analysis of TMJ structures in ultrasound images.
The remainder of this paper is organized as follows.
Section 2 details the dataset and outlines the methodologies employed in the proposed architecture.
Section 3 presents the experimental results and provides a comprehensive discussion, supported by illustrative images and graphs to aid interpretation. Lastly, the paper concludes with a summary of the principal findings.
3. Proposed DenseUNet with Asymmetric Kernels
The DenseUNet architecture was modified with two main improvements. First, the kernel sizes within the dense blocks were adjusted using ILS to generate an optimized asymmetric kernel, which enhances the capability of the model to capture elongated structures. Second, a squeeze-and-excitation (SE) block was integrated into each dense block, allowing the network to better model inter-channel dependencies. The following sections describe these changes in detail, and
Figure 4 illustrates the modified dense block, with the alterations highlighted in red.
Each dense block is composed of $l$ layers, where each layer applies a non-linear transformation $H_l(\cdot)$. In this work, $H_l(\cdot)$ is defined using $x \times y$ and $y \times x$ convolutional filters, followed by a ReLU activation and an SE block (see Figure 4). The convolutional kernels alternate between $x \times y$ and $y \times x$ depending on whether the layer index $l$ is even or odd. This design enables directional feature extraction in both vertical and horizontal orientations, thereby introducing asymmetry into the convolutions.
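The alternation of kernel orientations across layers can be illustrated with a minimal Python sketch. The function names `kernel_shape` and `weights_per_filter` are hypothetical, and the mapping of even layers to one orientation and odd layers to the transposed one is an assumption, since the text does not state which parity uses which orientation. The 1 × 13 pair and the four layers per block are the values reported later in the paper; the 16 input channels are an illustrative choice.

```python
def kernel_shape(layer_index, x, y):
    """Return the (height, width) kernel for one dense-block layer.

    Assumption: even-indexed layers use x-by-y and odd-indexed layers use
    the transposed y-by-x kernel; the parity mapping is not given in the text.
    """
    return (x, y) if layer_index % 2 == 0 else (y, x)


def weights_per_filter(shape, in_channels):
    """Number of weights in a single convolutional filter (bias excluded)."""
    height, width = shape
    return height * width * in_channels


# Four layers per dense block, using the 1 x 13 / 13 x 1 pair.
shapes = [kernel_shape(i, 1, 13) for i in range(4)]

# Weight count of one asymmetric filter versus a square 13 x 13 filter,
# for an illustrative input of 16 channels.
asymmetric_weights = weights_per_filter((1, 13), in_channels=16)
square_weights = weights_per_filter((13, 13), in_channels=16)
```

Under these assumptions, a 1 × 13 filter carries thirteen times fewer weights than a 13 × 13 filter with the same reach along one axis, which is consistent with the shorter training times reported for elongated kernels.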
The kernel dimensions were optimized using an ILS algorithm, aiming to identify the most effective asymmetric configurations. The optimization objective was to maximize the segmentation accuracy, measured by the Dice coefficient on the validation set. The search process consisted of 150 iterations, exploring kernel sizes between 1 and 15 in each dimension, which defined the search space of possible configurations. At each iteration, a candidate configuration was evaluated, and the best-performing solution was retained. The neighborhood of a solution was defined by varying one kernel dimension (x or y) by ±1 while keeping the other fixed. If no improvement in the Dice coefficient was observed after a local search, a perturbation was applied by randomly modifying one of the kernel dimensions of a convolutional filter to escape local optima. A new configuration was accepted only if it achieved a higher Dice score than the current best. The search terminated when no improvement was observed over five consecutive iterations or when the iteration limit (150) was reached.
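The search procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: a configuration is taken to be a single (x, y) pair, each iteration scores all neighbors of the current configuration rather than one candidate, and `toy_score` is a hypothetical stand-in for the validation Dice coefficient, which in practice would require training a model per configuration. All function names are illustrative.

```python
import random

LOW, HIGH = 1, 15            # kernel sizes explored in the text
MAX_ITERS, PATIENCE = 150, 5  # iteration limit and stall limit from the text

# Assumption: one (x, y) pair per configuration, giving 15 * 15 candidates.
n_configs = (HIGH - LOW + 1) ** 2


def neighbors(cfg):
    """Configurations reached by changing x or y by +/-1, kept within bounds."""
    x, y = cfg
    cands = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in cands if LOW <= a <= HIGH and LOW <= b <= HIGH]


def perturb(cfg, rng):
    """Randomly replace one of the two kernel dimensions."""
    x, y = cfg
    new = rng.randint(LOW, HIGH)
    return (new, y) if rng.random() < 0.5 else (x, new)


def iterated_local_search(score, start, seed=0):
    rng = random.Random(seed)
    cache = {}

    def evaluate(cfg):
        # Score each distinct configuration only once.
        if cfg not in cache:
            cache[cfg] = score(cfg)
        return cache[cfg]

    best = current = start
    best_score = evaluate(start)
    stale = 0
    for _ in range(MAX_ITERS):
        # Local search step: best of the current configuration and its neighbors.
        candidate = max(neighbors(current) + [current], key=evaluate)
        if evaluate(candidate) > best_score:
            # Accept only strict improvements over the best score so far.
            best, best_score = candidate, evaluate(candidate)
            current, stale = candidate, 0
        else:
            stale += 1
            if stale >= PATIENCE:
                break
            # Perturbation: restart the local search near the best solution.
            current = perturb(best, rng)
    return best, best_score


def toy_score(cfg):
    """Hypothetical objective peaking at (1, 13); not a real Dice score."""
    return -(abs(cfg[0] - 1) + abs(cfg[1] - 13))


best_cfg, best_val = iterated_local_search(toy_score, start=(7, 7))
```

With the toy objective, the search climbs from (7, 7) to (1, 13) and stops after five consecutive iterations without improvement. A real objective would be far noisier, so the path and stopping point would differ.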
The SE block recalibrates the convolutional feature maps [
26] through three phases: squeeze, excitation, and rescaling. During the squeeze step, global average pooling reduces the spatial dimension of each feature map, generating a channel descriptor. Next, in the excitation step, two fully connected layers with ReLU and sigmoid activation are used to model non-linear inter-channel dependencies. Finally, in the rescaling phase, the learned channel weights are applied to the original features, highlighting the most informative channels and suppressing less relevant ones.
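The three phases can be made concrete with a small NumPy sketch. It is an illustration under assumptions rather than the network code used in this work: the reduction ratio `r`, the tensor sizes, and the randomly initialized weights are all hypothetical, and a trained model would learn `w1`, `b1`, `w2`, and `b2`.

```python
import numpy as np


def squeeze_excite(features, w1, b1, w2, b2):
    """Apply squeeze-and-excitation to a feature map of shape (H, W, C)."""
    # Squeeze: global average pooling yields one descriptor per channel.
    z = features.mean(axis=(0, 1))                        # shape (C,)
    # Excitation: fully connected + ReLU, then fully connected + sigmoid.
    hidden = np.maximum(z @ w1 + b1, 0.0)                 # shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # shape (C,)
    # Rescaling: broadcast the channel weights over the spatial dimensions.
    return features * weights, weights


rng = np.random.default_rng(0)
C, r = 16, 4                                  # channels and assumed reduction ratio
features = rng.normal(size=(8, 8, C))         # illustrative feature map
w1, b1 = rng.normal(size=(C, C // r)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C // r, C)), np.zeros(C)

out, weights = squeeze_excite(features, w1, b1, w2, b2)
```

The output keeps the shape of the input. Each channel is simply scaled by a value between 0 and 1, so the block changes channel emphasis without altering spatial structure.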
The proposed model preserves the overall organization of the DenseUNet introduced by Cao et al. [
20] (
Figure 5), consisting of three stages: encoder, bridge, and decoder. The encoder is built from a sequence of dense blocks followed by transition layers, repeated four times to extract progressively deeper features. The bridge contains an additional dense block that encodes abstract global representations. The decoder mirrors the encoder structure, with up-sampling layers followed by dense blocks, also repeated four times, to gradually refine the segmentation output. Each dense block was set to include four layers with a growth rate of 16, determining the number of feature maps produced per layer. The network was trained using 75% of the images for training and the remaining 25% was used for validation and testing. It is important to mention that the dataset was already organized into predefined subfolders, so a random split was not required. Training was performed for 30 epochs with a batch size of 4, using categorical cross-entropy as the loss function. The model was optimized with the Adam optimizer using its default parameters (learning rate = 0.001,
$\beta_1$ = 0.9, $\beta_2$ = 0.999, with $\epsilon$ left at its default value). To ensure reproducibility, all experiments were conducted with fixed random seeds across NumPy, TensorFlow, and Python environments.
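The dense-block settings given above (four layers, growth rate of 16) determine how the channel count grows inside a block. The sketch below assumes standard DenseNet-style concatenation, in which each layer receives all previously produced feature maps. The function name and the 32 input channels are hypothetical.

```python
def dense_block_channels(in_channels, num_layers=4, growth_rate=16):
    """Channels seen by each layer and emitted by the whole block.

    Assumption: every layer adds `growth_rate` feature maps to the stack
    that is passed on to the following layers.
    """
    inputs_per_layer = [in_channels + i * growth_rate for i in range(num_layers)]
    out_channels = in_channels + num_layers * growth_rate
    return inputs_per_layer, out_channels


# Illustrative input of 32 channels.
inputs_per_layer, out_channels = dense_block_channels(32)
```

Under this assumption, a block adds 4 × 16 = 64 feature maps to whatever it receives.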
To evaluate the performance of the proposed model, standard segmentation metrics were employed and compared with existing approaches [
4]. The chosen metrics were the Dice coefficient, Precision, Recall, and F1-score, all of which compare the predicted segmentation to the ground truth. These metrics take values between 0 and 1, where higher values indicate greater similarity. For this work, assuming that the target class is assigned the label value 1 among the four possible class labels (0: background, 1: MC, 2: JS, and 3: GF), true positives (
TP) correspond to pixels correctly predicted as class 1. False positives (
FP) are pixels incorrectly predicted as class 1 but actually belonging to classes 0, 2, or 3. False negatives (
FN) refer to pixels from class 1 that were incorrectly labeled as 0, 2, or 3. The definitions of the metrics are as follows in Equations (
2)–(
5):
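As a hedged illustration of how these quantities combine, the following Python sketch computes the four metrics for one target class from the TP, FP, and FN counts defined above. It uses the standard textbook definitions, which is an assumption about the exact form of Equations (2)–(5). The function name, the example labels, and the convention of returning 0.0 for an empty denominator are illustrative choices.

```python
def class_metrics(truth, pred, target=1):
    """Dice, precision, recall and F1 for one class from flat label lists."""
    pairs = list(zip(truth, pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)

    # Standard definitions; 0.0 is returned when a denominator is empty.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "dice": dice,
            "precision": precision, "recall": recall, "f1": f1}


# Labels: 0 background, 1 MC, 2 JS, 3 GF (as in the text).
truth = [0, 1, 1, 1, 2, 3, 0, 1]
pred = [0, 1, 1, 0, 1, 3, 0, 2]
metrics = class_metrics(truth, pred, target=1)
```

In this toy example, two pixels of class 1 are recovered, one pixel of class 2 is wrongly assigned to class 1, and two pixels of class 1 are missed.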
4. Results and Discussion
This section presents the results obtained from the experiments. First, the results of the ILS algorithm are discussed, analyzing the impact of kernel size selection on the Dice coefficient and computation time. Next, the performance of the proposed DenseUNet model is evaluated by comparing it with reference architectures. A quantitative analysis is provided to illustrate the strengths and limitations of the proposed approach. Finally, a visual analysis is presented to examine the segmentation results from a qualitative perspective.
The experiments presented in this section were conducted using the servers of Laboratorio de Supercómputo del Bajío (Lab-SB), with the following specifications: 128 GB of RAM, an Intel Xeon Silver 4214 processor, and an NVIDIA Titan RTX 24 GB graphics card. All implementations were developed in Python 3.10.12, utilizing TensorFlow 2.1.0.
The proposed architectural improvements were specifically applied to the dense block of DenseUNet, including kernel size adjustments and the integration of an SE block. To identify the optimal configuration of the asymmetric kernel, a search was performed using the ILS algorithm.
Figure 6a depicts the evolution of filter combinations across 150 iterations. The y-axis represents the iteration number, while the x-axis corresponds to performance, measured by the Dice coefficient (ranging from 0 for low performance to 1 for high performance). Each model was trained for 30 epochs under identical hyperparameters, with kernel size as the only varying factor.
The ILS algorithm evaluated the proposed DenseUNet architecture across 150 iterations, using the Dice coefficient as the selection criterion. As shown in
Figure 6a, the best-performing kernel combination emerged in iteration 1 (green dot), where asymmetric kernels of sizes 1 × 13 and 13 × 1 achieved a Dice coefficient of 0.785. No subsequent configuration exceeded this performance. Other competitive results, also highlighted in green, were obtained with the asymmetric kernels 7 × 6 and 2 × 8, both reaching a Dice coefficient of 0.752.
Conversely, the ILS algorithm also assessed conventional kernel configurations, represented by red dots in
Figure 6a. Standard kernels commonly used in the literature, such as 3 × 3 and 5 × 5, achieved Dice coefficients of 0.409 and 0.244, respectively, highlighting their limited ability to adapt to the data characteristics. Extremely large kernels, like 13 × 13, and very small ones, such as 1 × 1, were also tested, yielding similarly low performance. Among conventional kernels, the 3 × 3 configuration performed best, yet its Dice coefficient remained substantially lower than that of the asymmetric 1 × 13 kernel. Overall, most kernel combinations produced Dice scores between 0.2 and 0.3, demonstrating that the majority of conventional kernels were ineffective at capturing the relevant features of the dataset.
Figure 6b depicts the relationship between the various kernel configurations and the processing time of the DenseUNet architecture. The top-performing kernels are indicated by blue dots, and the dashed blue line represents the first quartile of execution times among the evaluated models. Notably, the 1 × 13 kernel again stands out, exhibiting the shortest training time among the most accurate kernels, falling below the first quartile. This configuration required 5423 s to complete 30 training epochs. The total computational time for the 150 iterations conducted by the ILS algorithm, covering 150 models within the search space of possible combinations, was 496.11 h (approximately 20.7 days). Most of this time was consumed by large kernel configurations, such as 15 × 15 (iteration 149, 33,602 s) and 14 × 14 (iteration 132, 29,056 s). In contrast, smaller asymmetric kernels, such as 1 × 9 (iteration 58, 2106 s) and 1 × 4 (iteration 101, 2484 s), resulted in the shortest execution times.
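The reported timings can be put in perspective with a short calculation. The figures are those given in the text; the variable names and the derived mean time per model are illustrative.

```python
TOTAL_HOURS = 496.11        # total ILS time reported in the text
NUM_MODELS = 150            # one trained model per iteration
BEST_KERNEL_SECONDS = 5423  # 30 epochs with the 1 x 13 kernel

total_days = TOTAL_HOURS / 24
mean_seconds_per_model = TOTAL_HOURS * 3600 / NUM_MODELS
best_kernel_minutes = BEST_KERNEL_SECONDS / 60
ratio_to_mean = BEST_KERNEL_SECONDS / mean_seconds_per_model
```

The 1 × 13 configuration therefore trained in roughly 90 min, less than half of the mean training time per evaluated model.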
After evaluating the asymmetric kernel using the ILS algorithm, the best-performing architecture was employed for the final training phase. In order to maintain methodological consistency with the reference study used for comparison, we report a single training-testing split without cross-validation. This approach, together with the small number of samples available for testing, limits the generalizability of the conclusions drawn from the results. Nonetheless, note that the tests evaluate performance on a pixel-wise basis, providing an in-depth view of segmentation behavior. To analyze the behavior of the proposed architecture concerning each class, the confusion matrix shown in
Figure 7 was generated. This matrix was normalized in the range of 0 to 1, since the comparison between the segmented images and the ground-truth was performed at the pixel level. In this matrix, values close to 1 (represented in dark blue) indicate a high degree of similarity with the ground-truth, whereas values close to 0 (light tones) reflect low similarity between the predicted segmentation and the reference.
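A pixel-level confusion matrix of this kind can be sketched as follows. The text states only that the matrix was normalized to the range 0 to 1; normalizing each row by the number of pixels of the true class is an assumption made here, and the function name and example labels are hypothetical.

```python
def normalized_confusion(truth, pred, num_classes=4):
    """Row-normalized confusion matrix from flat label sequences.

    Rows index the true class and columns the predicted class.
    Assumption: each row is divided by the pixel count of its true class.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Labels: 0 background, 1 MC, 2 JS, 3 GF (as in the text).
truth = [0, 0, 0, 0, 1, 1, 2, 3]
pred = [0, 0, 0, 1, 1, 0, 2, 3]
matrix = normalized_confusion(truth, pred)
```

With row normalization, the diagonal entry of each row is the recall of that class, and the off-diagonal entries show where the missed pixels were assigned.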
The confusion matrix provides insights into both correct classifications and the types of errors made by the model. As a large proportion of image pixels correspond to the background, the model achieved high accuracy in this class. However, the mandibular condyle proved to be the most challenging structure to segment, yielding the lowest performance. This indicates that the model struggled to learn distinctive features for this region, leading to frequent misclassification of its pixels as background and, to a lesser extent, as joint space. In contrast, the joint space class exhibited notably better segmentation results, although some of its pixels were still incorrectly assigned to the background. A similar behavior was observed for the glenoid fossa, where residual confusion with background pixels persisted. The per-class Dice coefficients further clarify these findings, complementing the confusion matrix. The proposed model achieved Dice scores of 0.49 for the mandibular condyle, 0.80 for the joint space, and 0.86 for the glenoid fossa. These discrepancies can be partly attributed to the high entropy and speckle noise inherent to ultrasound images, which complicate the distinction between anatomical boundaries and non-informative regions.
In
Figure 8, the behavior of the
TP, TN,
FP, and
FN values can be observed for one of the three examples previously shown in
Figure 2. The image shows the overlap between the prediction (highlighted in red) and the ground truth (highlighted in green). The yellow color represents the true positives (
TP), i.e., the regions where the prediction matches the ground truth. The black color corresponds to the true negatives (
TN), which belong to the background of the image. Meanwhile, red indicates the false positives (
FP), areas that the model predicted as positive but are not, whereas green represents the false negatives (
FN), regions that the model failed to detect correctly. In
Figure 8, the proposed model demonstrates strong performance in segmenting the three classes. However, some regions exhibit over-segmentation, which can be attributed to speckle noise and the inherent intensity variability of ultrasound images. These conditions may induce spurious activations, particularly when asymmetric kernels and channel recalibration mechanisms are employed. Nevertheless, the integration of SE blocks and asymmetric convolutions significantly enhances the sensitivity of the model to low-contrast structures.
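The color coding of the overlay follows directly from placing the prediction in the red channel and the ground truth in the green channel. The sketch below illustrates this for binary masks. It is an assumed reconstruction of how such an overlay can be produced, with hypothetical function names, not the plotting code used for the figure.

```python
def overlay_pixel(truth, pred):
    """RGB color of one pixel: prediction in red, ground truth in green."""
    return (255 if pred else 0, 255 if truth else 0, 0)


def overlay(truth_mask, pred_mask):
    """Build an RGB overlay from two binary masks of equal size."""
    return [[overlay_pixel(t, p) for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(truth_mask, pred_mask)]


truth_mask = [[1, 1], [0, 0]]
pred_mask = [[1, 0], [1, 0]]
image = overlay(truth_mask, pred_mask)
```

Where both masks are set, red and green add up to yellow (TP). Red alone marks FP, green alone marks FN, and black marks TN.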
A comparison was conducted between the proposed DenseUNet architecture and several representative state-of-the-art models, as shown in
Table 1. The evaluated architectures include Attention U-Net, U-Net++, DeepLabv3, SegResNet, SegResNet-VAE, Residual U-Net, and V-Net. In addition, the table presents a progressive evaluation of different architectural configurations, which can be interpreted as an ablation study demonstrating the individual and combined impact of dense connections, SE blocks, and asymmetric kernels in the proposed DenseUNet variant. Four metrics were used for this comparison: Dice coefficient, precision, recall, and F1-score. As shown in
Table 1, the proposed architecture achieved the best performance in terms of the Dice coefficient, reaching a value of 0.78. It was followed closely by the Residual U-Net architecture proposed by Lasek et al. [
4]. A similar trend was observed for the precision metric, where the proposed model also ranked first, with a value of 0.84. In terms of recall, the proposed architecture did not achieve the highest score, being outperformed by V-Net. However, this higher recall came at the cost of lower precision, indicating that V-Net exhibits a bias toward sensitivity, favoring the detection of true positives while also increasing the number of false positives.
Both the original DenseUNet and the DenseUNet-SE variants showed considerably lower performance, with the latter scoring below 0.30 across all evaluation metrics. The DenseUNet version with SE blocks performed worse than the baseline, likely because the SE mechanism focuses on global feature recalibration, which may not adequately capture specific structural patterns, such as elongated anatomical features. By relying on aggregated information from previous layers, SE blocks can overlook critical details related to the spatial orientation of these structures. In contrast, combining SE blocks with asymmetric kernels enables the network to more effectively focus on elongated features, which likely explains the substantial performance gains observed in the DenseUNet-SE model with asymmetric kernels. Meanwhile, other architectures such as Residual U-Net and SegResNet achieved balanced results across the evaluation metrics but were still outperformed by the proposed approach. These results further support the idea that the use of asymmetric kernels enhances the model’s ability to handle complex image segmentation tasks.
Overall, the results indicate a clear improvement from the original DenseUNet (Dice coefficient: 0.53) to the enhanced version incorporating asymmetric kernels (Dice: 0.78), highlighting the effectiveness of this structural modification. This gain can be attributed to the enhanced capacity of the asymmetric kernels to capture diverse spatial patterns. Furthermore, the proposed architecture achieved an average processing time of 0.16 s per image, demonstrating both high computational efficiency and robust segmentation performance.
The results of the proposed DenseUNet model outperformed those reported in the reference study. The optimal model from [
4], a Residual U-Net, primarily addresses the vanishing gradient issue. This aspect motivated the selection of the DenseUNet architecture, which utilizes dense connections to maintain gradient flow and better capture morphological features. Additionally, because DenseNet and DenseUNet were developed as alternatives to the ResNet and Residual U-Net architectures, it was considered relevant to compare them on the dataset from Lasek et al. [
17].
Figure 9 presents randomly selected examples of the segmentation performed by the proposed DenseUNet model, compared with the reference for each class. The results indicate that the segmentations generated by the model closely resemble the reference. However, as evidenced by the quantitative results, notable discrepancies persist, particularly in the mandibular condyle illustrated in
Figure 9b, where the segmented structure is often incomplete or exhibits significant deviations from the reference. A similar, albeit less pronounced, issue is observed in the joint space and glenoid fossa, illustrated in
Figure 9d,f.
Both the original images and the ground truth exhibit a predominance of elongated shapes. These morphological characteristics motivated the application of asymmetric kernels, inspired by previous studies that have demonstrated their effectiveness in reducing computational costs [
33,
34]. However, we specifically evaluated their impact on TMJ structure segmentation. To assess their effectiveness, the performance of asymmetric filters (e.g., 1 × N) was compared with conventional kernels (e.g., 3 × 3, 5 × 5, 7 × 7, etc.) using the ILS algorithm. The results shown in
Figure 6a indicate that asymmetric kernels outperformed conventional ones in this context. The key advantage of asymmetric kernels lies in their directional design: while conventional kernels are isotropic, capturing patterns equally in all directions, asymmetric filters prioritize specific orientations. These properties allow for better adaptation to elongated morphologies. In contrast, conventional kernels may under-represent or even omit critical structural information due to their omnidirectional nature. Lasek et al. [
4] highlight the use of morphological operations as a post-processing step to refine segmentation outputs. In contrast, the proposed architecture does not incorporate any post-processing techniques, which suggests that the inclusion of such operations could further improve segmentation accuracy, particularly for structures with complex morphology, such as the mandibular condyle.