1. Introduction
Hallux valgus (HV) is a foot deformity in which the angle between the first metatarsal bone and the proximal phalanx of the big toe (the hallux) is greater than or equal to 15 degrees. It can arise from a variety of causes. Hallux valgus is common worldwide, with an overall prevalence of 19%, meaning that roughly one in five people suffer from hallux valgus to some degree; among women the prevalence is as high as 23.74% [1]. As shown in Figure 1, Figure 1a shows a foot with HV and Figure 1b shows a healthy foot. Clinically, X-rays are often used to determine the diagnosis and treatment plan for hallux valgus. However, X-rays expose the patient to a certain amount of radiation [2], and taking them requires a hospital visit, which is inconvenient for patients. Therefore, an easy-to-use method for initial hallux valgus screening from non-radiographic photographs would greatly improve clinicians' efficiency and reduce patients' discomfort and costs.
Medical imaging refers to image information formed by the morphological or functional state of tissues (mainly cells) in living organisms, with the human body as the main subject of study. It covers both the visualization of the subject's anatomy, physiological processes, and pathological states, and the techniques and theory for quantitatively analyzing these measurements [3]. With the development of science and technology, modern medicine has entered the era of computer-aided diagnosis and treatment, in which medical image processing uses computers to emulate human visual functions such as perception and recognition [4].
In recent years, deep learning [5], an artificial intelligence technique, has developed rapidly and become one of the most advanced technologies in the field. It is increasingly being applied to medical-imaging problems such as image segmentation, lesion localization, and disease prediction, and its widening adoption has greatly improved the efficiency of medical staff [6,7,8].
However, examinations such as X-ray, CT, MRI, and ultrasound are costly, and in most cases the resulting images are interpreted manually by doctors, which limits diagnostic efficiency; the repetitive nature of the work can also affect the accuracy of manual judgments. This study therefore aims to determine the presence of hallux valgus directly from non-radiographic foot images using an improved MobileNetV3. After data augmentation, our training set contains more than 10,000 foot photographs from multiple sources, which provides more data for model training and performance improvement than existing studies. Furthermore, unlike traditional methods based on geometric analysis and manual labeling, the goal of this study is to construct a non-radiographic, economical, and efficient HV detection system. By combining a foot-image dataset (with data augmentation) with modern deep learning methods, features in non-radiographic foot images can be learned automatically, enabling faster and more accurate HV detection.
At the same time, the practical utility of our approach lies in three key scenarios:
Mass Screening Programs: In high-volume settings (e.g., school screenings, geriatric care facilities), automated HV detection can prioritize cases requiring urgent clinical evaluation, reducing manual triage efforts.
Telemedicine for Remote Areas: Patients in rural or low-resource regions can self-capture foot images via mobile apps for preliminary assessment, minimizing travel costs and wait times.
Longitudinal Monitoring: HV progression can be tracked in postoperative or at-risk patients through periodic photo uploads, avoiding repeated radiation exposure from X-rays.
These applications align with WHO recommendations for leveraging AI to bridge healthcare disparities [9].
3. Materials and Methods
To address these challenges, we propose a CBAM-enhanced MobileNetV3 combined with data augmentation. We detail our approach in this section.
3.1. Neural Network
This paper proposes an improved method based on MobileNetV3. We replace the SE (squeeze-and-excitation) attention mechanism in the inverted residual block with a CBAM (convolutional block attention module). The SE module can enhance features, but it acts only on the channel dimension and does not take spatial information into account. The CBAM, in contrast, combines spatial and channel information and is therefore better suited to the detection of hallux valgus.
3.1.1. Overview of MobileNet V3
MobileNetV3 is Google's further optimization of MobileNetV1 and MobileNetV2 [21,22]. It additionally introduces the SE module and combines resource-constrained NAS with NetAdapt to design the most suitable network structures for its Large and Small variants [23]. MobileNetV3 also adopts the h-swish activation function, which achieves better performance in deeper layers. Building on MobileNetV2, MobileNetV3 merges the last depthwise separable convolution module with the output convolution layer, moves the average pooling forward to reduce the feature size, and removes the 3 × 3 depthwise convolution layer that does not change the feature dimension. This greatly reduces the computational cost of the output layer, improving the model's inference speed and efficiency. Thanks to these improvements, MobileNetV3 is better suited to low-resource environments than other classification models: on mobile devices (such as phones and embedded devices) it can run in real time, with a 25% improvement in inference speed, and on the ImageNet classification task it achieves a Top-1 accuracy of 75.2%, compared with 71.8% for MobileNetV2, an improvement of 3.4 percentage points.
3.1.2. Modification of SE to CBAM
The SE module improves the learning ability of the model by letting the network adaptively reweight channels through learning (i.e., dynamically recalibrate channel features). In MobileNetV3, the SE module applies global average pooling to the feature map output by a convolutional layer and passes the resulting per-channel descriptor vector to a fully connected (FC) layer that estimates the importance of each channel; a second FC layer followed by a hard-sigmoid activation then scales each channel's importance to the interval [0, 1]. Finally, each channel of the feature map is multiplied by its importance coefficient, so that the network adaptively focuses on the more informative channels and suppresses irrelevant or unimportant ones. The SE module thus provides effective channel recalibration, but it operates only along the channel dimension, ignoring spatial dependencies within the feature map. This limitation is particularly problematic in medical image analysis, where the spatial location of specific regions in the image is critical for accurate diagnosis.
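For concreteness, the following is a minimal PyTorch sketch of an SE block as described above (global average pooling, two FC layers, hard-sigmoid gating); the class name and reduction ratio are illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pool
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # first FC: bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # second FC: restore dims
            nn.Hardsigmoid(),                            # scale importance into [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                     # excite: reweight each channel
```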
To address this problem, we replace the SE module with the CBAM [24], a more powerful attention mechanism that operates in both the channel and spatial dimensions and thereby optimizes the feature representation by considering both what to focus on and where to focus. The CBAM consists of two components: a channel attention module and a spatial attention module. The input is first processed by channel attention: global max pooling and global average pooling are applied, the pooled descriptors are passed through a shared fully connected network, and a sigmoid activation produces per-channel weights, which are multiplied with each channel of the original feature map to obtain a channel-weighted feature map. The channel-refined features are then passed to spatial attention: max pooling and average pooling are performed along the channel dimension to generate features with different contextual scales, and a convolutional layer followed by a sigmoid computes the spatial attention weights, identifying which spatial positions in the feature map matter most.
By weighting the feature map with both channel and spatial attention, the CBAM not only improves the representational ability of the model but also improves its performance across a range of visual tasks.
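The following sketch illustrates the two CBAM components in PyTorch; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM paper's defaults and are assumptions here, not our tuned settings.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for both pooled descriptors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))         # global max pooling branch
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)         # average pooling along channels
        mx = x.amax(dim=1, keepdim=True)          # max pooling along channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))                # channel attention first, then spatial
```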
Figure 2 shows the workflow of the improved MobileNetV3 model. The raw pixels of the foot image (captured with the big toe facing upward) are used as input, and initial feature extraction is performed by a standard convolutional layer with the h-swish activation function. The model then applies a series of inverted residual blocks, each consisting of a depthwise separable convolution and a CBAM, to enhance feature extraction. Finally, a global average pooling (GAP) layer spatially compresses all channels, and a softmax layer computes the probability distribution over the classes and outputs the predicted hallux valgus result.
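A sketch of one such inverted residual block, with the CBAM (as defined above) in place of the SE module, is shown below; the expansion/output channel arguments and stride are illustrative rather than the model's exact configuration.

```python
import torch.nn as nn

class InvertedResidualCBAM(nn.Module):
    def __init__(self, in_ch: int, exp_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, exp_ch, 1, bias=False),       # 1x1 expansion conv
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(inplace=True),
            nn.Conv2d(exp_ch, exp_ch, 3, stride, 1,        # 3x3 depthwise conv
                      groups=exp_ch, bias=False),
            nn.BatchNorm2d(exp_ch),
            nn.Hardswish(inplace=True),
            CBAM(exp_ch),                                  # CBAM replaces the SE module
            nn.Conv2d(exp_ch, out_ch, 1, bias=False),      # 1x1 projection conv
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out           # residual skip when shapes match
```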
3.2. Dataset and Pre-Processing
To ensure the diversity and comprehensiveness of the data, covering different populations and shooting conditions, we obtained non-radiographic images of hallux valgus from several institutions and platforms. We collected a total of 2934 foot images: 1843 images of feet with hallux valgus and 1091 images of healthy feet. Of these, 900 HV images were provided by Beijing Tongren Hospital, Capital Medical University, and the remaining 2034 images came from the Roboflow platform (the world's largest collection of open-source computer vision datasets). To support efficient training and evaluation, all images were resized to 224 × 224 pixels as model input, which reduces computational cost while preserving the relevant image detail. The dataset was reviewed and labeled by medical experts to ensure its accuracy and reliability.
The backgrounds of some images in the dataset are overly uniform, which could cause the model to mistake background information for useful features during learning and harm its generalization and robustness. To address this, we used the Segment Anything Model (SAM) to accurately extract the foot region [25] and then composited the extracted foot onto randomly chosen background images (see Figure 3) to simulate a variety of real shooting scenarios, such as different lighting conditions, background colors, and levels of environmental complexity. This significantly increased data diversity, enabling the model to maintain good performance in a wider range of application environments.
In addition, to improve the generalizability of the model, we expanded the dataset with rotation, flipping, brightness enhancement, noise injection, and other data augmentation methods to increase data diversity. Examples of these techniques applied to randomly selected images are shown in Figure 4. The preprocessing and augmentation steps described above give a more faithful picture of real-world performance and help the model achieve high accuracy and stability in practical applications. Note that because we use five-fold cross-validation, in each cross-validation iteration the background replacement for the training set is generated on the fly by calling the Segment Anything Model (SAM), while the validation set always remains the original, unmodified data.
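An illustrative torchvision pipeline for these augmentations is sketched below; the specific rotation range, brightness factor, and noise level are assumptions, as the paper does not report exact values.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Additive Gaussian noise on a [0, 1] image tensor."""
    def __init__(self, std: float = 0.02):
        self.std = std

    def __call__(self, t: torch.Tensor) -> torch.Tensor:
        return (t + torch.randn_like(t) * self.std).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # match the model's input size
    transforms.RandomRotation(degrees=15),    # random rotation
    transforms.RandomHorizontalFlip(p=0.5),   # random flip
    transforms.ColorJitter(brightness=0.3),   # brightness enhancement
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),               # noise injection
])
```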
3.3. Mobile Application Implementation
We developed a custom mobile app for the detection of HV from foot images. The app is designed as a convenient screening tool that can directly classify HV deformities from patient foot photographs. Its core functionality allows users (patients or clinicians) to take or upload an image of the foot, which is then analyzed by a trained deep learning model. The app provides users with a binary result (i.e., whether or not HV is present) and visual feedback in text form. The mobile solution makes HV screening more convenient and immediate, without the need for specialized imaging equipment. Users can take a new photo of the foot with their smartphone camera or select an existing image from their device's gallery. Once the image is obtained, the app uploads it to the server for analysis. Upon receipt, the server resizes the segmented foot image to 224 × 224 pixels to match the input size of the deep learning model, ensuring consistency with the training images. The model then processes the image and generates a classification output indicating the presence or absence of HV. Finally, the server returns the inference result to the mobile app in real time and presents it to the user as text.
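To illustrate this client-server flow, the following is a minimal server-side sketch; the Flask framework, route name, checkpoint path, and label strings are assumptions for illustration, not the app's actual implementation.

```python
import io
import torch
from flask import Flask, request, jsonify
from PIL import Image
from torchvision import transforms

app = Flask(__name__)
model = torch.load("improved_mobilenetv3.pt", map_location="cpu")  # assumed checkpoint
model.eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),            # match the training input size
    transforms.ToTensor(),
])

@app.route("/predict", methods=["POST"])
def predict():
    # Decode the uploaded image and run a single forward pass.
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    x = preprocess(image).unsqueeze(0)        # add batch dimension
    with torch.no_grad():
        prob = torch.softmax(model(x), dim=1)[0]
    label = "HV present" if prob[1] > prob[0] else "HV absent"
    return jsonify({"result": label, "confidence": float(prob.max())})
```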
3.4. Training and Testing
To rigorously evaluate the effectiveness and generalization performance of the improved MobileNetV3 model, we employ five-fold cross-validation rather than a single train-validate-test split. Specifically, we randomly divide the entire dataset into five equal-sized folds. In each iteration, one fold is used as the validation set and the remaining four folds are used for training, at which point we apply the preprocessing described in Section 3.2 to the training set. Because the augmentation is applied after the folds are split, it does not cause data leakage. This process is repeated five times, so each fold serves once as the validation set.
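A minimal sketch of this protocol is given below, assuming `dataset` is an indexable sequence of (image, label) pairs and `train_and_evaluate` is a hypothetical helper that trains the model on one fold split and returns its validation metrics as an array.

```python
import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []

for train_idx, val_idx in kfold.split(np.arange(len(dataset))):
    train_set = [dataset[i] for i in train_idx]  # augmented on the fly (SAM backgrounds, etc.)
    val_set = [dataset[i] for i in val_idx]      # kept as original, unmodified images
    fold_metrics.append(train_and_evaluate(train_set, val_set))

# Final results are reported as mean ± standard deviation across the five folds.
print(np.mean(fold_metrics, axis=0), np.std(fold_metrics, axis=0))
```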
For each round, we calculated key performance metrics including accuracy, precision, recall, F1 score, and AUC. The final reported results are expressed as the mean ± standard deviation (SD) over all five folds. This cross-validation approach provides more robust and statistically reliable estimates of model performance while minimizing the impact of a potentially favorable or biased split. It also reduces the risk of overfitting or data leakage that single-split evaluations can incur and better reflects the model's ability to generalize to the real world.
3.5. Software Environment
The experiments were conducted using the PyTorch framework on an NVIDIA GeForce RTX 4090 graphics card (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of video memory. The system was built with Python 3.7.16 (Python Software Foundation, Wilmington, DE, USA) and PyTorch 2.5.1.
3.6. Evaluation of the Model
Evaluating the proposed network structure as well as several mainstream classification architectures, we compute the following performance measures:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.
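For concreteness, these per-fold metrics can be computed with scikit-learn; the following minimal sketch assumes `y_true`, `y_pred`, and `y_score` hold the ground-truth labels, predicted labels, and predicted HV probabilities for one validation fold.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def fold_metrics(y_true, y_pred, y_score):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),  # y_score: probability of the HV class
    }
```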
In addition, we introduce the standard deviation of the loss to assess the model's stability in the ablation experiments. Together, these criteria are used to evaluate the modified MobileNetV3 model.
4. Results
We conducted a comprehensive evaluation of the improved MobileNetV3 model’s effectiveness in detecting hallux valgus (HV) in foot images using an expanded dataset and multiple metrics. Over 100 epochs, we compared several models, including VGG-16, ResNet, AlexNet, DenseNet, EfficientNet, MobileViTv2, MobileNetV3, and MobileNetV2. The final performance metrics (accuracy, precision, recall, and F1 score) were calculated as the mean ± standard deviation across all five folds, ensuring a robust and statistically reliable evaluation of generalization ability. These metrics collectively assess a model’s ability to correctly classify data, balancing overall accuracy, false positives, false negatives, and the trade-off between precision and recall.
Table 1 presents the results of our testing. The improved MobileNetV3 achieved accuracy, precision, recall, and F1 score values of 95.52%, 95.51%, 95.53%, and 95.53%, respectively.
Figure 5 provides a graphical representation of these results, showing that the improved MobileNetV3 achieved the best performance among the compared models. Although the accuracy of MobileViTv2 is close to that of the improved MobileNetV3, its FLOPs and parameter count are much larger, which demands more computing power on the user's device and causes slightly higher latency, degrading the user experience. In contrast, the improved MobileNetV3 shows no significant increase in floating-point operations (FLOPs) or parameters compared with the original MobileNetV3: integrating the CBAM significantly improves overall classification performance without a substantial increase in computational cost.
To isolate the contributions of the CBAM module and the background augmentation, we conducted an ablation study with four model variants (Table 2). Replacing SE with CBAM (Variant 1) improved accuracy by 2.08 percentage points over the baseline (93.44% → 95.52%), demonstrating the CBAM's ability to exploit both spatial and channel attention for precise feature localization. Removing background augmentation from the CBAM model (Variant 2) caused a 2.12-point drop (95.52% → 93.40%), underscoring the importance of synthetic background diversity for generalization. Notably, even without augmentation, CBAM outperformed the SE baseline by 2.65 points (93.40% vs. 90.75%), further validating its architectural advantage. The computational cost of the CBAM remained minimal, with only a 1.72 M increase in FLOPs and no additional parameters.
The ablation experiments verify the effectiveness of the CBAM in helping the model focus on spatially important features; the SE module, although advantageous in computational overhead, remains inferior to the CBAM in performance. Overall, MobileNetV3+CBAM strikes a good balance between performance and computational complexity: the CBAM significantly improves MobileNetV3's performance and is well suited to tasks requiring high accuracy.
To further validate the model's generalizability, we conducted a site-held-out test in which the model was trained exclusively on publicly available data (Roboflow HV images and online healthy-foot images) and tested on an independent clinical dataset from Beijing Tongren Hospital. As shown in Table 3, the model achieved 95.23% accuracy and 0.9976 AUC on the hospital test set, only a marginal drop compared with the mixed-data evaluation (95.52% accuracy, 0.9983 AUC). This minimal discrepancy (accuracy: −0.29 percentage points; AUC: −0.0007) suggests that although domain shifts between clinical and online images exist (e.g., standardized hospital lighting vs. heterogeneous backgrounds), the proposed SAM-based data augmentation and CBAM attention mechanism effectively mitigate their impact. The consistently high performance (>95% across all metrics) underscores the model's practical utility in real-world clinical workflows, particularly for large-scale screenings requiring cross-institutional reliability.
Finally, to verify that the proposed model focuses on medically relevant features rather than spurious background artifacts or skin tone, we employed the Grad-CAM and XGrad-CAM visual explanation techniques. As shown in Figure 6, the highlighted activation regions concentrate clearly around the first metatarsophalangeal joint, the clinically significant region for hallux valgus detection. These visualizations demonstrate that the model learns to localize and attend to the deformity site associated with HV rather than relying on irrelevant background patterns, lighting, or color distribution. This suggests a high level of interpretability and anatomical consistency in the model's decision-making process, which is essential for clinical trust and deployment.
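These visualizations can be produced with the pytorch-grad-cam package, sketched below; the chosen target layer, HV class index, and variable names (`x`, `rgb_float_img`) are assumptions rather than our exact configuration.

```python
from pytorch_grad_cam import GradCAM, XGradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

target_layers = [model.features[-1]]      # last convolutional stage (assumed layer)
targets = [ClassifierOutputTarget(1)]     # class index 1 = HV (assumed label order)

for cam_cls in (GradCAM, XGradCAM):
    cam = cam_cls(model=model, target_layers=target_layers)
    heatmap = cam(input_tensor=x, targets=targets)[0]   # x: 1x3x224x224 input tensor
    # rgb_float_img: the same image as an HxWx3 float array in [0, 1]
    overlay = show_cam_on_image(rgb_float_img, heatmap, use_rgb=True)
```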
At the same time, we successfully deployed the model in the mobile app, as shown in Figure 7.
5. Conclusions
This paper proposes an improved MobileNetV3 model for detecting hallux valgus in photographs. Instead of relying on expensive X-ray images and time-consuming manual assessment, our method uses a deep learning model to detect hallux valgus in ordinary non-radiographic photographs. The improved MobileNetV3 takes the raw pixels of the image as input; in the inverted residual block, we replace the SE attention mechanism with the CBAM. The CBAM's spatial attention automatically learns to focus on the deformed area near the big toe rather than spreading attention over the entire foot, which substantially improves the quality of the predictions. The improved model outperforms the original MobileNetV3 and other classification models, reaching an accuracy of 95.52% in detecting hallux valgus from photographs, demonstrating its viability for (1) mass screening in community health campaigns, (2) telemedicine platforms serving remote populations, and (3) longitudinal monitoring of at-risk individuals. In future work, we will integrate this tool with electronic health records and validate its clinical impact through multi-center trials.