1. Introduction
The expansion of online shopping for apparel and footwear, particularly during the COVID-19 pandemic, has reduced the appeal of in-store shoe fittings to ascertain accurate sizing. Consequently, consumers frequently purchase multiple pairs in varying sizes and return those that do not fit, resulting in substantial financial and environmental repercussions. In Germany alone, approximately 286 million items are returned annually [1], with footwear constituting 21.4% of these returns, primarily due to sizing inaccuracies [2]. Each return incurs an average cost of €15.18 [1], translating to an estimated annual financial loss of €4.34 billion and 242,814 tons of CO2 emissions [3]. Addressing this challenge necessitates a user-friendly, precise system capable of measuring foot dimensions remotely to recommend the correct shoe size, thereby reducing return rates.
While numerous sizing solutions exist, an increasing number leverage computer vision technologies over traditional measuring tools, employing methods such as photogrammetry from 2D images, 3D point clouds, or image processing with 2D RGB imagery. However, almost all existing approaches require several images or measurements of the foot, or reference objects in the image [4,5,6], which makes them difficult for untrained individuals to use. The approach presented here therefore uses only one image, which is completed by SnowflakeNet, a network previously evaluated for this application [7], to create a complete scan of the foot. To determine how well the approach works in different recording positions and which position is most advantageous in practice, several camera views and foot rotations are examined. Since a few millimeters often determine whether a person needs one size or another, it is important that the system be robust in all positions. Knowing the optimal recording position and suggesting it to the user is nevertheless helpful. This study therefore investigates how the camera position affects the reconstruction accuracy of 3D foot scans using the chosen network.
Owing to the wide range of applications associated with 3D sensor technology, public accessibility to such sensors has significantly expanded. Notably, many contemporary smartphones—such as all Pro models of the iPhone beginning with the iPhone 12 Pro—are now equipped with time-of-flight (ToF) sensors integrated into their main cameras. These sensors are presently employed for tasks such as depth-of-field measurement in photography and three-dimensional spatial analysis, including room measurements. This technological advancement presents a promising opportunity to implement foot scanning solutions directly on consumer devices, thereby eliminating the necessity for users to acquire external hardware or physically visit retail footwear outlets.
2. Related Work
In general, the various methods for measuring a foot using computer vision can be classified into two principal categories: two-dimensional (2D) and three-dimensional (3D) measurements.
Two-dimensional approaches use RGB images to measure the foot, based on either a single image or multiple images. If multiple images are used, a three-dimensional representation of the foot can also be generated with techniques such as photogrammetry, as in [8]. Another noteworthy method was proposed by Boyne et al. [9], who acknowledged that conventional photogrammetry applications require a substantial number of images to accurately reconstruct an object. Their FOUND method is a three-dimensional foot geometry reconstruction technique that predicts surface normals and anatomical keypoints, together with uncertainty estimates, from a few calibrated RGB images. These predictions are used to optimize a parametric foot model through a multi-view loss that accounts for silhouette alignment, normal consistency, and keypoint accuracy, all weighted by the predicted uncertainties. However, since RGB images lack direct size information, all these approaches need a reference object within the image in order to determine the actual size of the foot. As in [10] or [11], an object of standardized, known size, such as an A4 sheet of paper or a table tennis ball, is often used for this purpose. When only a single image is used, a more traditional image processing approach is typical: the edges of the foot are extracted from the image, and the length and width of the foot are then determined with the help of the reference object. Other relevant data, such as instep height or ball girth, are typically not captured by these methods.
Three-dimensional methods employ depth cameras, which can be either time-of-flight or structured-light (SL) cameras. In contrast to two-dimensional methods, direct size data is recorded in three-dimensional space, obviating the need for reference objects in the image. As in [12] or [13], multiple scans from different sides of the foot are required to fully digitize and measure it. Another work that has identified the challenges associated with self-scanning the foot is that of Fogarty et al. [14]. They have refined partial Structure-from-Motion (SfM) reconstructions by resolving alignment ambiguities through a canonicalization module with viewpoint prediction. Subsequently, an attention-based neural network is employed to supplement the existing point cloud with missing geometry. Another approach by Nourbakhsh et al. [15] involves the use of a 3D scan of a hand to adapt and deform a standardized initialization hand. The 3D models obtained in this manner are subsequently employed for measurement purposes.
3. Method
This section describes the processing pipeline used to reconstruct incomplete foot scans and to determine the optimal acquisition position. The methodology comprises sequential steps, as illustrated in Figure 1. First, high-resolution foot scans are converted into standardized point clouds and normalized for network training. Next, partial point clouds are generated from the complete scans by simulating depth observations from a single view at predefined camera positions. Separate training datasets are created in this way for the single-view, foot rotation, and multi-view capture modes. SnowflakeNet is then trained with identical settings on each dataset, so that the effect of the camera view, and thus of the resulting partial input geometry, can be evaluated independently of architectural changes. Finally, the reconstructed point clouds are compared with the corresponding ground truth point clouds, using measurement-based errors for foot length and width, as well as the chamfer distance, to assess overall geometric similarity.
3.1. Network Architecture
In [7], it was found that the point-/transformer-based network SnowflakeNet [16] is particularly well-suited for reconstructing incomplete foot scans. Based on this previous evaluation of the network against other architectures, SnowflakeNet was also chosen for this study. The objective of the present work is not to redesign the point cloud completion architecture, but rather to utilise a robust and previously evaluated network to isolate the effect of camera position and foot orientation on reconstruction accuracy. This design choice enables the acquisition geometry to be evaluated independently of architectural changes. Essentially, the network comprises three modules: the Feature Extractor, the Seed Generator, and the Point Generator. The Feature Extractor generates a shape code from the incomplete foot scan, encapsulating all essential global and local information of the point cloud. Using this information, the Seed Generator then produces a complete, though not very dense, point cloud. In the Point Generator, this sparse point cloud and the shape code are subsequently employed to iteratively densify the point cloud, achieving an optimal reconstruction of the complete foot. SnowflakeNet generates these dense point clouds via a coarse-to-fine decoding process consisting of multiple Snowflake Point Deconvolution (SPD) layers. Each SPD step splits a parent point into multiple child points, progressively increasing the density of the point cloud from the initial sparse point cloud through successively denser intermediate point clouds to the final output. Each SPD layer employs a pointwise splitting operation to predict displacement features for the duplicated parent points. These displacements are learned from per-point features using multilayer perceptrons (MLPs) and guided by skip transformers, which integrate spatial context from previous SPD layers. This enables the network to adapt splitting patterns to local geometry, thereby improving the reconstruction of smooth and sharp structures alike [16]. The architecture of SnowflakeNet is illustrated graphically in Figure 2.
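The splitting idea behind an SPD layer can be illustrated with a minimal NumPy sketch. It assumes only what the description above states: each parent point is duplicated and every copy is shifted by its own displacement vector. In SnowflakeNet the displacements are predicted by MLPs and skip transformers; here they are replaced by a hypothetical random array. The function name `split_points`, the point counts, and the up-sampling factor are illustrative and not taken from the original implementation.

```python
import numpy as np

def split_points(parents: np.ndarray, displacements: np.ndarray, up_factor: int) -> np.ndarray:
    """Duplicate each parent point up_factor times and shift every copy.

    parents:       (N, 3) parent coordinates
    displacements: (N * up_factor, 3) per-child offsets (learned in the real network)
    returns:       (N * up_factor, 3) child coordinates
    """
    children = np.repeat(parents, up_factor, axis=0)  # copies of one parent stay adjacent
    if displacements.shape != children.shape:
        raise ValueError("need one displacement vector per child point")
    return children + displacements

rng = np.random.default_rng(0)
coarse = rng.uniform(-0.5, 0.5, size=(256, 3))    # illustrative sparse seed cloud
offsets = rng.normal(scale=0.01, size=(512, 3))   # stand-in for predicted displacements
dense = split_points(coarse, offsets, up_factor=2)
```

Stacking several such steps, each with its own predicted displacements, gives the coarse-to-fine densification described above.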
3.2. Hyperparameters
The network was trained on an NVIDIA GeForce RTX 4090 using the same hyperparameter configuration for all experimental conditions to ensure comparability between the single-view, foot-rotation, and multi-view datasets. The majority of hyperparameters were adopted from the original SnowflakeNet implementation, as the objective of this study was not to optimise the architecture itself, but rather to evaluate the influence of the camera position. The exact hyperparameters used in this study are shown in Table 1. Only the batch size and the number of epochs were adjusted, to account for the available training data and computational resources.
The learning rate was set to 0.001, and the Adam optimizer was used for parameter optimization. The reconstruction loss was based on the L1 version of the Chamfer distance, which directly penalizes geometric discrepancies between the predicted and actual point clouds. Each model was trained for 400 epochs with a batch size of 20. The validation set was utilised to monitor the training process and verify convergence. The identical training protocol was implemented across all acquisition settings, thereby ensuring that any observed variations in reconstruction accuracy can be ascribed to the configuration of the input view rather than to disparities in optimization.
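As an illustration of the loss, the following NumPy sketch computes an L1-style Chamfer distance. It assumes one common convention: unsquared Euclidean nearest-neighbour distances, averaged per direction, with the two directions averaged. The exact normalisation used in the training code is not stated here, so the function `chamfer_l1` should be read as a hypothetical reference implementation rather than the loss used in the experiments.

```python
import numpy as np

def chamfer_l1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance with unsquared Euclidean nearest-neighbour distances."""
    # Dense pairwise distance matrix of shape (len(pred), len(gt)).
    # Memory grows with len(pred) * len(gt); acceptable for a few thousand points.
    diff = pred[:, None, :] - gt[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    pred_to_gt = dist.min(axis=1).mean()  # each predicted point to its nearest reference point
    gt_to_pred = dist.min(axis=0).mean()  # each reference point to its nearest predicted point
    return float((pred_to_gt + gt_to_pred) / 2.0)

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]])
loss = chamfer_l1(a, b)
```

For actual training, a differentiable GPU implementation would be needed; this version only shows what the quantity measures.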
3.3. Foot Data
The dataset for this study is provided by corpus.e AG, located in Stuttgart, Germany, and consists of 1000 high-resolution foot scans (left and right feet) from people around the world, evenly split between male and female subjects. Detailed metadata on age, body height, body weight, and ethnicity were not available owing to data protection restrictions. Each file contains a high-resolution scan of a foot. Figure 3 presents six sample meshes selected from the dataset. The original data format, .stl, was converted into a point cloud format, with each full point cloud standardized to 2048 points. Additionally, all point clouds were normalized to the range [−0.5, 0.5] for training the network. To generate profile views, the Open3D hidden point removal method was applied to the full point clouds, resulting in partial point clouds with 1024 points that simulate a 3D camera capturing the foot from a specific angle. The dataset was then split into training (696 scans), validation (157 scans), and test (153 scans) sets. The split was performed per subject: where scans of both feet were available for a single subject, both scans were assigned to the same subset, so that no subject is present in more than one subset. This subject-independent splitting strategy prevents information leakage between the training, validation, and test sets and avoids performance metrics being artificially inflated by similar foot geometries of the same person appearing in both the training and test data.
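The normalization and resampling steps can be sketched as follows. The sketch rests on two assumptions that are not specified above: the cloud is scaled uniformly by its longest bounding-box side, which preserves the foot's proportions, and the target point count is reached by random sampling. The function names and the synthetic input are hypothetical.

```python
import numpy as np

def normalize_to_unit_box(points: np.ndarray) -> np.ndarray:
    """Centre the bounding box at the origin and scale its longest side to 1."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    centre = (lo + hi) / 2.0
    scale = (hi - lo).max()  # single scale factor for all axes (assumption)
    return (points - centre) / scale

def resample(points: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Randomly pick exactly n points; sample with replacement only if too few exist."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]

# Synthetic stand-in for a scan, in millimetres (not real foot data).
raw = np.random.default_rng(1).uniform([0, 0, 0], [260, 100, 80], size=(5000, 3))
full = resample(normalize_to_unit_box(raw), 2048)
```

Because a single scale factor is used, millimetre measurements can be recovered from a normalized cloud by multiplying with the stored `scale` of the corresponding scan.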
3.4. Training Datasets
To determine the optimal camera and foot position for capturing measurement data, we created various training datasets. These datasets are not intended as architectural variations of the completion method, but rather as controlled capture scenarios that allow us to quantify the influence of viewing direction, foot rotation, and camera displacement. The arrangement and perspective of the camera on the foot in each data set is illustrated in Figure 4. The construction of these controlled datasets forms the experimental basis for identifying acquisition conditions that are most suitable for the completion of single-view foot scans.
3.4.1. Single-View
This approach relies on the idea that the best place to start when reconstructing the missing side of the foot is with an image of the foot taken from a central position that shows the center of the foot. Given the variability in foot size, the distance between the camera and the foot was selected to be variable. This ensured that the foot would always be fully visible in the image for every size. The average distance was found to be between 30 and 60 cm, which was deemed a realistic distance for capturing a foot scan. The camera’s focus remains centered on the center of gravity of the foot. In the provided example, the area in question is located between the metatarsal region and the midfoot. Figure 4a illustrates the positioning of the camera in relation to the foot.
3.4.2. Single-View (Foot Rotation)
A second data set was created to determine how the foot should be rotated with respect to the camera in order to record the relevant information and obtain the best possible reconstruction. The camera position remained unchanged from that described in Section 3.4.1; however, the orientation of the foot towards the camera was altered. The foot was rotated by up to 45° in both the clockwise and the counterclockwise direction, with samples taken at five-degree intervals from 0 to ±45 degrees. The heel region was more visible in the clockwise rotation, whereas the toe region was more visible in the counterclockwise rotation. Figure 4b illustrates the alignment of the foot relative to the camera.
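The rotation sweep can be sketched as below. The sketch assumes that the foot is rotated about the vertical axis through its centroid and that z is the vertical axis. Neither the axis convention nor the sign convention is specified above, so both are hypothetical, as are the function names.

```python
import numpy as np

def rotation_about_z(angle_deg: float) -> np.ndarray:
    """3x3 rotation matrix for a counterclockwise rotation about the z axis."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def rotate_foot(points: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an (N, 3) cloud about the vertical axis through its centroid."""
    centroid = points.mean(axis=0)
    return (points - centroid) @ rotation_about_z(angle_deg).T + centroid

# Five-degree steps from -45 to +45 degrees, including the unrotated pose.
angles = list(range(-45, 50, 5))

demo = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
rotated = rotate_foot(demo, 90.0)
```

Each rotated cloud would then be passed through the same visibility step as the unrotated one, with the camera left in place.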
3.4.3. Multi-View
The third data set is used to ascertain whether the camera position in relation to the foot has an impact on reconstruction quality. For this purpose, a grid of cameras was constructed around the camera position defined in Section 3.4.1 (camera position 12 in Figure 4c). The grid was created on a flat plane at a distance appropriate to the size of the foot. In total, there are 25 different camera positions. The outermost camera positions lie at a distance of 20 cm from the central camera position from Section 3.4.1. In all positions, the cameras are oriented towards the center of the foot. This alignment of the cameras to the foot is illustrated in Figure 4c. In the network reconstruction procedure, each position was used separately to determine the best camera location.
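A possible layout of such a grid is sketched below in plain Python. Several points are assumptions: that the 25 positions form a regular 5 × 5 grid, that the 20 cm refers to the offset along each grid axis (corner positions would then be farther from the centre), and that indices increase first from the lowest to the highest row and then column by column. The index ordering is inferred only from the fact that position 12 is the centre; the function name and coordinate convention are hypothetical.

```python
def camera_grid(half_extent_cm: float = 20.0, n: int = 5) -> dict:
    """Return {index: (along_foot_cm, vertical_cm)} offsets from the central camera."""
    step = 2.0 * half_extent_cm / (n - 1)
    grid = {}
    for col in range(n):        # assumed: columns run along the foot
        for row in range(n):    # assumed: rows run from lowest (0) to highest (n - 1)
            index = col * n + row
            grid[index] = (col * step - half_extent_cm, row * step - half_extent_cm)
    return grid

positions = camera_grid()
centre = positions[12]
```

Each offset would be added to the central camera position, and the camera re-aimed at the foot centre before the partial view is generated.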
3.5. Algorithmic Procedure
The reconstruction and evaluation procedure was identical for all acquisition conditions. The only difference between the experiments was the method of generating the partial input point clouds. For each full-foot point cloud, one or more virtual camera configurations were defined according to the respective dataset. The visible part of the point cloud was then extracted by removing occluded points and used as incomplete input for the network. The corresponding complete point cloud served as the reconstruction ground truth.
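The extraction of the visible part can be illustrated with a simple depth-buffer approximation in NumPy. This is deliberately a stand-in and not the hidden point removal method from Open3D mentioned in Section 3.3. It assumes parallel viewing rays (an orthographic camera), bins the points into image cells, and keeps only the points closest to the camera in each cell. The function name, cell size, and tolerance are hypothetical.

```python
import numpy as np

def visible_mask(points: np.ndarray, view_dir, cell: float = 0.02, tol: float = 0.01) -> np.ndarray:
    """Boolean mask of points that are nearest to the camera within their image cell.

    The camera is assumed to look along view_dir with parallel rays (orthographic).
    """
    d = np.asarray(view_dir, dtype=float)
    d = d / np.linalg.norm(d)
    # Build two axes (u, v) spanning the image plane perpendicular to d.
    helper = np.array([1.0, 0.0, 0.0]) if abs(d[2]) > 0.9 else np.array([0.0, 0.0, 1.0])
    u = np.cross(d, helper)
    u /= np.linalg.norm(u)
    v = np.cross(d, u)
    depth = points @ d  # smaller value = closer to the camera
    cells = np.floor(np.stack([points @ u, points @ v], axis=1) / cell).astype(np.int64)
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.ravel()
    nearest = np.full(cell_id.max() + 1, np.inf)
    np.minimum.at(nearest, cell_id, depth)  # per-cell minimum depth
    return depth <= nearest[cell_id] + tol

cloud = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
mask = visible_mask(cloud, view_dir=(0.0, 0.0, 1.0))
partial = cloud[mask]  # incomplete network input; `cloud` stays the ground truth
```

In the toy example, the second point lies directly behind the first and is removed, while the third point sits in a different image cell and stays visible.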
3.6. Training Process
For each dataset, a distinct SnowflakeNet model was trained using paired partial and complete point clouds. The partial point cloud was used as the input to the network, which predicted a complete point cloud from it. During training, the predicted point cloud was compared to the reference point cloud using the Chamfer Distance L1 loss, and the network parameters were updated using the Adam optimiser. The same loss was evaluated on the validation set to monitor convergence.
The division of the data into training, validation, and test sets was preserved in all experiments. The training dataset was used for parameter optimisation, the validation dataset for monitoring convergence, and the test dataset exclusively for the quantitative evaluation. This ensures the results reflect the model’s generalisation ability to unknown foot scans, not its performance on the training data.
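A subject-independent split of the kind described in Section 3.3 can be sketched in plain Python. The sketch assumes that every scan carries a subject identifier. The split fractions, the function name `split_by_subject`, and the identifiers are hypothetical and only roughly mirror the proportions reported above.

```python
import random
from collections import defaultdict

def split_by_subject(scans, fractions=(0.7, 0.15, 0.15), seed=0):
    """scans: iterable of (scan_id, subject_id). Returns train, val, test lists of scan ids."""
    by_subject = defaultdict(list)
    for scan_id, subject_id in scans:
        by_subject[subject_id].append(scan_id)
    subjects = sorted(by_subject)            # deterministic order before shuffling
    random.Random(seed).shuffle(subjects)
    n_train = round(fractions[0] * len(subjects))
    n_val = round(fractions[1] * len(subjects))
    groups = (subjects[:n_train],
              subjects[n_train:n_train + n_val],
              subjects[n_train + n_val:])
    # All scans of one subject land in the same subset.
    return [[s for subj in group for s in by_subject[subj]] for group in groups]

# Illustrative data: 100 subjects with a left and a right scan each.
scans = [(f"{subj}_{side}", subj) for subj in range(100) for side in ("L", "R")]
train, val, test = split_by_subject(scans)
```

Splitting on subjects rather than on scans is what keeps the left and right foot of the same person out of different subsets.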
It is also important to note that only individual camera views were used as input in all datasets. This is especially noteworthy in the context of the multi-view dataset. During the training process, each view was reconstructed individually, with no merging of multiple views.
4. Results
The reconstruction results from incomplete data are presented in the following subsections. The term "error" refers to the difference between the ground truth and the reconstructed point cloud produced by the respective model.
4.1. Qualitative Point Cloud Results
Figure 5 presents an example reconstruction of a foot point cloud for all three models. The best camera view for each model was chosen for the reconstruction in this example. The ground truth is shown in the first column, and the network’s input is shown in the second. Since each foot has a single camera location, the input for the single-view model is constant. In the rotation model, the camera’s field of view changes in relation to the foot as a result of the foot rotating. In the multi-view model, one of the 25 views per foot from various camera positions is shown here as an example. However, it should be noted that in the multi-view model not all 25 images are used for reconstruction at once; each view is reconstructed individually. The third column depicts the reconstruction outcome of each model. Visually, the reconstructions of all three models closely match the ground truth and appear suitable for deriving a shoe size for the foot.
4.2. Quantitative Results
4.2.1. Single-View Results
In the initial phase of the process, a single camera position was established. This is located at the center of the foot. The objective was to ascertain whether the reconstruction accuracy reached the requisite 3.33 mm to predict half a shoe size in accordance with the European system. As illustrated in Figure 6a, the mean error in foot length estimation was 1.82 mm, which is considerably below the threshold of 3.33 mm. Furthermore, the width of the foot, which is not visible in the scans, can be reconstructed with an average error of 2.43 mm. We also evaluated instep height and ball girth. Since there is no universal method for identifying anatomical landmarks for either measurement, we followed the approach proposed by Jurca et al. [17], in which measurements are taken relative to the total length of the foot. We determined the instep height as the maximum value within the range of 40% to 65% of total foot length, measured from the heel. Ball girth was determined as the maximum value within the range of 70% to 88% of foot length, also measured from the heel. Additionally, a mesh had to be created from the point cloud to determine the ball girth. The mesh was intersected with a plane at the defined measurement point, and the length of the resulting cross-sectional line was calculated. Figure 6b shows the results of the instep and girth evaluations. The instep height had an average error of 3.01 mm, while the ball girth had an error of 11.10 mm. The high standard deviations in both measurements are particularly noticeable. This is the result of imprecise measurements, especially for the ball girth. When the reconstruction and the ground truth are not the same length and the measurement point is calculated based on the length, large deviations can occur, even when the overall geometries are very similar.
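The length, width, and instep measurements can be sketched as follows. The 3.33 mm threshold follows from the European size step of one Paris point (2/3 cm, about 6.67 mm), of which half is about 3.33 mm. The sketch assumes a point cloud in millimetres that is already aligned so that x runs from heel to toe, y across the foot, and z upwards. The alignment step, the function names, and the toy data are hypothetical.

```python
import numpy as np

PARIS_POINT_MM = 20.0 / 3.0           # one EU size step (2/3 cm)
HALF_SIZE_MM = PARIS_POINT_MM / 2.0   # about 3.33 mm

def foot_measurements(points_mm: np.ndarray) -> dict:
    """Length, width and instep height of an axis-aligned foot cloud (x: heel to toe)."""
    x = points_mm[:, 0] - points_mm[:, 0].min()       # distance from the heel
    length = x.max()
    width = points_mm[:, 1].max() - points_mm[:, 1].min()
    height = points_mm[:, 2] - points_mm[:, 2].min()  # height above the lowest point
    # Instep zone: 40% to 65% of the foot length; must contain at least one point.
    instep_zone = (x >= 0.40 * length) & (x <= 0.65 * length)
    return {"length": float(length),
            "width": float(width),
            "instep_height": float(height[instep_zone].max())}

def within_half_size(pred_mm: float, true_mm: float) -> bool:
    """True if the absolute error stays below half a European shoe size."""
    return abs(pred_mm - true_mm) < HALF_SIZE_MM

toy = np.array([[0.0, 0.0, 0.0], [250.0, 0.0, 0.0], [125.0, 50.0, 60.0],
                [125.0, -50.0, 10.0], [200.0, 0.0, 70.0]])
m = foot_measurements(toy)
```

Because the instep zone is defined relative to the measured length, a length error in the reconstruction shifts the zone itself, which is one reason the local measurements vary more than length and width.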
4.2.2. Single-View (Foot Rotation) Results
The second model is used to determine which foot rotation relative to the camera allows for the best reconstruction. At 0 degrees, the foot is parallel to the camera. As the angle becomes more negative, the heel moves closer to the camera; conversely, as the angle becomes more positive, the forefoot is rotated towards the camera. Figure 7a illustrates the length errors at varying angles to the camera. It can be observed that the error increases as the heel comes into focus. Consequently, the forefoot provides a better basis for reconstruction with minimal error. A similar outcome can be observed for the width error, as illustrated in Figure 7b: prioritizing the forefoot during recording improves the reconstruction of the foot and its dimensions. Figure 7c shows the error in the instep height measurements. Again, the reconstruction results of the network deteriorate the more the heel of the foot is rotated into the image. Figure 7d, which shows the evaluation of ball girth, reflects the same result.
4.2.3. Multi-View Results
Figure 8a illustrates the discrepancy between the foot length of the point cloud reconstructed by the multi-view model and that of the ground truth. The colors of the bars are consistent with those used in Figure 4c and indicate the positions of the camera in the grid. The reconstruction outcomes deteriorate as the camera is positioned closer to the heel. This suggests that a view showing more of the forefoot carries more information for the reconstruction than a view focused on the heel. The results for the foot’s width are shown in Figure 8b. The camera views that focus on the sole of the foot, such as views 0, 5, 10, 15 and 20, show a particularly low error. However, this view is difficult for a single person to capture. Furthermore, images captured from a neutral position, as in positions 2, 7, 12, 17 and 22, are less suited to an exact reconstruction of the foot width. This is likely because less information about the shape of the foot is recorded from this position than from a steeper angle below or above. Although the views from above yield less accurate results than those from below, they allow a considerably more comfortable recording process without external assistance, and they still yield data that are accurate enough to predict shoe size.
As shown in Figure 8c, camera positions further towards the forefoot lead to a more accurate reconstruction of the instep height. Furthermore, the graph for ball girth, displayed in Figure 8d, shows that images from the second row from the bottom (positions 1, 6, and 11) yield favorable outcomes. The best result is achieved from position 20. The worst results are from the positions at the very top.
4.3. Real ToF Results
To evaluate the extent of generalization achieved by the network trained on synthetically generated partial views, a 3D-printed foot derived from the test dataset was recorded using a real time-of-flight (ToF) camera. The corresponding complete 3D model was not used during training and served as the ground-truth reference for this experiment. A 3D-printed foot was selected because it provides a stable and repeatable object geometry and enables direct comparison between the captured real ToF point clouds, the reconstructed point clouds, and the known reference geometry. In this experiment, the camera, a Microsoft Azure Kinect, was mounted in a configuration analogous to the multi-view setup depicted in Figure 4c. A point cloud was captured from each position. The Azure Kinect was chosen instead of a smartphone because it provides direct access to the depth data and allows more controlled acquisition settings. This reduces the influence of proprietary smartphone post-processing and filtering, which are often not fully accessible to the user. Although the Azure Kinect and smartphone-based ToF systems differ in hardware and internal processing, both rely on active depth sensing and therefore provide a suitable first step for assessing the transfer from simulated partial point clouds to real ToF measurements. The captured point clouds were then reconstructed using the weights obtained from training on the simulated multi-view dataset. A total of 100 recordings were collected from each of the 25 positions within the multi-view setup. This process yielded a total of 2500 partial point clouds, derived from these 25 defined perspectives.
Figure 9a,b present two exemplar partial input point clouds captured by the ToF camera, along with the ground truth and the complete point clouds generated by the network. Visually, the results are close to the ground truth, indicating that the network has generalized well and can handle the domain transfer from simulated to real time-of-flight data.
Figure 10 displays the quantitative results of the ToF dataset reconstructions. Figure 10a shows the length estimation errors for the example foot from all camera positions. On average, an error of 4.29 mm is achieved. The more central positions appear to yield better results. Figure 10b shows the errors in foot width, with an average error of 4.30 mm. However, no discernible pattern emerges for the different positions. Figure 10c shows that positions taken from a relatively parallel camera height yield a smaller error in measuring the instep height. On average, the error is 2.35 mm. Figure 10d shows an average error of 23.16 mm in measuring the ball girth. However, since this value is the sum of errors from the camera, the meshing process, determining the measurement point, and reconstruction, it should be interpreted with caution.
4.4. Overall Geometry Results
This study focused on foot length and width. However, the best fit is also greatly influenced by the precision of the overall foot geometry. We therefore also calculated the chamfer distance (CD) [18] for each model to quantify how close the reconstructed foot geometry is to the ground truth. CD is a commonly used metric for the similarity of two point clouds and has been used in many other works, such as [19,20,21]. It is calculated by summing the squared distances between each point and its nearest neighbor in the other point cloud, in both directions.
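In one common unnormalized form, the sum of squared nearest-neighbor distances can be written as shown below. The symbols $S_1$ and $S_2$ for the two point clouds are notation assumed here, and whether the sums are additionally divided by the number of points is not stated in the text.

```latex
d_{\mathrm{CD}}(S_1, S_2) =
  \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2
  + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2
```

The first term measures how well every point of $S_1$ is covered by $S_2$, and the second term measures the reverse direction.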
There is no unit for the chamfer distance; it is a relative value. However, generally speaking, the better the outcome and the more comparable the point clouds are, the lower the CD.
Table 2 shows the results of the different models. The model trained using the multi-view dataset yields the best results. These results support the central thesis of this study: when reconstructing a foot based on a single view, the partial input geometry, and thus the position from which the recording was taken, has a measurable impact on both measurement accuracy and the reconstruction of the overall shape. The primary methodological value of the study therefore lies in quantifying which imaging conditions are best suited for measurement-oriented foot reconstruction.
5. Discussion
This study primarily relies on absolute errors in foot length and width because these quantities are directly relevant to shoe size recommendations and can be measured in millimeters. Therefore, they provide an intuitive assessment of whether the reconstructed foot geometry is accurate enough for the intended application. However, additional anatomical measurements, such as instep height and ball girth, were also considered because they are useful for designing and fitting footwear. These measures provide more detailed information about the reconstructed foot shape and may be useful for footwear experts.
Nevertheless, these additional measurements should be interpreted with caution. Unlike foot length and width, which can be derived from the global extremes of the reconstructed geometry, measurements such as instep height or girth depend on the localization of the anatomical measurement regions. In our implementation, these regions are defined relative to the overall foot length. Consequently, small deviations in reconstructed length or local geometry can shift the derived measurement positions and affect the results. This can result in larger errors, even when the reconstructed overall shape closely resembles the ground truth visually and geometrically. Therefore, these local errors may reflect uncertainty in the measurement procedure rather than deficiencies in the reconstruction itself.
An additional source of uncertainty applies specifically to the measurement of the ball girth. To determine the ball girth, a mesh was created from the reconstructed point cloud. The mesh was intersected with a plane at the defined measurement position, and the length of the resulting cross-sectional contour was calculated. This procedure introduces processing steps that are not part of point-cloud reconstruction. Meshing, hole filling, and smoothing, in particular, can locally alter the reconstructed surface and therefore influence the resulting girth value. Consequently, deviations in the ball girth measurement may originate from the mesh reconstruction and post-processing algorithm rather than the neural network reconstruction. Therefore, the ball girth error should not be interpreted as a pure point-cloud completion error, but rather as the combined effect of point-cloud reconstruction, mesh generation, and subsequent geometric measurement.
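To make the dependence on the measurement procedure concrete, the following sketch estimates a girth directly from the point cloud without meshing. It is a swapped-in simplification and not the mesh-plane intersection described above: all points within a thin slab around the measurement plane are projected onto that plane, and the perimeter of their convex hull is returned, which resembles a tape measure pulled tight around the foot. The axis convention (x from heel to toe), the slab thickness, and the function names are hypothetical.

```python
import numpy as np

def convex_hull_2d(pts) -> list:
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(map(tuple, pts)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def slice_girth(points: np.ndarray, x_plane: float, half_thickness: float = 1.0) -> float:
    """Perimeter of the convex hull of all points within half_thickness of x = x_plane."""
    near = points[np.abs(points[:, 0] - x_plane) <= half_thickness]
    hull = np.array(convex_hull_2d(near[:, 1:3]))   # project the slab onto the y-z plane
    if len(hull) < 3:
        return 0.0
    edges = np.roll(hull, -1, axis=0) - hull        # vectors between consecutive vertices
    return float(np.linalg.norm(edges, axis=1).sum())

square = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0], [0.0, 0.5, 0.5], [10.0, 5.0, 5.0]])
girth = slice_girth(square, x_plane=0.0)
```

A convex hull ignores concave regions such as the arch, so its values differ systematically from a contour-following mesh intersection. This illustrates how strongly a girth value depends on the chosen procedure.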
For this reason, the Chamfer Distance is considered the more general metric for evaluating overall reconstruction quality. Unlike individual measurements, the Chamfer Distance evaluates the geometric similarity between the reconstructed and ground truth point clouds. This makes it less dependent on predefined anatomical measurement locations and better suited for comparing the performance of different reconstruction models overall. Nevertheless, the anthropometric measurements are retained in the evaluation because they provide information relevant to footwear specialists and allow the reconstruction results to be interpreted in terms of meaningful foot dimensions.
A similar limitation applies to evaluations using real ToF data. For this study, a 3D-printed foot was captured using a real ToF camera to demonstrate the feasibility of using a model trained on simulated partial views to process real sensor data. However, this experiment does not represent population-level validation and cannot capture the full variability of human foot shapes, skin and material properties, recording conditions, or sensor-specific effects. Additionally, ToF cameras introduce technology dependent measurement errors, including depth noise, edge artifacts and multipath effects. Although the captured point clouds were preprocessed to reduce these artifacts, they could not be completely eliminated. Consequently, deviations observed in the real ToF experiment should not be interpreted as solely reconstruction errors of the neural network. Rather, they reflect the combined influence of sensor related depth errors, preprocessing, and the subsequent point-cloud completion process.
6. Conclusions
The findings of this study provide practical guidance for the development of applications that enable users to measure their own feet at home. Utilising SnowflakeNet as an established point cloud completion backbone, a systematic evaluation was conducted to ascertain the impact of single-view acquisition geometry on reconstruction and measurement accuracy. The results demonstrate that accurate foot reconstruction is feasible from a single partial scan across different camera positions. In particular, the multi-view training setup achieved an average length error of 1.31 mm and an average width error of 1.64 mm, both below the required threshold of 3.33 mm corresponding to half a shoe size in the European sizing system. By understanding the best foot alignment and camera position, developers can more efficiently design a user-friendly capture scenario that yields high-quality measurements from just one shot. Additionally, an image showing more of the forefoot proved to be ideal because the forefoot contains more information for reconstruction than the heel area. The optimal position for foot scans is from below, capturing more of the bottom of the foot. However, since this position is difficult for a single person to reach, capturing from an elevated position is a viable alternative that still leads to precise results.
A current limitation, however, is that domain transfer to other hardware platforms still requires further research. Neural networks are known to perform worse on data that differ from the patterns seen during training. Future studies will investigate how well the method generalizes when given a larger amount of input data from real ToF camera systems and to what extent the model may need to be adapted.
Author Contributions
The conceptualization and methodology were discussed and agreed upon by M.J., J.E. and D.W.C. The original draft was written by M.J. The validation and evaluation of the results were discussed and agreed upon by all authors. Review and editing were carried out by J.E., D.W.C. and M.J. The project was conducted under the supervision of J.E. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Federal Ministry for Economic Affairs and Energy under grant number KK5105804BA4.
Data Availability Statement
The data presented in this study are not publicly available because they are the property of corpus.e AG. Due to company restrictions, the authors are not permitted to share the underlying data.
Acknowledgments
The authors would like to thank corpus.e AG for providing the data.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Asdecker, B. Statistiken Retouren Deutschland-Definition, 2026. Retourenforschung.de. 2026. Available online: http://www.retourenforschung.de/definition_statistiken-retouren-deutschland.html (accessed on 25 February 2026).
- Brandt, M. Infografik: Bekleidung und Schuhe werden am häufigsten retourniert | Statista. 2025. Available online: https://de.statista.com/infografik/23972/befragte-die-online-bestellte-artikel-zurueckgeschickt-haben/ (accessed on 25 February 2026).
- Asdecker, B. CO2-Bilanz einer Retoure-Definition, 2026. Retourenforschung.de. 2026. Available online: http://www.retourenforschung.de/definition_co2-bilanz-einer-retoure.html (accessed on 25 February 2026).
- Wang, J.; Saito, H.; Kimura, M.; Mochimaru, M.; Kanade, T. Shape reconstruction of human foot from multi-camera images based on PCA of human shape database. In Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling (3DIM’05); IEEE: Piscataway, NJ, USA, 2005; pp. 424–431.
- Luximon, A.; Goonetilleke, R.S.; Zhang, M. 3D foot shape generation from 2D information. Ergonomics 2005, 48, 625–641.
- Rafiq, R.B.; Hoque, K.M.; Kabir, M.A.; Ahmed, S.; Laird, C. Optifit: Computer-vision-based smartphone application to measure the foot from images and 3d scans. Sensors 2022, 22, 9554.
- Jäger, M.; Eberhardt, J.; Cunningham, D.W. 3D reconstruction of partial foot scans using different state of the art neural network approaches. Footwear Sci. 2024, 16, 105–114.
- Hu, K.; Zhong, Y.; Wu, G. Reconstruction of 3D foot model from video captured using smartphone camera. J. Fiber Bioeng. Inform. 2015, 8, 493–500.
- Boyne, O.; Bae, G.; Charles, J.; Cipolla, R. Found: Foot optimization with uncertain normals for surface deformation using synthetic data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2024; pp. 8097–8106.
- Wang, H.; Liu, F.; Fan, R. A research on foot size measurement algorithm based on image. J. Phys. Conf. Ser. 2021, 1903, 012004.
- Xiong, S.; Li, Y.; Zhu, Y.; Qian, J.; Yang, D. Foot measurements from 2D digital images. In Proceedings of the 2010 IEEE International Conference on Industrial Engineering and Engineering Management; IEEE: Piscataway, NJ, USA, 2010; pp. 497–501.
- Hong, R.; Li, J. Robust 3-D Reconstruction and Parameter Measurement of the Foot Using Multiple Depth Cameras. IEEE Trans. Instrum. Meas. 2023, 72, 5021012.
- Chen, Y.S.; Chen, Y.C.; Kao, P.Y.; Shih, S.W.; Hung, Y.P. Estimation of 3-D Foot Parameters Using Hand-Held RGB-D Camera. In Proceedings of the Computer Vision—ACCV 2014 Workshops; Jawahar, C.V., Shan, S., Eds.; Springer: Cham, Switzerland, 2015; pp. 407–418.
- Fogarty, K.; Yang, J.; Patodi, C.K.; Bhanti, A.; Chacko, S.; Oztireli, C.; Bonde, U. Best Foot Forward: Robust Foot Reconstruction in-the-wild. arXiv 2025, arXiv:2502.20511.
- Nourbakhsh Kaashki, N.; Dai, X.; Gyarmathy, T.; Hu, P.; Iancu, B.; Munteanu, A. A deep learning approach to automatically extract 3d hand measurements. In Proceedings of the 2022 7th International Conference on Machine Learning Technologies; IEEE: Piscataway, NJ, USA, 2022; pp. 141–146.
- Xiang, P.; Wen, X.; Liu, Y.S.; Cao, Y.P.; Wan, P.; Zheng, W.; Han, Z. SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 5479–5489.
- Jurca, A.; Žabkar, J.; Džeroski, S. Analysis of 1.2 million foot scans from North America, Europe and Asia. Sci. Rep. 2019, 9, 19155.
- Fan, H.; Su, H.; Guibas, L.J. A Point Set Generation Network for 3D Object Reconstruction From a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Xie, H.; Yao, H.; Zhou, S.; Mao, J.; Zhang, S.; Sun, W. Grnet: Gridding residual network for dense point cloud completion. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 365–381.
- Huang, Z.; Yu, Y.; Xu, J.; Ni, F.; Le, X. PF-Net: Point Fractal Network for 3D Point Cloud Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Tchapmi, L.P.; Kosaraju, V.; Rezatofighi, H.; Reid, I.; Savarese, S. TopNet: Structural Point Cloud Decoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
Figure 1.
Methodology used for the reconstruction and for measuring the reconstruction results.
Figure 2.
Network architecture of SnowflakeNet for foot shape completion [16].
Figure 3.
Examples of meshes from the dataset provided by corpus.e.
Figure 4.
Overview of camera configurations for different training datasets. (a) Camera position for the single-view training dataset. (b) Camera position for the single-view with foot rotation training dataset. (c) Camera positions for the multi-view training dataset.
Figure 5.
Summary of point cloud results. (a) Example point cloud result from the single-view dataset. (b) Example point cloud result from the rotation dataset. (c) Example point cloud result from the multi-view dataset.
Figure 6.
Quantitative width, length, instep, and girth results for the single-view dataset. The colored bars represent the mean values of the respective positions, with the standard deviations indicated in black. The estimates are derived from a dataset comprising 153 data points. (a) Length and width results from the single-view model. (b) Instep and ball girth results from the single-view model.
Figure 7.
Quantitative width, length, instep, and girth results for the rotation-view dataset. The colored bars represent the mean values of the respective positions, with the standard deviations indicated in black. The estimates are derived from a dataset comprising 153 data points. (a) Length results from the rotation dataset. (b) Width results from the rotation dataset. (c) Instep results from the rotation dataset. (d) Ball girth results from the rotation dataset.
Figure 8.
Quantitative width, length, instep, and girth results for the multi-view dataset. The colored bars represent the mean values of the respective positions, with the standard deviations indicated in black. The estimates are derived from a dataset comprising 153 data points. (a) Length results from the multi-view dataset. (b) Width results from the multi-view dataset. (c) Instep results from the multi-view dataset. (d) Ball girth results from the multi-view dataset.
Figure 9.
Examples of actual ToF recordings and their reconstruction results. (a) Example point cloud result from the ToF dataset captured from camera position 0. (b) Example point cloud result from the ToF dataset captured from camera position 12.
Figure 10.
Quantitative width, length, instep, and girth results for the ToF dataset. The colored bars represent the mean values of the respective positions, with the standard deviations indicated in black. (a) Length results from the ToF dataset. (b) Width results from the ToF dataset. (c) Instep results from the ToF dataset. (d) Ball girth results from the ToF dataset.
Table 1.
Hyperparameters for SnowflakeNet.
| Parameter | SnowflakeNet |
|---|---|
| Learning Rate | 0.001 |
| Epochs | 400 |
| Batch Size | 20 |
| Loss Function | CD L1 |
| Optimizer | Adam |
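The loss function "CD L1" listed in Table 1 refers to an L1 variant of the Chamfer Distance between a predicted and a reference point cloud. The following is a minimal NumPy sketch of one common convention, in which the mean nearest-neighbor Euclidean distances in both directions are averaged. The function name is hypothetical, and averaging and scaling conventions differ between codebases, so this should not be read as the exact loss implementation used for training in this study.

```python
import numpy as np


def chamfer_distance_l1(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric L1 Chamfer Distance between point clouds a (N, 3) and b (M, 3).

    Assumed convention: average of the two directed mean
    nearest-neighbor Euclidean distances.
    """
    # Pairwise Euclidean distances, shape (N, M), computed via broadcasting.
    diff = a[:, None, :] - b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    a_to_b = dist.min(axis=1).mean()  # each point in a to its nearest point in b
    b_to_a = dist.min(axis=0).mean()  # each point in b to its nearest point in a
    return float(0.5 * (a_to_b + b_to_a))


# Toy example: a two-point cloud compared with a single point at the origin.
prediction = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 0.0]])
print(chamfer_distance_l1(prediction, reference))
```

The dense pairwise distance matrix keeps the sketch short but scales quadratically in memory, so larger point clouds would typically call for a spatial index or a GPU implementation.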
Table 2.
Quantitative overall shape results for all datasets.
| Metric | Single View | Rotation View | Multi View | ToF Multi |
|---|---|---|---|---|
| Chamfer Distance ↓ | 5.024 | 6.395 | 3.860 | 8.785 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.