Figure 1.
The overall architecture of the proposed model. The main components consist of input data, feature extraction, cost volume construction, cost aggregation, and disparity prediction.
Figure 2.
The 4D cost volume generation method. The left feature is shifted from −d_max to +d_max. In this process, the cost volume is constructed to cover the disparity range [−d_max, d_max].
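The construction in Figure 2 can be made concrete with a short sketch. Below is a minimal PyTorch version of a concatenation-style 4D cost volume over a symmetric disparity range; the function name, the zero-filling of out-of-range pixels, and the shift convention are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def build_cost_volume(left_feat: torch.Tensor, right_feat: torch.Tensor,
                      d_max: int) -> torch.Tensor:
    """Concatenation-based 4D cost volume over disparities [-d_max, d_max].

    left_feat, right_feat: (B, C, H, W) feature maps from the two views.
    Returns a (B, 2C, 2*d_max + 1, H, W) volume; pixels shifted out of
    range are zero-filled.
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, 2 * d_max + 1, H, W)
    for i, d in enumerate(range(-d_max, d_max + 1)):
        shifted = torch.zeros_like(left_feat)
        if d > 0:
            shifted[..., d:] = left_feat[..., :-d]   # shift content right by d
        elif d < 0:
            shifted[..., :d] = left_feat[..., -d:]   # shift content left by |d|
        else:
            shifted = left_feat
        volume[:, :C, i] = shifted      # shifted left feature
        volume[:, C:, i] = right_feat   # right feature, unshifted
    return volume
```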
Figure 3.
The detailed block structure of the uncertainty attention module comprises uncertainty estimation, weight generation, attention, and cost volume fusion.
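For the uncertainty estimation step in Figure 3, a standard soft-argmin formulation gives one plausible reading (a sketch; the module's exact definition may differ). With P_d(p) the softmax probability of candidate disparity d at pixel p:

```latex
% Soft-argmin disparity estimate and variance-based uncertainty
% (standard formulation; the module's exact definition may differ).
\hat{d}(p) = \sum_{d=d_{\min}}^{d_{\max}} d \, P_d(p), \qquad
U(p) = \sum_{d=d_{\min}}^{d_{\max}} \bigl(d - \hat{d}(p)\bigr)^{2} \, P_d(p)
```

Under this reading, the uncertainty map U is the variance of the per-pixel disparity distribution around the soft-argmin estimate.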
Figure 4.
Qualitative evaluation of disparity maps generated by various methods. From top to bottom: reference image, DenseMapNet [34], StereoNet [35], PSMNet [13], Gwc-Net [24], DSM-Net [14], HMSM-Net [15], UGC-Net (ours), and ground truth. Image numbers from left to right are OMA-132-002-034, OMA-212-008-030, OMA-225-028-026, OMA-331-024-026, and OMA-315-001-025.
Figure 5.
Qualitative evaluation of disparity maps generated by various methods on the WHU-Stereo dataset. From top to bottom: reference image, StereoNet, PSMNet, HMSM-Net, UGC-Net (ours), and ground truth. Image numbers from left to right are HY_left_0, KM_left_81, QC_left_376, QC_left_465, and YD_left_495.
Figure 6.
Qualitative evaluation of the disparity maps for texture-less areas. From top to bottom: reference image, Gwc-Net, DSM-Net, HMSM-Net, UGC-Net (ours), and ground truth. From left to right, the image numbers are OMA-247-027-001, OMA-251-001-006, OMA-342-004-031, and OMA-383-005-025. Major differences in the disparity images of each method are annotated with red circles in the figures.
Figure 7.
Qualitative evaluation of the disparity maps for areas with discontinuities and occlusions. From top to bottom: reference image, Gwc-Net, DSM-Net, HMSM-Net, UGC-Net (ours), and ground truth. From left to right, the image numbers are OMA-212-008-006, OMA-225-027-021, OMA-281-006-027, and OMA-288-039-003. Major differences in the disparity images of each method are annotated with red circles in the figures.
Figure 8.
Qualitative evaluation of the disparity maps for repetitive pattern areas. From top to bottom: reference image, Gwc-Net, DSM-Net, HMSM-Net, UGC-Net (ours), and ground truth. From left to right, the image numbers are OMA-132-002-034, OMA-276-036-032, OMA-374-036-034, and OMA-391-025-019. Major differences in the disparity images of each method are annotated with red circles in the figures.
Figure 9.
Architectures corresponding to each ablation model; experiments were conducted by removing specific components from the proposed model. (a) UGC_ablation1, where only the geometry cost volume is aggregated to estimate disparity. (b) UGC_ablation2, which generates a context cost volume, multiplies it with the geometry cost volume to form an attention feature volume (AFV), and then performs cost aggregation. (c) UGC_ablation3, where only the cost volume fusion (CVF) module within the uncertainty attention module of the proposed method is used. (d) UGC_ablation4, where the weight generated from the uncertainty map is applied only to the geometry cost volume.
Figure 10.
Disparity and uncertainty maps for each image. From top to bottom: input image, estimated disparity map, uncertainty map, error map, and ground truth. Higher uncertainty values appear in challenging regions such as texture-less areas, occlusions, and repetitive patterns. Comparing the error and uncertainty maps shows that they are substantially similar.
Table 1.
The detailed architecture of the uncertainty attention module. The proposed module is divided into four parts.
| Name | Layer Properties | Input | Output Size |
|---|---|---|---|
| **Uncertainty estimation block** | | | |
| Geometry_3D_Conv | Reduces 4D cost volume to 3D cost volume | [Geometry cost volume] | |
| Softmax output | Estimates probability of cost volume | [Geometry_3D_Conv] | |
| Disparity output | Computes disparity | [Softmax output] | |
| Uncertainty map | Estimates variance uncertainty of cost volume | [Softmax output, Disparity output] | |
| **Weight generation block** | | | |
| Uncertainty_2D_Conv | Two-dimensional convolution for uncertainty map | [Uncertainty map] | |
| Weight | Expands disparity dimension and applies sigmoid | [Uncertainty_2D_Conv] | |
| **Attention block** | | | |
| Context_3D_Conv | Three-dimensional convolution for context cost volume | [Context cost volume] | |
| Attention context | Attention for context cost volume | [Context_3D_Conv, Weight] | |
| Attention geometry | Attention for geometry cost volume | [Geometry cost volume, Weight] | |
| **Fusion block** | | | |
| CVF module | Cost volume fusion module | [Attention geometry, Attention context] | |
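The block ordering in Table 1 can be sketched in PyTorch as follows. Since the table omits layer widths and output sizes, all kernel sizes, channel counts, and the fusion step (a simple addition standing in for the CVF module) are assumptions; only the ordering of the four blocks mirrors the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertaintyAttention(nn.Module):
    """Hedged sketch of the uncertainty attention module in Table 1."""

    def __init__(self, geo_channels, ctx_channels, d_min, d_max):
        super().__init__()
        self.d_min, self.d_max = d_min, d_max
        num_d = d_max - d_min + 1
        # Uncertainty estimation: collapse the feature dim of the 4D geometry volume.
        self.geo_conv = nn.Conv3d(geo_channels, 1, kernel_size=3, padding=1)
        # Weight generation: 2D conv on the single-channel uncertainty map,
        # producing one channel per candidate disparity.
        self.unc_conv = nn.Conv2d(1, num_d, kernel_size=3, padding=1)
        # Attention: 3D conv on the context cost volume.
        self.ctx_conv = nn.Conv3d(ctx_channels, geo_channels, kernel_size=3, padding=1)

    def forward(self, geo_volume, ctx_volume):
        # geo_volume: (B, Cg, D, H, W); ctx_volume: (B, Cc, D, H, W); D == num_d.
        cost = self.geo_conv(geo_volume).squeeze(1)        # (B, D, H, W)
        prob = F.softmax(cost, dim=1)                      # per-pixel disparity pmf
        disps = torch.arange(self.d_min, self.d_max + 1,
                             device=cost.device, dtype=cost.dtype).view(1, -1, 1, 1)
        disparity = (prob * disps).sum(dim=1, keepdim=True)            # soft-argmin
        uncertainty = (prob * (disps - disparity) ** 2).sum(dim=1, keepdim=True)
        # Sigmoid weight, expanded along the disparity dimension.
        weight = torch.sigmoid(self.unc_conv(uncertainty)).unsqueeze(1)  # (B, 1, D, H, W)
        att_geo = geo_volume * weight                      # attention on geometry volume
        att_ctx = self.ctx_conv(ctx_volume) * weight       # attention on context volume
        # Fusion: addition stands in for the CVF module in this sketch.
        return att_geo + att_ctx, disparity, uncertainty
```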
Table 2.
Description and usage of the US3D dataset. The Jacksonville dataset is used for training and validation, while the Omaha dataset is only used for generalization evaluation tests.
| Dataset | Mode | Image Size | City | Training/Validation/Testing | Usage |
|---|---|---|---|---|---|
| US3D | RGB | 1024 × 1024 | Jacksonville | 1500/139/500 | Training, validation, and testing |
| | | | Omaha | 0/0/2153 | Testing |
Table 3.
Information and usage of the WHU-Stereo dataset used in the experiment. The table lists each city’s training, validation, and test splits of stereo images.
| Dataset | Mode | Image Size | City | Training/Validation/Testing | Total |
|---|---|---|---|---|---|
| WHU-Stereo | Panchromatic | 1024 × 1024 | Shaoguan | 20/5/23 | 48 |
| | | | Kunming | 200/3/50 | 253 |
| | | | Yingde | 400/34/100 | 534 |
| | | | Qichun | 600/80/200 | 880 |
| | | | Wuhan | -/-/20 | 20 |
| | | | Hengyang | -/-/22 | 22 |
| | | | Total | 1220/122/415 | 1757 |
Table 4.
Quantitative comparison of various methods on the Omaha city data from the US3D dataset. The best results are highlighted in bold. Our network is reported at two disparity ranges: UGC-Net (ours, range [−64, 64]) evaluated over [−64, 64] and UGC-Net (ours, range [−96, 96]) evaluated over [−96, 96]. Lower is better for both EPE (pixels) and D1 (%).
| Method | EPE (Pixel) | D1 (%) |
|---|---|---|
| DenseMapNet [34] | 2.0490 | 17.48 |
| StereoNet [35] | 1.6496 | 11.41 |
| PSMNet [13] | 1.5073 | 9.39 |
| GwcNet [24] | 1.4887 | 8.76 |
| DSM-Net [14] | 1.4757 | 8.73 |
| HMSM-Net [15] | 1.4138 | 7.82 |
| CMSP-Net [36] | 1.365 | **7.12** |
| UGC-Net (ours, range [−96, 96]) | 1.3686 | 7.26 |
| UGC-Net (ours, range [−64, 64]) | **1.3557** | 7.24 |
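For reference, the two metrics can be written as follows; this is the standard formulation, and the 3-pixel D1 threshold is an assumption based on common stereo-benchmark practice rather than a value stated here.

```latex
% EPE: mean absolute disparity error over N valid pixels.
% D1: percentage of pixels whose error exceeds 3 pixels (assumed threshold).
\mathrm{EPE} = \frac{1}{N} \sum_{p} \bigl| \hat{d}(p) - d^{\mathrm{gt}}(p) \bigr|, \qquad
\mathrm{D1} = \frac{100}{N} \sum_{p} \mathbb{1}\bigl[\, \bigl| \hat{d}(p) - d^{\mathrm{gt}}(p) \bigr| > 3 \,\bigr]
```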
Table 5.
Quantitative comparisons of different methods on different cities of the WHU-Stereo dataset. Lower is better for the metrics of EPE (pixel) and D1 (%). Bold: best.
| City | Shaoguan | | Kunming | | Yingde | | Qichun | | Wuhan | | Hengyang | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 |
| SGM [18] | 7.862 | 63.19 | 3.065 | 37.50 | 5.497 | 47.22 | 4.484 | 54.19 | 6.790 | 58.65 | 9.081 | 61.82 |
| StereoNet [35] | 2.660 | 24.58 | 1.225 | 6.09 | 1.821 | 14.69 | 2.801 | 34.63 | 4.836 | 47.83 | 4.375 | 43.50 |
| PSMNet [13] | 2.514 | 21.45 | 1.181 | 5.77 | 1.993 | 15.09 | 2.953 | 37.04 | 4.105 | 36.49 | 3.761 | 31.23 |
| HMSM-Net [15] | 2.091 | 16.94 | 1.017 | **4.44** | 1.436 | 9.86 | 1.596 | 13.65 | 3.905 | 35.66 | 2.914 | 23.43 |
| UGC-Net (Ours) | **1.985** | **16.20** | **0.999** | 4.61 | **1.406** | **9.30** | **1.551** | **13.00** | **3.582** | **31.98** | **2.727** | **20.79** |
Table 6.
Quantitative comparison of various methods on the entire test dataset of the WHU dataset. The best results are highlighted in bold. For both the EPE (pixels) and D1 (%) metrics, lower values indicate better performance.
| Method | EPE (Pixel) | D1 (%) | Time (ms) |
|---|---|---|---|
| SGM [18] | 4.889 | 50.79 | 506 |
| StereoNet [35] | 2.453 | 25.12 | 94 |
| PSMNet [13] | 2.481 | 24.81 | 232 |
| HMSM-Net [15] | 1.672 | 12.94 | 150 |
| Cascaded Recurrent Network (CRN) [37] | 1.66 | 12.87 | - |
| UGC-Net (Ours) | **1.610** | **12.19** | 217 |
Table 7.
Quantitative evaluation of individual images in texture-less areas. The baselines Gwc-Net, DSM-Net, and HMSM-Net and UGC-Net (ours) are evaluated on average endpoint error (EPE) and the fraction of erroneous pixels (D1). The best results are highlighted in bold.
| Image | OMA-247-027-001 | | OMA-251-001-006 | | OMA-342-004-031 | | OMA-383-005-025 | |
|---|---|---|---|---|---|---|---|---|
| Method | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 |
| GwcNet [24] | 1.25624 | 4.899 | 2.97991 | 30.739 | **0.91757** | 3.482 | 1.41832 | 6.008 |
| DSM-Net [14] | 1.10472 | 4.673 | 2.96452 | 19.015 | 0.94054 | 3.974 | 1.35677 | 5.180 |
| HMSM-Net [15] | 1.37804 | 6.641 | 2.83565 | 18.668 | 1.02536 | 3.692 | 1.17001 | 5.844 |
| UGC-Net (Ours) | **1.08337** | **4.420** | **2.09018** | **12.174** | 1.16741 | **3.034** | **0.81462** | **3.123** |
Table 8.
Quantitative evaluation of individual images in areas with discontinuities and occlusions. The baselines Gwc-Net, DSM-Net, and HMSM-Net and UGC-Net (ours) are evaluated on average endpoint error (EPE) and the fraction of erroneous pixels (D1). The best results are highlighted in bold.
| Image | OMA-212-008-006 | | OMA-225-027-021 | | OMA-281-006-027 | | OMA-288-039-003 | |
|---|---|---|---|---|---|---|---|---|
| Method | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 |
| GwcNet [24] | 1.51313 | 6.992 | 1.69105 | 7.959 | 1.69062 | 6.770 | 1.48672 | 5.875 |
| DSM-Net [14] | 1.34765 | 5.925 | 1.69149 | 9.959 | 1.72615 | 6.324 | 1.43775 | 4.770 |
| HMSM-Net [15] | 1.33740 | 4.951 | 1.09336 | 3.874 | 0.81109 | 2.130 | 1.42740 | 3.840 |
| UGC-Net (Ours) | **1.05042** | **2.977** | **0.86374** | **2.608** | **0.77426** | **1.983** | **1.17499** | **3.164** |
Table 9.
Quantitative evaluation of individual images in repetitive pattern areas. The baselines Gwc-Net, DSM-Net, and HMSM-Net and UGC-Net (ours) are evaluated on average endpoint error (EPE) and the fraction of erroneous pixels (D1). The best results are highlighted in bold.
| Image | OMA-132-002-034 | | OMA-276-036-032 | | OMA-374-036-034 | | OMA-391-025-019 | |
|---|---|---|---|---|---|---|---|---|
| Method | EPE | D1 | EPE | D1 | EPE | D1 | EPE | D1 |
| GwcNet [24] | 1.55693 | 9.062 | 2.17002 | 21.932 | 2.26525 | 30.075 | 2.23405 | 16.180 |
| DSM-Net [14] | 1.60977 | 10.789 | 2.22618 | 24.362 | 2.24347 | 29.455 | 2.32619 | 18.840 |
| HMSM-Net [15] | 1.30982 | 9.928 | 1.42085 | 10.182 | 1.83379 | 21.558 | 2.10042 | 15.735 |
| UGC-Net (Ours) | **1.11867** | **7.090** | **1.16425** | **7.791** | **1.59277** | **15.730** | **1.96897** | **13.403** |
Table 10.
Quantitative comparison of the variant models in the ablation study. Each ablation model reflects a different combination of the proposed components to evaluate their respective contributions to disparity estimation performance. UGC_ablation1 uses only the geometry cost volume, while UGC_ablation2 uses the AFV cost volume. UGC_ablation3 adds cost volume fusion (CVF) without uncertainty computation. UGC_ablation4 adds uncertainty attention but does not weight the context cost volume. The Full Model is our method. ✓ indicates the inclusion of a specific module in each ablation.
| Model | Components | | | | Jacksonville | | Omaha | |
|---|---|---|---|---|---|---|---|---|
| | Context Cost | CVF Module | UA Module | Weighted Context | EPE | D1 | EPE | D1 |
| UGC_ablation1 | | | | | 1.390 | 9.24 | 1.461 | 8.53 |
| UGC_ablation2 | ✓ | | | | 1.388 | 9.10 | 1.433 | 8.11 |
| UGC_ablation3 | ✓ | ✓ | | | 1.318 | 8.83 | 1.403 | 7.48 |
| UGC_ablation4 | ✓ | ✓ | ✓ | | 1.311 | 8.81 | 1.370 | 7.33 |
| Full Model (Ours) | ✓ | ✓ | ✓ | ✓ | **1.280** | **8.50** | **1.356** | **7.24** |