Article

Training Data for Stereo Matching Algorithms Based on Neural Networks and a Method for Data Evaluation

by
Adam L. Kaczmarek
Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 80-233 Gdansk, Poland
Appl. Sci. 2025, 15(23), 12663; https://doi.org/10.3390/app152312663
Submission received: 4 November 2025 / Revised: 24 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025
(This article belongs to the Section Robotics and Automation)

Abstract

This paper addresses the problem of selecting data for training stereo matching algorithms. The paper presents an overview of currently available learning datasets, including synthetic data and data from real environments. Stereo matching algorithms based on neural networks require high-quality learning data in order to provide the expected results, which take the form of disparity maps. There are hundreds of stereo matching algorithms for processing images from stereo cameras. However, a significant problem with this 3D data processing technology is the relatively low availability of learning data with dense ground truth from real environments. The paper also introduces an evaluation method for estimating the quality of input data. The method considers features such as the quality of calibration, the density of ground truth, and the occurrence of occluded areas. The research presented in this paper is aimed at developing stereo matching methods applicable not only to benchmarks of such algorithms but also to applications in real environments.

1. Introduction

Applying a deep learning method requires training the neural network with valid data. This paper focuses on the availability of data for training networks in the field of obtaining disparity maps on the basis of images from stereo cameras. The paper also addresses the problem of data quality, and a novel method for estimating the quality of datasets is proposed. Data quality is crucial for the correct training of neural networks. If data contains inappropriate values, then neural networks are likely to provide incorrect results reflecting the quality of their training data.
Another problem with stereo matching algorithms is that they are often trained in such a way that they match the criteria set by testbeds and benchmarks. Such a scenario occurs for multiple algorithms, including those proposed by Guo et al., Knöbelreiter et al., and Žbontar and LeCun [1,2,3]. As a result, tested algorithms receive high scores in benchmarks, but they are not tuned to be applicable in real environments. There are two main benchmarks used for developing stereo matching algorithms: Middlebury Stereo Evaluation and the KITTI Vision Benchmark Suite (KITTI) [4,5]. These benchmarks provide both test data and rankings of algorithms based on the quality of results obtained from tested stereo matching algorithms. If a stereo matching algorithm based on neural networks is going to be evaluated with the use of some benchmark, then it is typically trained on data from this benchmark. In order to make these algorithms suitable for images other than those from benchmarks and for general use, it is best to expand the quantity and diversity of training data.
The original contributions of the paper are the following: (1) It lists sources of data for learning stereo matching algorithms based on neural networks. The list includes both datasets with images taken from real stereo cameras and synthetic images resembling images from a stereo camera. (2) The paper introduces a novel method for evaluating the quality of data for the purpose of learning stereo matching algorithms. (3) The paper presents results of evaluation of publicly available datasets.

2. Related Work

The main purpose of stereo matching algorithms is the estimation of distances to objects located in the field of view of a stereo camera. The technology of stereo vision has great potential because it can be used in the presence of intense natural light and it provides dense data [5]. Moreover, a measurement can be performed without the need to relocate a scanning device. However, in real applications, other kinds of distance measuring technologies are often used, such as Light Detection and Ranging (LIDAR), structured light 3D scanning, or 3D scanning with the use of a moving camera [6,7].

2.1. 3D Scanning Technologies

LIDAR is a popular technology designed for measuring distances. This method has a wide range of applications, including the automotive industry, terrain modeling, and other forms of 3D scanning [8,9]. The main problem with LIDAR is that it provides sparse data; the result of a LIDAR scan is a set of distances to discrete points in 3D space. This problem is further discussed in Section 2.3.2 of this paper.
Technologies based on structured light 3D scanning provide dense data. However, in general, this kind of scanner cannot be used outdoors because natural light interferes with the measurement, deteriorating the quality of results [10]. This is because a structured light 3D scanner needs to emit a light pattern in order to perform a measurement. This requirement also prevents such scanners from being used for distant objects, which cannot be sufficiently illuminated by the light source of the scanner.
There are also technologies of 3D scanning based on single cameras, which are used to take images from different points of view located around the scanned object [11]. Retrieving this kind of scan is based on methods such as multi-view stereo and structure from motion.
Stereo cameras combine the benefits of the other 3D scanning technologies. However, they are rarely used in real applications in comparison to other 3D scanning technologies [6]. The quality of results obtained with stereo cameras is often lower than that of other methods. This paper addresses this problem with regard to the availability of training data for algorithms based on neural networks.

2.2. Main Features of Stereo Matching Technology

In general, there are two types of stereo matching algorithms. There are algorithms based on neural networks and generic algorithms that do not use this technology. Generic algorithms do not require training. However, results of algorithms listed in benchmarks such as Middlebury Stereo Vision and KITTI show that algorithms based on neural networks achieve better results than other kinds of algorithms [12,13].
Stereo matching algorithms which do not use neural networks often consist of two main steps, which are local matching and global optimization [14]. Local matching is performed in order to match points from a reference image with corresponding points in the side image. The matching is performed in the area of a side image in which it is expected that a corresponding point is present. In this process, a similarity function is used. After local matching, an algorithm performs global optimization in which it considers relations between results of local matching obtained for different points of a reference image.
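As an illustration of the local matching step, the following sketch computes, for each pixel of the reference image, the disparity with the lowest sum of absolute differences (SAD) over a small window. It is only a brute-force illustration of the general idea, not any specific published algorithm; the function name, window size, and disparity range are assumptions.

```python
import numpy as np

def local_matching_sad(left, right, max_disp=64, window=5):
    """Brute-force local matching with a SAD similarity function.
    `left` is the reference image, `right` the side image (grayscale arrays)."""
    h, w = left.shape
    half = window // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref_patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best_cost, best_d = np.inf, 0
            # Search only the area of the side image where a corresponding point is expected.
            for d in range(0, min(max_disp, x - half) + 1):
                side_patch = right[y - half:y + half + 1,
                                   x - d - half:x - d + half + 1].astype(np.int32)
                cost = np.abs(ref_patch - side_patch).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity  # a subsequent global optimization step would refine these values
```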
There are also stereo matching algorithms based on neural networks. They operate similarly to algorithms performing image recognition: the algorithm learns on the basis of input data and expected results. In the case of stereo matching algorithms, the input data are pairs of calibrated images, and the expected results are acquired from ground truth delivered in the form of disparity maps. Neural networks have proved their usefulness in various applications, including stereo matching. In the KITTI ranking, the majority of the top 10 algorithms use neural networks as of 22 November 2025 [13].
A dataset appropriate for training a stereo matching algorithm needs to consist of three elements: left images, right images, and ground truth (GT) defining what values are expected to be obtained as a result of stereo matching. In the case of training stereo matching algorithms, ground truth should have the form of disparity maps. A disparity map shows differences in the locations of objects in images from a stereo camera. Such a map can be transformed into a depth map representing distances between a stereo camera and objects located in its field of view. The parameters of this transformation can be retrieved in the process of calibrating a stereo camera [15]. The two cameras in a stereo camera have different functions. One of them acts as the reference camera, which defines the point of view of the camera set. The second camera, called the side camera, is only used for identifying disparities of points visible in the image obtained from the reference camera.
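As a brief illustration of the disparity-to-depth transformation mentioned above, the sketch below applies the standard relation depth = focal length × baseline / disparity for a rectified stereo pair. The focal length (in pixels) and the baseline are assumed to come from the calibration of the stereo camera, and the function name is illustrative rather than taken from any dataset toolkit.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres): Z = f * B / d.
    Points with (near-)zero disparity are treated as infinitely far away."""
    d = np.asarray(disparity, dtype=np.float64)
    depth = np.full(d.shape, np.inf)
    valid = d > min_disp
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```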
The requirement to provide ground truth makes it challenging to develop data for training stereo matching algorithms with pictures of actual objects. Ground truth is most often obtained with the use of 3D scanning equipment such as structured light 3D scanners and scanners based on Light Detection and Ranging (LIDAR) [16,17]. In order to create GT, a 3D scanning device must also be calibrated in relation to the stereo camera’s location, as these devices are situated differently in a real 3D space. This calibration is required because data from a 3D scanning device must resemble data captured from a stereo camera’s point of view.

2.3. Datasets

Many datasets used for learning stereo matching algorithms were originally prepared as parts of benchmarks for testing these algorithms [4,5]. Datasets in benchmarks are divided into data for training stereo matching algorithms and data for testing them. The training subsets contain both input image pairs and ground truth; this data is suitable for training stereo matching algorithms based on neural networks. Benchmarks also contain datasets for testing. For these datasets, ground truth is usually not publicly available; it is used only within the benchmark system for evaluating stereo matching algorithms.

2.3.1. Datasets with Dense Ground Truth

This section presents information about datasets with ground truth that is dense. Such a GT is usually obtained using a structured light 3D scanner. One of the most important datasets is the one provided with Middlebury Stereo Evaluation [4,12]. Middlebury Stereo Evaluation is a widely known testbed developed since 2001. As of October 2025, the testbed consists of 95 datasets, including 24 sets made with a mobile device. Each dataset contains a pair of images taken by a stereo camera and ground truth, which was acquired from a structured light 3D scanner. Images included in the Middlebury dataset were taken indoors, and they present scenes with various kinds of objects. Middlebury Stereo Evaluation also provides a ranking of stereo matching algorithms. It considers 273 algorithms as of 27 October 2025.
Another source of datasets with dense ground truth is the ETH3D benchmark. ETH3D was published in 2017 [18]. It contains datasets for stereo vision, multi-view stereo, and simultaneous localization and mapping (SLAM). Ground truth for these sets was prepared with the use of a laser scanner. The ETH3D benchmark contains a ranking of stereo matching algorithms. There are over 500 stereo matching algorithms on the list of ranked algorithms. ETH3D contains 27 training and 20 testing datasets made with a two-view stereo camera, which was a part of a multi-camera rig.
Booster is a dataset that was used in the NTIRE 2024, TRICKY 2024, and NTIRE 2025 challenges, announced in conjunction with the European Conference on Computer Vision (ECCV) [19,20,21]. The training subset of the Booster set contains 38 different indoor scenes. Booster consists of images with transparent and reflective objects. Such objects are particularly hard to process with stereo matching algorithms, and Booster is primarily designed for researching the usage of stereo matching algorithms with this kind of input data.
Wang et al. described the PlantStereo dataset, containing stereo images of plants, in 2023 [22]. PlantStereo contains images of pepper, pumpkin, spinach, and tomato plants. The quantities of training and test sets are presented in Table 1. The authors prepared ground truth with the use of a depth camera based on structured light. Images were taken indoors.
InStereo2K is a collection of 2050 stereo image pairs [23]. The set includes 2000 training stereo pairs with ground truth. Ground truth was obtained by applying active-stereo 3D imaging, a technology in which taking images with a stereo camera is supported by an additional projector that emits light patterns. It is similar to structured light 3D imaging, in which only one camera and a light source are used. Images included in InStereo2K were collected indoors using real objects.
Holopix50k is a large dataset obtained from dual-camera mobile phones [24]. Holopix50k contains 49,368 image pairs collected from the Holopix mobile social platform. Images are categorized with regard to their content. The main problem with this dataset in the context of training stereo matching algorithms is its ground truth. Ground truth was not prepared using any 3D scanning device; it was obtained using software trained on the Middlebury dataset. Therefore, this GT does not provide original training information beyond the data already collected in the Middlebury dataset. Flickr1024 and the Web Stereo Video Dataset (WSVD) are also sources of real pairs of images from stereo cameras [25,26]. However, these sources do not provide ground truth.

2.3.2. Datasets with Sparse Ground Truth

There are also datasets in which ground truth only partly covers the area visible in images from a reference camera. In these datasets, GT was typically prepared using LIDAR.
One of the most important datasets of this kind is KITTI, a widely used benchmark for stereo matching algorithms [13]. The benchmark was developed for the purpose of using stereo matching algorithms in autonomous driving. Therefore, KITTI contains images taken from a car that was moving along streets. The authors used LIDAR for obtaining ground truth. KITTI consists of a dataset prepared in 2012 and another one prepared in 2015. The set from 2012 contains 194 training image pairs and 195 test image pairs; the other one consists of 200 training pairs and 200 test pairs. Apart from datasets, KITTI contains a ranking of stereo matching algorithms. The ranking based on the 2015 dataset contains 366 algorithms as of 27 October 2025 [13].
Data in ground truth provided with these datasets is sparse for the most part. A sample pair containing an image from one of the cameras and corresponding ground truth is presented in Figure 1. The part of GT where a car is visible is dense, but other parts of ground truth are sparse. Details on the density of data in this dataset are presented in Section 4.
A large dataset with sparse ground truth is Multi-Spectral Stereo (MS2) [27]. It contains over 180,000 image pairs. The dataset was obtained using a moving car with attached sensors, and data was collected in urban areas. The authors used stereo cameras, stereo thermal cameras, stereo near infrared (NIR) cameras, stereo LIDAR systems, and GPS/IMU information. Moreover, data was acquired at various times of day and in various weather conditions. Although the quantity of data in this dataset is high, using it for training stereo matching algorithms is problematic because the dataset does not contain dense ground truth. MS2 includes data from two LIDAR systems, which can serve as a source of ground truth. A sample set of data containing a pair of images and corresponding data from LIDAR is presented in Figure 2.
There are also many other datasets acquired using a stereo camera and a LIDAR sensor with a car or another kind of equipment. These datasets and their main features are listed in Table 2.

2.3.3. Synthetically Rendered Datasets

There are many synthetically created datasets containing pairs of images and ground truth. The greatest advantage of rendered images is that they are much easier to create than real stereo images with ground truth. However, rendered images also have a major disadvantage in that they resemble reality only to a limited extent. There are many more inaccuracies and distortions in images obtained from real cameras than in rendered images. These inaccuracies occur, among others, in calibration, which is not perfect in any real application [6]. Moreover, the quality of calibration of real cameras deteriorates over time [38].
Table 3 presents a list of synthetic datasets and their main features. The largest dataset is FoundationStereo created by NVIDIA Research in 2025 [39]. The support from an influential company, NVIDIA, made it possible to create a large dataset including over 1 million pairs of images. As a synthetic dataset, it also contains full ground truth. NVIDIA also supported the development of a novel stereo matching algorithm whose name is the same as the name of the dataset. Wen et al. presented both the dataset and the algorithm [39]. This algorithm was trained using the FoundationStereo dataset and other synthetic datasets.
Falling Things is an older synthetic dataset released by NVIDIA Research [40]. The dataset contains over 60,000 images containing views of 21 household objects. Falling Things was developed using Unreal Engine 4.
Mayer et al. were one of the main precursors of creating synthetic datasets for stereo matching algorithms [41]. They developed datasets in 2016, which they called Scene Flow [41]. Scene Flow includes three datasets called FlyingThings3D, Driving, and Monkaa. These datasets contain images for training and testing algorithms for optical flow, segmentations, and stereo matching. There are, in total, over 39,000 stereo frames and ground truth.
Another significant dataset is Spring [42]. It is a synthetic dataset intended for researching stereo matching, optical flow, and scene flow estimation. The set contains 6000 image pairs, and Blender was used in the rendering process. The authors of the Spring dataset also provided, in 2025, a dataset called RobustSpring [43]. RobustSpring contains images that are intentionally corrupted in order to better resemble real images. The effects used to degrade the images include reductions in image quality, changes in brightness, incorrect focus, and other kinds of noise.

2.3.4. Other Datasets

Fisher from the University of Edinburgh published a list of data sources for developing and testing algorithms in the field of computer vision [52]. As of 2025, the list is regularly updated. The list contains a variety of data for different purposes, including camera calibration, research on 3D point clouds, agricultural applications, underwater data processing, and gesture recognition. In total, Fisher provided over 2000 links to data sources.
There is also a dataset called NYU which is well known in the field of object segmentation [53]. This dataset contains images from a single RGB camera and depth maps acquired with the use of Microsoft Kinect. NYU does not provide stereo images. Therefore, it is not suitable for learning stereo matching algorithms.
There are also some datasets that are described in research papers. However, their data is no longer available as of 31 October 2025. Such datasets are ApolloScape, TorontoCity, HCI Benchmark, CREStereo, HR-VS, and HR-RS [54,55,56,57,58].

3. Materials and Methods

This section describes the novel evaluation method introduced in this paper, called the Stereo Camera Dataset Quality Evaluation Method (SCDQEM). It needs to be emphasized that the SCDQEM method is designed for evaluating the quality of input data for training stereo matching algorithms; it is not a method for evaluating the quality of stereo matching algorithms or the quality of their results. The aim is to improve the quality of learning data in order to create algorithms which deliver results of higher quality. The method involves no random factors, so it provides the same results for the same input data. The method consists of the following steps:
  • Estimating the quality of calibration.
  • Determining the extent of occluded areas.
  • Verifying the completeness of ground truth.
These steps are described in subsequent subsections.

3.1. Estimating the Quality of Calibration

In the process of pattern-based calibration, images from a stereo camera are transformed in order to reduce distortions. One of the commonly used calibration methods is implemented in the OpenCV library [15,59]. This method is based on the algorithm proposed by Zhang et al. [60]. OpenCV uses a transformation which takes into account six radial coefficients, two tangential distortion coefficients, and four prism distortion coefficients. The values of these coefficients are obtained as a result of analyzing a series of images of a chessboard pattern taken with the stereo camera. The calibration method aims at reducing distortions at the points that are the corners of the chessboard squares. This is a minimization problem, which Zhang et al. solved using the Levenberg–Marquardt algorithm.
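For illustration, the sketch below shows a typical pattern-based stereo calibration flow using OpenCV's Python bindings. The board dimensions, square size, and function name are assumptions, corner refinement is omitted for brevity, and the full distortion model mentioned above (six radial, two tangential, and four prism coefficients) is estimated only when additional flags such as cv2.CALIB_RATIONAL_MODEL and cv2.CALIB_THIN_PRISM_MODEL are set.

```python
import cv2
import numpy as np

def calibrate_stereo(left_imgs, right_imgs, cols=9, rows=6, square=0.025):
    """Pattern-based stereo calibration from lists of left/right chessboard images.
    `cols` x `rows` are the inner corners of the board, `square` its cell size in metres."""
    objp = np.zeros((rows * cols, 3), np.float32)
    objp[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * square
    size = left_imgs[0].shape[1::-1]                 # (width, height)
    obj_pts, left_pts, right_pts = [], [], []
    for l_img, r_img in zip(left_imgs, right_imgs):
        gl = cv2.cvtColor(l_img, cv2.COLOR_BGR2GRAY)
        gr = cv2.cvtColor(r_img, cv2.COLOR_BGR2GRAY)
        ok_l, c_l = cv2.findChessboardCorners(gl, (cols, rows))
        ok_r, c_r = cv2.findChessboardCorners(gr, (cols, rows))
        if ok_l and ok_r:                            # use only views where the board is found in both images
            obj_pts.append(objp)
            left_pts.append(c_l)
            right_pts.append(c_r)
    # Intrinsics of each camera first, then the extrinsic relation between the two cameras.
    _, K1, D1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, D2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    rms, K1, D1, K2, D2, R, T, E, F = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, D1, K2, D2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return rms, K1, D1, K2, D2, R, T
```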
Nevertheless, considering all twelve distortion parameters and minimizing the error rate does not solve the problem with calibration when these methods are applied in real environments. A problem occurring in real environments is that authors of datasets may not be diligent enough in preparing their setup and performing the calibration process. Problems can occur with a low-quality chessboard pattern, inappropriate mounting of cameras, the influence of temperature, an insufficient number of images of the chessboard pattern, and other factors. For these reasons, the authors of the KITTI benchmark calibrated their stereo camera every day [17].
Because of these problems with the practical application of the calibration process, some inaccuracy remains in images from stereo cameras even after pattern-based calibration. These are distortions that may negatively influence the process of obtaining disparity maps from pairs of images. This section proposes a method for evaluating the quality of calibration based on the usage of descriptors.
Reducing distortions that arise when taking images with a stereo camera can be accomplished through pattern-based calibration. There are intrinsic distortions created by the lens and other parts of a stereo camera. There are also extrinsic distortions caused by inaccuracies in mutual locations of cameras in a stereo camera. When there are images of a predefined pattern, such as a chessboard, it is possible to estimate the extent of all kinds of distortions. However, in the case of having images that are not intended to be used for calibrating cameras, there are fewer possibilities for estimating intrinsic and extrinsic distortions. The most significant distortion, however, is the difference in the y-axis between a right and a left image in an image pair. Such a discrepancy causes a significant deterioration in the quality of results [6]. This section describes a method for minimizing it.
Scharstein et al. addressed the problem of discrepancies in the y-axis occurring after a pattern-based calibration [16]. They encountered this problem during the preparation of the dataset for the Middlebury Stereo Vision testbed. They solved it by applying bundle adjustment to images that were already calibrated with the use of a pattern-based calibration. This problem was also noticed by Yang et al., who proposed a stereo matching algorithm robust to miscalibration in the y-axis [58]. The method for estimating the quality of calibration presented in this paper aims at automatically detecting discrepancies in the y-axis. This makes it possible to correct the miscalibration. The method consists of the following steps:
  • Identifying keypoints in both the left and right images.
  • Matching keypoints.
  • Selecting keypoints.
  • Calculating an average discrepancy.
These steps are described below.
In the method proposed in this paper, a local feature descriptor is used to identify keypoints. By default, the Speeded Up Robust Features (SURF) descriptor is used; however, the usage of other descriptors is also possible [61]. There are widely known descriptors such as Scale-Invariant Feature Transform (SIFT), Binary Robust Invariant Scalable Keypoints (BRISK), Binary Robust Independent Elementary Features (BRIEF), Oriented FAST and Rotated BRIEF (ORB), and A-KAZE [62,63,64,65]. SURF is a faster descriptor derived from SIFT. Descriptors make it possible to identify keypoints in images. In the SCDQEM method, keypoints are identified in both images of a stereo pair.
Karami et al. presented a comparison of the SIFT, SURF, BRIEF, and ORB descriptors [66]. Descriptors are designed for identifying corresponding keypoints in images that differ in rotation, scale, and other forms of image transformation. It is possible to identify corresponding keypoints even when one of the images is rotated by more than 45° relative to the other. Descriptors can also be used when one of the images is at least two times larger than the other. In the case of using descriptors for stereo images obtained from a stereo camera after calibration, these kinds of distortions are minor in comparison to the distortions for which descriptors were designed. Karami et al. noticed that there are cases in which the SIFT descriptor is more precise than SURF; these differences are insignificant for images as similar as a pair of calibrated images from a stereo camera. The SURF descriptor was selected for practical reasons, as it is faster.
After identifying keypoints, the SCDQEM method matches keypoints located in the left image with keypoints in the right image. In a perfect match, every pair of matched keypoints from the two images shows the same part of a real object. However, matching also produces incorrect matches, in which keypoints from the two images are paired even though they relate to different real objects. Therefore, some matched keypoints need to be excluded from further consideration.
Many of these incorrect matches can be identified on the basis of the expected geometry of a calibrated stereo pair. After pattern-based calibration, there is only a limited area in which a keypoint from one image is expected to be matched with a keypoint in the other image. If a keypoint located at coordinates (x_0, y_0) in one image of a stereo pair is matched with a keypoint placed at (x_1, y_1) in the second image, then the discrepancy in the y-axis is equal to |y_0 − y_1|.
As explained at the beginning of this section, inaccuracies in calibration are not caused by an inappropriate minimization of the error rate in the calibration process; they are caused by a lack of precision in performing the calibration process by the authors of datasets. Therefore, it is not possible to set a limit on the possible calibration error: if datasets were prepared carelessly, these discrepancies might be of any size. For this reason, the method presented in this paper operates under the assumption that a dataset is usable for a general-purpose stereo matching algorithm even if there are some errors in calibration. As presented in [6], if the discrepancy in the y-axis is greater than 5 pixels, a pair of images is unusable for stereo matching because the error rate rises to over 80%.
As a result, this value was chosen as the threshold’s default value, meaning that corresponding keypoints with a larger y-axis discrepancy are deemed to be incorrectly matched. These keypoints are excluded from further calculations. The threshold will be marked with H.
There are applications in which the threshold H can be set differently. Such a case occurs when a stereo matching algorithm is intentionally designed to be robust to calibration errors; HSMNet is an example of this kind of algorithm [58]. If data for such an algorithm is collected, then the value of the threshold H can be set according to the characteristics of the specific algorithm with which the data will be used.
The extent of a miscalibration is estimated on the basis of keypoints that were not excluded because of the threshold H. For all these keypoints, an average value of discrepancy is calculated as presented in Equation (1).
$$\mathrm{AVDISC} = \frac{1}{K}\sum_{k=1}^{K} \left(y_k^{0} - y_k^{1}\right) \qquad (1)$$
where y_k^0 and y_k^1 are the y-coordinates of the k-th pair of corresponding keypoints and K is the number of keypoint pairs considered.
Values of AVDISC are obtained for all considered pairs of images in the dataset under evaluation. The average over all pairs in the set will be denoted by $\overline{\mathrm{AVDISC}}$. After normalization, this value becomes the metric called discrepancy in calibration (DCB), which indicates the quality of calibration in the y-axis.
The lowest possible value of $\overline{\mathrm{AVDISC}}$ is 0, obtained when the calibration is perfect; any other value indicates miscalibration. The maximum value of $\overline{\mathrm{AVDISC}}$ is related to the maximum possible image size in pixels, so the only definite bound is that it is finite.
Values of $\overline{\mathrm{AVDISC}}$ are normalized in order to obtain values of DCB. DCB lies in the range between 0 and 1, where 1 indicates the highest quality of calibration and 0 the lowest. The transformation of $\overline{\mathrm{AVDISC}}$ to DCB is performed with the use of a reciprocal function, as presented in Equation (2).
$$\mathrm{DCB} = \frac{1}{1 + \left|\overline{\mathrm{AVDISC}}\right|} \qquad (2)$$
The SCDQEM method presented in this paper also considers deviations from the average value. It is possible that an evaluated dataset contains image pairs with significant miscalibration even though the average over the whole set is close to the correct value. This problem is addressed using the metric called the normalized standard deviation of calibration (SDC). SDC is based on the standard deviation of the AVDISC values calculated for the individual image pairs in the considered dataset. The standard deviation obtained from these values is normalized in the same way as DCB, using the reciprocal function. The formula for SDC is presented in Equation (3).
$$\mathrm{SDC} = \frac{1}{1 + \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(\mathrm{AVDISC}_i - \overline{\mathrm{AVDISC}}\right)^2}} \qquad (3)$$
where AVDISC_i is the average value obtained with Equation (1) for the i-th pair in the dataset and N is the number of image pairs.
In the experiments presented in this paper, all steps of this method for estimating the quality of calibration were implemented with the use of the OpenCV library, version 3.4.13 [15]. The parameters required for this algorithm are the pairs of images and the threshold H. There are also parameters used in the SURF algorithm and in the function for matching descriptors. The SURF algorithm takes a minimum Hessian threshold as an argument; this parameter is described in [61]. In the software used in this paper, the default value of 400 was used, and this parameter is not expected to be modified in the SCDQEM method. The keypoint matcher is also parametrized, as it can be used with different kinds of norm. The NORM_L2 norm is used, as it is the default in the OpenCV library [67], and it is not expected that a different norm will be used with the SCDQEM method. The algorithm then iterates over the list of matched keypoints in order to verify whether the requirement set by the threshold H is met. In the last part of the algorithm, the values defined in Equations (1)–(3) are calculated.
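The sketch below summarizes this step of the method in Python: SURF keypoints, brute-force matching with the L2 norm, rejection of matches above the threshold H, and the values from Equations (1)–(3). SURF requires an opencv-contrib build, the SDC line follows the standard-deviation form of Equation (3) as reconstructed above, and all function and variable names are illustrative rather than part of a published implementation.

```python
import cv2
import numpy as np

def avdisc_for_pair(left_gray, right_gray, H=5.0, min_hessian=400):
    """AVDISC (Equation (1)) for one image pair: the mean y-discrepancy of
    matched SURF keypoints, after discarding matches with |dy| > H."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=min_hessian)
    kp_l, des_l = surf.detectAndCompute(left_gray, None)
    kp_r, des_r = surf.detectAndCompute(right_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.match(des_l, des_r)
    diffs = [kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1] for m in matches]
    diffs = [d for d in diffs if abs(d) <= H]        # threshold H removes implausible matches
    return sum(diffs) / len(diffs) if diffs else 0.0

def scdqem_calibration(image_pairs, H=5.0):
    """DCB and SDC (Equations (2) and (3)) for a whole dataset of image pairs."""
    avdiscs = np.array([avdisc_for_pair(l, r, H) for l, r in image_pairs])
    dcb = 1.0 / (1.0 + abs(avdiscs.mean()))          # Equation (2)
    sdc = 1.0 / (1.0 + avdiscs.std())                # Equation (3)
    return dcb, sdc
```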

3.2. The POA Metric

The SCDQEM method also uses a metric called the presence of occluded areas (POA), which relates to occluded areas in stereo camera images. Occluded areas are those for which it is not possible to retrieve disparity because they are visible from only one camera. It is impossible to match a portion of an object in a reference image with the same portion in the side image when an object placed closer to the stereo camera partially covers background objects.
A sample placement of an object causing the occurrence of an occluded area is presented in Figure 3. Figure 3a,b present images taken from a stereo camera. In these images, a red circle is placed closer to a stereo camera and it partly covers a green rectangle. In the disparity map presented in Figure 3c, an area at the left side of a circle is visible only from a left camera. Therefore, it is marked with black as it is an occluded area.
Let us assume that there is a point p at coordinates (x, y) in a reference image I_r. The disparity at point p will be denoted by d. A disparity of d at p indicates that in the side image I_s there is a point corresponding to p at coordinates (x − d, y). However, if in image I_r there is a point p_c with disparity d_c located at coordinates (x − d + d_c, y) and d_c > d, then in the side image the point corresponding to p will be covered by the point corresponding to p_c. This kind of occlusion can be identified on the basis of ground truth. Let us consider, in ground truth G, a point g with coordinates (x, y) for which the disparity is equal to d. This point should be marked as occluded if there is in G a point g_c at coordinates (x_c, y) such that its disparity d_c is equal to d + (x_c − x) and x_c > x. Equation (4) gives the condition for all points in ground truth that should be marked as belonging to an occluded area.
$$g_i \in C \quad \text{if} \quad \exists\, d_i,\, d_i^{c},\, g_i^{c}: \left(x_i - d_i = x_i^{c} - d_i^{c}\right) \wedge x_i^{c} > x_i \qquad (4)$$
where C is the set of occluded points, d_i and d_i^c are the disparities, and x_i and x_i^c are the x-coordinates of the ground-truth points g_i and g_i^c.
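A minimal sketch of applying Equation (4) to a dense ground-truth disparity map is given below; it produces an occlusion map of the kind shown later in Figure 7d. Disparities are rounded to whole pixel columns for the comparison, and the function name and the `defined` mask argument are assumptions rather than part of any published implementation.

```python
import numpy as np

def occluded_mask(gt_disp, defined):
    """Mark ground-truth points that should be occluded according to Equation (4):
    a point (x, y) with disparity d is occluded when a point further to the right
    in the same row maps to the same column x - d of the side image."""
    h, w = gt_disp.shape
    occluded = np.zeros((h, w), dtype=bool)
    for y in range(h):
        seen = set()                    # side-image columns already claimed by points with larger x
        for x in range(w - 1, -1, -1):  # scan each row from right to left
            if not defined[y, x]:
                continue
            t = int(round(x - float(gt_disp[y, x])))
            if t in seen:
                occluded[y, x] = True
            else:
                seen.add(t)
    return occluded
```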
In real ground truth provided with datasets, there are points for which disparity values are present even though these points in fact lie in occluded areas. Let U denote the set of points in ground truth for which the disparity value is undefined. Equation (5) defines the percentage of points correctly classified as occluded for a single pair of images.
$$\mathrm{POA}_e = 1 - \frac{n\left(\overline{U} \cap C\right)}{n(C)} \qquad (5)$$
where the function n(·) denotes the cardinality of a set and $\overline{U}$ is the complement of U, i.e., the set of ground-truth points for which a disparity value is provided.
Values defined by Equation (5) are used to calculate the POA metric introduced in this paper. This metric is equal to the average of the values POA_e calculated for all image pairs in the considered dataset. The range of values of the POA metric is from 0 to 1, where 1 is the best value.

3.3. CPL Metric

Another metric, called completeness (CPL), refers to the extent to which ground truth covers the area of a reference image. For certain datasets, the provided ground truth does not include disparity values in regions where disparities could be retrieved from the pair of input images. The CPL metric takes into account the occluded areas described in the previous subsection. Ground truth should contain disparity information for all points apart from those in occluded areas, and the CPL metric reflects the extent to which ground truth contains these data. Equation (6) shows the method for calculating completeness for a single pair of images in a dataset.
$$\mathrm{CPL}_e = \frac{n\left(G \setminus C\right)}{n(A)} \qquad (6)$$
where G is the set of points with a defined disparity value in ground truth and A is the set of points in I_r for which there is a corresponding point in I_s.
The value of the CPL metric for the whole dataset is equal to the average of the values CPL_e calculated for every image pair in the set. Similarly to the POA metric, CPL takes values between 0 and 1.
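Under the definitions above, completeness can be computed from a mask of points with defined ground-truth values and an occlusion map (either provided with the dataset or derived with Equation (4)). In the sketch below, the set A is approximated by all non-occluded pixels, which is a simplifying assumption; out-of-frame points could additionally be excluded when the camera geometry is known. All names are illustrative.

```python
import numpy as np

def cpl_for_pair(defined, occluded):
    """CPL_e from Equation (6): the share of matchable reference-image points
    for which the ground truth provides a (non-occluded) disparity value.
    `defined` and `occluded` are boolean masks of the same shape."""
    A = ~occluded                                    # approximation of the matchable area
    covered = np.logical_and(defined, ~occluded)     # G \ C
    return covered.sum() / max(A.sum(), 1)
```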

4. Results

In experiments presented in this paper, the SCDQEM method presented in Section 3 was applied to the following datasets: Middlebury Stereo Evaluation, ETH3D, Booster, PlantStereo, KITTI, MS2, and synthetic datasets.

4.1. Evaluation of the Middlebury Dataset

The evaluation of the Middlebury dataset with the use of the SCDQEM method confirmed its exceptional quality. As far as calibration is concerned, the result of the DCB metric was equal to 0.96 and the result of the SDC metric was equal to 0.85. This reflects the care with which the cameras used for preparing this set were calibrated: they were calibrated using not only chessboards but also bundle adjustment [16].
As far as occluded areas are concerned, the Middlebury dataset is unique because the authors provided files with masks marking occluded areas. Precise masks of all occluded areas are available for all training sets in the dataset. The POA metric, considering the existence of these masks, is equal to 1.
In the Middlebury dataset, there are not many areas in ground truth that lack disparity values; however, such areas exist. Results showed a value of the CPL metric equal to 0.96. Figure 4 shows sample ground truth provided with the Middlebury dataset for a pair called Shelves. Black areas show parts of ground truth in which disparity is missing.

4.2. Evaluation of ETH3D

The value of the DCB metric obtained for the ETH3D dataset was equal to 0.99 and SDC was equal to 0.87. It indicates that this dataset also has a high quality of calibration.
A worse result was obtained for the CPL metric. In the case of ETH3D, this metric was equal to 0.6. This indicates that ground truth contains large areas for which disparity is undefined, although it should be available. This problem is illustrated in Figure 5, which presents a sample image from a reference camera and the corresponding ground truth.
Areas missing in ground truth are mainly those placed in the background of the scene. Regions with missing disparities also cover occluded areas. The value of the POA metric calculated for this dataset is equal to 1.

4.3. Evaluation of Booster

In the case of the Booster dataset, the value of the DCB metric is equal to 0.83. Thus, on average, the discrepancy in the y-axis is low. However, the obtained value of SDC was 0.24, which indicates significant extrinsic distortions for some pairs of images. The SDC results show that, for some pairs of images included in the set, discrepancies in the y-axis exceed three pixels, which is a lot in the case of processing images with a stereo matching algorithm. It needs to be noted that the resolution of images in the set is 4112 × 3008, which is many times higher than the resolution of images in other datasets; images with lower resolution are less affected by inaccuracies in calibration. Nevertheless, if the Booster set were used to train a neural network-based stereo matching algorithm, then such an algorithm would also have to cope with these inaccuracies in calibration.
The Booster dataset provides occlusion maps, similarly to the Middlebury set. The occlusion maps correctly identify occluded areas; thus, POA is equal to 1. Ground truth of the Booster dataset does not contain data for all points for which it is possible to retrieve disparities. Experiments showed that the value of the CPL metric is equal to 0.95.

4.4. Evaluation of PlantStereo

In the case of the PlantStereo dataset, the SCDQEM method was applied to each group of datasets. Results of evaluating calibration are presented in Table 4.
Table 4 shows the quantities of datasets, the values of DCB and SDC, and the minimum and maximum discrepancies among the pairs of images included in the datasets. The results show that, on average, the right image is shifted by only 0.07 pixels with regard to the left image, which indicates that the quality of calibration in these datasets is very high. However, it needs to be noted that in the case of the testing dataset of the pepper plant, the average discrepancy is equal to 0.41, which is almost half a pixel. This degree of miscalibration is noticeable, since it could affect the quality of the disparity maps derived from the dataset.
The next step of the quality assessment is checking the presence of occluded areas in ground truth. In the case of PlantStereo, the POA metric was equal to 0.51. Thus, the PlantStereo dataset only partially accounts for the lack of visibility in some areas. In nearly half of the regions in this dataset where stereo matching algorithms cannot produce results, ground truth provides disparity values.
The last step of the evaluation is calculating the completeness of ground truth. As far as this metric is concerned, there are some problems with PlantStereo: the result of the CPL metric is 0.86. Figure 6 presents a sample stereo pair and the corresponding ground truth in order to show what kind of areas are missing. In the ground truth presented in Figure 6c, there are black areas resembling shadows of leaves, located at the sides of the leaves. Therefore, the GT implies that disparities in these areas are undefined. However, as shown in Figure 6a,b, large parts of these areas are clearly visible from both cameras, so the ground truth in these areas should contain real disparity values. Figure 6 presents only a sample set of images in order to illustrate this problem; however, the problem occurs in all images included in the set. Other images included in PlantStereo also show plants in pots placed on a gray flat surface.
The occurrence of missing areas in ground truth is most likely caused by the configuration of the setup used for obtaining datasets. A structured light scanner was positioned so that, from the scanner’s point of view, the presence of leaves restricted the amount of space that could be scanned in areas behind leaves. Nevertheless, apart from these areas, ground truth in the PlantStereo dataset is dense because it was prepared with the use of a structured light 3D scanner.

4.5. Evaluation of KITTI

The KITTI dataset contains stereo images taken from a moving car. The authors stated that they calibrated their stereo camera every day of the data acquisition process [17]. This diligence is reflected in the values of the DCB and SDC metrics: the result of DCB is equal to 0.96 and the result of SDC is 0.88.
Unfortunately, such good values of metrics were not obtained for ground truth. In general, ground truth of the KITTI dataset is sparse because it was obtained with the use of a laser scanner. Nevertheless, additional processing of the scanner’s data allowed complete disparity coverage for views of significant objects such as cars. Sample images from this dataset are presented in Figure 1 in Section 2.3.2. Part (a) of this figure shows an image from the set, and part (b) is ground truth. It can be noticed that the ground truth of the car visible in the foreground of the image is dense.
Despite the fact that some areas are dense, the overall completeness of ground truth remains low. Results showed that for all images in the set, the value of CPL is equal to 0.19. As far as occluded areas are concerned, they do not occur in ground truth because it is sparse. On the basis of ground truth data, the POA metric is equal to 1.

4.6. Evaluation of MS2

The quality of calibration in the MS2 dataset is lower than in the KITTI dataset, as the result of DCB is equal to 0.72 for MS2. This indicates a discrepancy in the y-axis of almost half a pixel. There are no significant differences in the quality of calibration among stereo pairs included in MS2, as SDC is equal to 0.89.
Ground truth in MS2 is sparse to a similar extent as in the KITTI dataset because CPL is equal to 0.17 for MS2. Because of the sparsity of ground truth, there are no problems with occlusions. POA is equal to 1.

4.7. Evaluation of Synthetic Datasets

By synthetically creating image pairs, it is feasible to achieve flawless image calibration and to provide complete ground truth for the whole scene. Nevertheless, the problem with occlusions remains. In some synthetic datasets, this problem is addressed, and in others, it is ignored.
A synthetic dataset that considers occlusions is MPI Sintel [44]. The authors provided separate maps for occluded points and for out-of-frame points, i.e., points that cannot be matched because objects visible at the edge of the reference image do not fall within the field of view of the side camera. The value of the POA metric for this dataset is equal to 1.
It is not common for a synthetic dataset to provide maps of occlusions. In the absence of occlusion maps, the POA metric is equal to 0. Such a case occurs for the Scene Flow dataset containing the FlyingThings3D, Driving, and Monkaa sets [41].
Equation (4), used for identifying occluded points, can also be used for generating maps of occluded areas. An example of such usage is presented in Figure 7d, where occluded areas are marked in white. Figure 7a,b present the left and right images from the Monkaa dataset, and Figure 7c is the ground truth provided with this pair in the dataset. It can be noticed that the occluded areas cover a large part of the image.
Maps of occlusions are also missing in newer and larger synthetic datasets. In particular, the FoundationStereo dataset, despite its significance, does not contain maps of occlusions [39].
In general, the problem with synthetic images is that they lack the imperfections that occur with real equipment. These imperfections include inaccurate camera calibration, which not only exists in real cameras but also deteriorates over time [38]. In addition to variations in object locations, images of real objects frequently have variations in lighting, color, and reflections. Moreover, synthetic images often do not contain the irregular patterns that are present on the surfaces of real objects. Nevertheless, synthetic data is a useful data source for training a stereo matching algorithm.

4.8. Evaluation Summary

Table 5 presents results for all tested datasets. The results show that the Middlebury dataset is the best source for training stereo matching algorithms based on neural networks. Middlebury proved to have high quality according to every criterion: this set has precise calibration, a high level of completeness of ground truth, and it provides data regarding occluded areas. The ETH3D dataset also obtained high scores, apart from the fact that its ground truth does not fully cover the expected areas. The problem with Booster is that its calibration is inaccurate for some image pairs in this set. PlantStereo has inappropriately placed occluded areas. As far as KITTI is concerned, its ground truth was obtained from LIDAR, which makes the GT largely sparse. Full completeness of ground truth occurs in the case of synthetic datasets; however, because of the method used for obtaining these sets, they do not fully resemble the usage of stereo cameras in real environments.

5. Discussion

The main limitation of the presented evaluation method is that it does not consider the extent to which training data are relevant to test data for a specific application of a stereo matching algorithm. This problem can be illustrated in a scenario in which data from the Middlebury dataset would be used for training a network intended for use with the PlantStereo images. The Middlebury dataset has better scores than PlantStereo according to SCDQEM. However, both the training data and the test data in PlantStereo contain plants in pots on a gray surface; this training data has content similar to the test data. Therefore, a network trained with such data would be adapted to processing images of pots placed on a gray surface. Data from the Middlebury dataset would not be suitable to that extent because it contains different kinds of scenes. Nevertheless, the quality of PlantStereo would be higher if GT and occluded areas were marked correctly in this dataset. Another limitation of the proposed evaluation method is that it can be used only with datasets in which ground truth is in the form of a disparity map.
Datasets can still be improved after being released, and the improvement can concern all of their parameters. In particular, the quality of calibration can be improved in the Booster dataset. Results showed that this dataset contains stereo pairs in which there is a high discrepancy in the y-axis between points in reference images and corresponding points in side images. In order to reduce this discrepancy, it is sufficient to translate the side images along the y-axis; the extent of this translation is determined by the values obtained from Equation (1). Points of these images would be shifted in the vertical direction. The problem with this transformation is that it leaves rows with no image content at the bottom or at the top of the side images, which requires cropping these parts of the side images and the corresponding parts of the reference images, as illustrated in the sketch below.
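A minimal sketch of this correction is given below. It assumes that AVDISC was computed as the mean of (reference y minus side y), that a constant, whole-pixel vertical shift is sufficient, and that both images have the same height; the function name is illustrative.

```python
import numpy as np

def correct_y_shift(reference, side, avdisc):
    """Shift the side image vertically by the rounded AVDISC value and crop the
    rows that no longer have valid content in both images."""
    shift = int(round(avdisc))
    if shift == 0:
        return reference, side
    side_shifted = np.roll(side, shift, axis=0)
    if shift > 0:
        # Rows wrapped to the top of the side image are invalid; drop them from both images.
        return reference[shift:], side_shifted[shift:]
    # Negative shift: rows wrapped to the bottom are invalid.
    return reference[:shift], side_shifted[:shift]
```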
A more complicated modification is required to improve the PlantStereo dataset. The problem with this dataset is that it contains areas without correct disparities in ground truth. This problem can be fixed by merging data from different stereo pairs included in the dataset. Large parts of the missing data in ground truth occur on the gray board used for holding the pots visible in images from the PlantStereo dataset. Let us consider a stereo pair for which GT does not contain data in an area A where disparities related to the gray board should be present. The missing area A occurred because leaves of plants limited access to this area from the point of view of the structured light 3D scanner used for preparing the dataset. However, the same board and the same setup are used for collecting data with different configurations of pots. Therefore, parts of the board for which ground truth was not acquired for one of the stereo pairs are visible in other stereo pairs in which the placement of pots was different. Ground truth data from these other stereo pairs can be used to fill the missing area A.
There is also a possibility to improve datasets that do not consider occluded areas. In the case of synthetic datasets, the best way to provide maps of occlusions would be to use the 3D models of objects from which pairs of images and ground truth were obtained. However, even on the basis of ground truth alone, it is possible to calculate maps of occlusions using Equation (4) presented in Section 3.2. A sample of such a usage is presented in Figure 7d.

6. Conclusions

The main finding of the research presented in this paper is that the SCDQEM method can reliably estimate the quality of learning data for stereo matching algorithms. Another finding is that, currently, there is a lack of available datasets with dense ground truth. There are a large number of synthetic datasets and datasets with sparse ground truth based on LIDAR; however, the availability of data with dense ground truth is low. The research also showed that the Middlebury dataset is the one that has high quality in terms of all criteria.
The implication of these findings is that researchers of stereo matching algorithms will have a better basis for selecting data sources for training the algorithms that they use. The research presented in this paper contributes both a clear list of available data sources and an estimation of the quality of these data.
There are three main directions for further development of data for training stereo matching algorithms based on neural networks. One of these directions is preparing a greater amount of more diverse synthetic data; recently, Wen et al. from NVIDIA significantly contributed to this field by releasing the FoundationStereo dataset [39]. Another direction is preparing synthetic data in such a way that the datasets contain the inaccuracies occurring in real applications; Schmalfuss et al. adopted this approach by intentionally corrupting their synthetic data [43]. The last major direction is the preparation of stereo pairs with dense ground truth from real environments.

Funding

This work was supported, in part, by a ministry subsidy for research to Gdańsk University of Technology, Poland.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BRIEF   Binary Robust Independent Elementary Features
BRISK   Binary Robust Invariant Scalable Keypoints
GT      ground truth
LIDAR   Light Detection and Ranging
NIR     Near Infrared
NLP     Natural Language Processing
SIFT    Scale-Invariant Feature Transform
SLAM    Simultaneous Localization and Mapping
SURF    Speeded Up Robust Features

References

  1. Guo, X.; Yang, K.; Yang, W.; Wang, X.; Li, H. Group-wise Correlation Stereo Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3273–3282. [Google Scholar]
  2. Knöbelreiter, P.; Sormann, C.; Shekhovtsov, A.; Fraundorfer, F.; Pock, T. Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems. In Proceedings of the The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  3. Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 2287–2318. [Google Scholar]
  4. Scharstein, D.; Szeliski, R. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  5. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  6. Kaczmarek, A.L. A Method for Adapting Stereo Matching Algorithms to Real Environments. Appl. Sci. 2025, 15, 4070. [Google Scholar] [CrossRef]
  7. Kaczmarek, A.L.; Blaschitz, B. Equal Baseline Camera Array—Calibration, Testbed and Applications. Appl. Sci. 2021, 11, 8464. [Google Scholar] [CrossRef]
  8. Błaszczak-Bąk, W.; Kamiński, W.; Bednarczyk, M.; Suchocki, C.; Masiero, A. Real-Time DTM Generation with Sequential Estimation and OptD Method. Appl. Sci. 2025, 15, 4068. [Google Scholar] [CrossRef]
  9. Zhang, Z.; Wei, C.; Wu, G.; Barth, M.J. Vulnerable Road User Detection for Roadside-Assisted Safety Protection: A Comprehensive Survey. Appl. Sci. 2025, 15, 3797. [Google Scholar] [CrossRef]
  10. Gupta, M.; Yin, Q.; Nayar, S.K. Structured Light in Sunlight. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 545–552. [Google Scholar] [CrossRef]
  11. Seitz, S.M.; Curless, B.; Diebel, J.; Scharstein, D.; Szeliski, R. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 1, pp. 519–528. [Google Scholar] [CrossRef]
  12. Middlebury Stereo Evaluation-Version 3. Available online: https://vision.middlebury.edu/stereo/eval3/ (accessed on 17 October 2025).
  13. KITTI Vision Benchmark Suite. Available online: https://www.cvlibs.net/datasets/kitti/ (accessed on 17 October 2025).
  14. Kaczmarek, A.L. 3D Vision System for a Robotic Arm Based on Equal Baseline Camera Array. J. Intell. Robot. Syst. 2019, 99, 13–28. [Google Scholar] [CrossRef]
  15. Bradski, D.G.R.; Kaehler, A. Learning Opencv, 1st ed.; O’Reilly Media, Inc.: Newton, MA, USA, 2008. [Google Scholar]
  16. Scharstein, D.; Hirschmüller, H.; Kitajima, Y.; Krathwohl, G.; Nešić, N.; Wang, X.; Westling, P. High-Resolution Stereo Datasets with Subpixel-Accurate Ground Truth. In Proceedings of the Pattern Recognition, GCPR 2014, Münster, Germany, 2–5 September 2014; Jiang, X., Hornegger, J., Koch, R., Eds.; Springer: Cham, Switzerland, 2014; pp. 31–42. [Google Scholar]
  17. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. (IJRR) 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  18. Schöps, T.; Schönberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  19. Zama Ramirez, P.; Tosi, F.; Poggi, M.; Salti, S.; Di Stefano, L.; Mattoccia, S. Open Challenges in Deep Stereo: The Booster Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  20. Ramirez, P.Z.; Costanzino, A.; Tosi, F.; Poggi, M.; Salti, S.; Mattoccia, S.; Stefano, L.D. Booster: A Benchmark for Depth from Images of Specular and Transparent Surfaces. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 85–102. [Google Scholar] [CrossRef] [PubMed]
  21. NTIRE 2025: HR Depth from Images of Specular and Transparent Surfaces. Available online: https://cvlab-unibo.github.io/booster-web/ntire25.html (accessed on 17 October 2025).
  22. Wang, Q.; Wu, D.; Liu, W.; Lou, M.; Jiang, H.; Ying, Y.; Zhou, M. PlantStereo: A High Quality Stereo Matching Dataset for Plant Reconstruction. Agriculture 2023, 13, 330. [Google Scholar] [CrossRef]
  23. Bao, W.; Wang, W.; Xu, Y.; Guo, Y.; Hong, S.; Zhang, X. InStereo2K: A large real dataset for stereo matching in indoor scenes. Sci. China Inf. Sci. 2020, 63, 212101. [Google Scholar] [CrossRef]
  24. Hua, Y.; Kohli, P.; Uplavikar, P.; Ravi, A.; Gunaseelan, S.; Orozco, J.; Li, E. Holopix50k: A Large-Scale In-the-wild Stereo Image Dataset. arXiv 2020, arXiv:2003.11172. [Google Scholar]
  25. Wang, Y.; Wang, L.; Yang, J.; An, W.; Guo, Y. Flickr1024: A Large-Scale Dataset for Stereo Image Super-Resolution. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3852–3857. [Google Scholar] [CrossRef]
  26. Wang, C.; Lucey, S.; Perazzi, F.; Wang, O. Web Stereo Video Supervision for Depth Prediction from Dynamic Scenes. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Québec City, QC, Canada, 16–19 September 2019; pp. 348–357. [Google Scholar] [CrossRef]
  27. Shin, U.; Park, J.; Kweon, I.S. Deep Depth Estimation From Thermal Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1043–1053. [Google Scholar]
  28. Zhu, A.Z.; Thakur, D.; Özaslan, T.; Pfrommer, B.; Kumar, V.; Daniilidis, K. The Multivehicle Stereo Event Camera Dataset: An Event Camera Dataset for 3D Perception. IEEE Robot. Autom. Lett. 2018, 3, 2032–2039. [Google Scholar] [CrossRef]
  29. Yang, G.; Song, X.; Huang, C.; Deng, Z.; Shi, J.; Zhou, B. DrivingStereo: A Large-Scale Dataset for Stereo Matching in Autonomous Driving Scenarios. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 899–908. [Google Scholar] [CrossRef]
  30. Chang, M.F.; Lambert, J.W.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A.; Wang, D.; Carr, P.; Lucey, S.; Ramanan, D.; et al. Argoverse: 3D Tracking and Forecasting with Rich Maps. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  31. Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hartnett, A.; Pontes, J.K.; et al. Argoverse 2: Next Generation Datasets for Self-driving Perception and Forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), Virtual Conference, 5–14 December 2021. [Google Scholar]
  32. Lambert, J.; Hays, J. Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), Virtual Conference, 5–14 December 2021. [Google Scholar]
  33. Sivaprakasam, M.; Maheshwari, P.; Castro, M.G.; Triest, S.; Nye, M.; Willits, S.; Saba, A.; Wang, W.; Scherer, S. TartanDrive 2.0: More Modalities and Better Infrastructure to Further Self-Supervised Learning Research in Off-Road Driving Tasks. arXiv 2024, arXiv:2402.01913. [Google Scholar] [CrossRef]
  34. Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000km: The Oxford RobotCar Dataset. Int. J. Robot. Res. (IJRR) 2017, 36, 3–15. [Google Scholar] [CrossRef]
  35. Maddern, W.; Pascoe, G.; Gadd, M.; Barnes, D.; Yeomans, B.; Newman, P. Real-time Kinematic Ground Truth for the Oxford RobotCar Dataset. arXiv 2020, arXiv:2002.10152. [Google Scholar]
  36. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
  37. Zhao, T.; He, J.; Lv, J.; Min, D.; Wei, Y. A Comprehensive Implementation of Road Surface Classification for Vehicle Driving Assistance: Dataset, Models, and Deployment. IEEE Trans. Intell. Transp. Syst. 2023, 24, 8361–8370. [Google Scholar] [CrossRef]
  38. Moravec, J.; Šára, R. High-recall calibration monitoring for stereo cameras. Pattern Anal. Appl. 2024, 27, 41. [Google Scholar] [CrossRef]
  39. Wen, B.; Trepte, M.; Aribido, J.; Kautz, J.; Gallo, O.; Birchfield, S. FoundationStereo: Zero-Shot Stereo Matching. arXiv 2025, arXiv:2501.09898. [Google Scholar]
  40. Tremblay, J.; To, T.; Birchfield, S. Falling Things: A Synthetic Dataset for 3D Object Detection and Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Los Alamitos, CA, USA, 18–22 June 2018; pp. 2119–21193. [Google Scholar] [CrossRef]
  41. Mayer, N.; Ilg, E.; Häusser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  42. Mehl, L.; Schmalfuss, J.; Jahedi, A.; Nalivayko, Y.; Bruhn, A. Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  43. Schmalfuss, J.; Oei, V.; Mehl, L.; Bartsch, M.; Agnihotri, S.; Keuper, M.; Bruhn, A. RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo. arXiv 2025, arXiv:2505.09368. [Google Scholar] [CrossRef]
  44. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A Naturalistic Open Source Movie for Optical Flow Evaluation. In Proceedings of the Computer Vision–ECCV 2012, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 611–625. [Google Scholar]
  45. Cabon, Y.; Murray, N.; Humenberger, M. Virtual KITTI 2. arXiv 2020, arXiv:2001.10773. [Google Scholar] [CrossRef]
  46. Gaidon, A.; Wang, Q.; Cabon, Y.; Vig, E. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4340–4349. [Google Scholar]
  47. Deschaud, J. KITTI-CARLA: A KITTI-like dataset generated by CARLA Simulator. arXiv 2021, arXiv:2109.00892. [Google Scholar] [CrossRef]
  48. Dosovitskiy, A.; Ros, G.; Codevilla, F.; López, A.M.; Koltun, V. CARLA: An Open Urban Driving Simulator. arXiv 2017, arXiv:1711.03938. [Google Scholar] [CrossRef]
  49. Wang, Q.; Zheng, S.; Yan, Q.; Deng, F.; Zhao, K.; Chu, X. IRS: A Large Naturalistic Indoor Robotics Stereo Dataset to Train Deep Models for Disparity and Surface Normal Estimation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  50. Karaev, N.; Rocco, I.; Graham, B.; Neverova, N.; Vedaldi, A.; Rupprecht, C. DynamicStereo: Consistent Dynamic Depth from Stereo Videos. arXiv 2023, arXiv:2305.02296. [Google Scholar] [CrossRef]
  51. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J.J.; Mur-Artal, R.; Ren, C.Y.; Verma, S.; et al. The Replica Dataset: A Digital Replica of Indoor Spaces. arXiv 2019, arXiv:1906.05797. [Google Scholar] [CrossRef]
  52. Fisher, R. CVonline: Image Databases. Available online: http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm (accessed on 17 October 2025).
  53. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the Computer Vision–ECCV 2012, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760. [Google Scholar]
  54. Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape Open Dataset for Autonomous Driving and Its Application. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2702–2719. [Google Scholar] [CrossRef]
  55. Wang, S.; Bai, M.; Máttyus, G.; Chu, H.; Luo, W.; Yang, B.; Liang, J.; Cheverie, J.; Fidler, S.; Urtasun, R. TorontoCity: Seeing the World with a Million Eyes. arXiv 2016, arXiv:1612.00423. [Google Scholar] [CrossRef]
  56. Kondermann, D.; Nair, R.; Honauer, K.; Krispin, K.; Andrulis, J.; Brock, A.; Güssefeld, B.; Rahimimoghaddam, M.; Hofmann, S.; Brenner, C.; et al. The HCI Benchmark Suite: Stereo and Flow Ground Truth with Uncertainties for Urban Autonomous Driving. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 19–28. [Google Scholar] [CrossRef]
  57. Li, J.; Wang, P.; Xiong, P.; Cai, T.; Yan, Z.; Yang, L.; Liu, J.; Fan, H.; Liu, S. Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16242–16251. [Google Scholar] [CrossRef]
  58. Yang, G.; Manela, J.; Happold, M.; Ramanan, D. Hierarchical Deep Stereo Matching on High-Resolution Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5510–5519. [Google Scholar] [CrossRef]
  59. Camera Calibration and 3D Reconstruction - OpenCV 4.13.0 Documentation. Available online: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html (accessed on 17 October 2025).
  60. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  61. Bay, H.; Ess, A.; Tuytelaars, T.; Gool, L.V. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar] [CrossRef]
  62. Liu, R.; Zhang, H.; Liu, M.; Xia, X.; Hu, T. Stereo Cameras Self-Calibration Based on SIFT. In Proceedings of the 2009 International Conference on Measuring Technology and Mechatronics Automation, Zhangjiajie, China, 11–12 April 2009; Volume 1, pp. 352–355. [Google Scholar] [CrossRef]
  63. Leutenegger, S.; Chli, M.; Siegwart, R.Y. BRISK: Binary Robust invariant scalable keypoints. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2548–2555. [Google Scholar] [CrossRef]
  64. Calonder, M.; Lepetit, V.; Strecha, C.; Fua, P. BRIEF: Binary Robust Independent Elementary Features. In Proceedings of the Computer Vision–ECCV 2010, Crete, Greece, 5–11 September 2010; Daniilidis, K., Maragos, P., Paragios, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 778–792. [Google Scholar]
  65. Alcantarilla, P.F.; Nuevo, J.; Bartoli, A. Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces. In Proceedings of the British Machine Vision Conference, Bristol, UK, 9–13 September 2013. [Google Scholar]
  66. Karami, E.; Prasad, S.; Shehata, M. Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images. arXiv 2017, arXiv:1710.02726. [Google Scholar] [CrossRef]
  67. OpenCV 4.13.0-dev: Basics of Brute-Force Matcher. Available online: https://docs.opencv.org/4.x/dc/dc3/tutorial_py_matcher.html (accessed on 23 November 2025).
Figure 1. Sample data from the KITTI dataset: (a) an image from a stereo camera; (b) ground truth provided for the image presented in part (a).
Figure 2. Sample data from the MS2 dataset: (a) an image from a stereo camera; (b) ground truth provided for the image presented in part (a).
Figure 3. A sample arrangement of objects causing the occurrence of an occluded area: (a) an image from a left camera, (b) an image from a right camera, and (c) ground truth containing an occluded area marked with black color obtained for the image presented in part (a).
Figure 4. Sample ground truth with missing disparities from the Middlebury dataset.
Figure 5. Sample data from the ETH3D dataset: (a) an image from the left camera; (b) ground truth provided for the image presented in part (a).
Figure 6. Sample data from the PlantStereo dataset: (a) an image from the left camera, (b) an image from the right camera, and (c) ground truth provided for the image presented in part (a).
Figure 7. Sample data from the Monkaa dataset: (a) an image from the left camera, (b) an image from the right camera, (c) ground truth provided for the image presented in part (a), and (d) the map of occlusions generated on the basis of (5) for the image presented in part (a).
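The occlusion maps shown in Figure 7d are generated from Equation (5) in the main text. As an illustration only, the following Python sketch derives a comparable map from a pair of dense ground-truth disparity maps using a plain left-right consistency test; this is a common way of flagging occlusions, but it is an assumption made here for demonstration purposes and not necessarily the exact criterion of Equation (5). The function name and the one-pixel threshold are likewise illustrative.

```python
import numpy as np

def occlusion_map(disp_left, disp_right, threshold=1.0):
    """Flag pixels of the left disparity map whose correspondence in the
    right disparity map does not point back to them (left-right consistency).
    Returns a boolean map in which True marks a pixel treated as occluded."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))              # x coordinate of every left pixel
    ys = np.tile(np.arange(h)[:, None], (1, w))     # y coordinate of every left pixel
    # Column in the right image that each left pixel maps to under its disparity.
    x_right = np.clip(np.round(xs - disp_left).astype(int), 0, w - 1)
    disp_back = disp_right[ys, x_right]
    return np.abs(disp_left - disp_back) > threshold

# Toy check with mutually consistent maps (assumed values): no pixel is flagged.
dl = np.full((4, 8), 2.0)
dr = np.full((4, 8), 2.0)
print(occlusion_map(dl, dr).any())  # False
```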
Table 1. The quantity of different kinds of stereo images in the PlantStereo dataset.
Subset | Pepper | Pumpkin | Spinach | Tomato
training | 180 | 100 | 900 | 100
testing | 32 | 50 | 100 | 50
Table 2. Datasets containing pairs of images and ground truth acquired from LIDAR.
Dataset | Amount | Content | Date
KITTI [13] | 394 | streets in Germany | 2015
MS2 [27] | 180,000 | streets in China | 2023
MVSEC [28] | n.a. | streets and indoor areas | 2018
DrivingStereo [29] | 174,437 | streets in China | 2019
Argoverse [30,31,32] | over 1000 | streets in U.S. | 2019
TartanDrive 2.0 [33] | 7 h of videos | off-road areas | 2024
Oxford RobotCar [34,35] | 100 videos | streets in UK | 2016
EuRoC MAV [36] | n.a. | data from a drone, ground truth in the form of a 3D scan | 2016
Road Surface Dataset [37] | 300 | views of bumps on roads | 2023
Table 3. Synthetic datasets containing pairs of images and ground truth.
Dataset | Amount | Content | Technology | Release Date
FoundationStereo [39] | 1 million | large diversity and high photorealism | NVIDIA Omniverse | 2025
Falling Things [40] | 60,000 | household objects | Unreal Engine 4 | 2018
FlyingThings3D, Driving, Monkaa [41] | 39,000 | animated objects | Blender | 2016
Spring [42,43] | 6000 | open-source animated movie “Spring” | Blender | 2023
MPI Sintel [44] | 1000 | animated movie | Blender | 2012
Virtual KITTI 2 [45,46] | over 10,000 | photo-realistic street views inspired by KITTI | Unity | 2020
KITTI-CARLA [47,48] | 5000 | street views resembling data from KITTI | CARLA simulator | 2021
IRS [49] | 100,000 | naturally looking indoor spaces | Unreal Engine 4 | 2021
Dynamic Replica [50,51] | 524 videos | humans and animals in everyday scenes | Facebook Replica | 2023
Table 4. Results of evaluating the calibration of images in the PlantStereo dataset.
Dataset | Size | DC | BSDC | min Y disc | max Y disc
pepper testing | 32 | 0.71 | 0.83 | −0.038 | 0.767
pepper training | 180 | 0.79 | 0.79 | −0.758 | 0.933
pumpkin testing | 50 | 0.93 | 0.83 | −0.260 | 0.452
pumpkin training | 100 | 0.89 | 0.85 | −0.238 | 0.614
spinach testing | 100 | 0.99 | 0.88 | −0.320 | 0.278
spinach training | 900 | 0.99 | 0.83 | −0.607 | 0.730
tomato testing | 50 | 0.90 | 0.87 | −0.171 | 0.539
tomato training | 100 | 0.88 | 0.85 | −0.300 | 0.561
total | 1512 | 0.93 | 0.79 | −0.748 | 0.933
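Table 4 characterizes the calibration of the PlantStereo image pairs partly through vertical offsets between corresponding points (the min Y disc and max Y disc columns). The sketch below is a minimal illustration of how such offsets can be measured: it matches local features between a rectified left-right pair with OpenCV and reports the extreme and mean vertical displacements of the matches. It uses ORB with a brute-force matcher because ORB ships with the default OpenCV build, whereas SURF [61] requires the non-free contrib modules; the function name, parameters, and file names are assumptions made for this example and do not reproduce the exact procedure behind Table 4.

```python
import cv2
import numpy as np

def vertical_offset_stats(left_img, right_img, ratio=0.75):
    """Match keypoints between a rectified stereo pair and return the minimum,
    maximum, and mean absolute vertical offset (in pixels) of the matches.
    In a well-calibrated, rectified pair these offsets stay close to zero."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp_l, des_l = orb.detectAndCompute(left_img, None)
    kp_r, des_r = orb.detectAndCompute(right_img, None)

    # Brute-force Hamming matching with Lowe's ratio test to drop ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des_l, des_r, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    if not good:
        raise ValueError("no reliable matches found")

    y_offsets = np.array([kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1]
                          for m in good])
    return y_offsets.min(), y_offsets.max(), np.abs(y_offsets).mean()

# Hypothetical usage on one stereo pair (file names are assumptions):
# left = cv2.imread("left_0001.png", cv2.IMREAD_GRAYSCALE)
# right = cv2.imread("right_0001.png", cv2.IMREAD_GRAYSCALE)
# print(vertical_offset_stats(left, right))
```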
Table 5. Results of the evaluation for different testbeds.
Datasets | DC | BSDC | POA | CPL | Notes
Middlebury | 0.96 | 0.85 | 1 | 0.96
ETH3D | 0.99 | 0.87 | 1 | 0.60
Booster | 0.83 | 0.24 | 1 | 0.95
PlantStereo | 0.93 | 0.79 | 0.51 | 0.86
KITTI | 0.93 | 0.88 | 1 | 0.19 | sparse GT
MS2 | 0.72 | 0.89 | 1 | 0.17 | sparse GT
Scene Flow (Mayer et al. [41]) | 1 | 1 | 0 | 1 | synthetic
FoundationStereo | 1 | 1 | 0 | 1 | synthetic
MPI Sintel | 1 | 1 | 1 | 1 | synthetic
Other synthetic datasets | 1 | 1 | either 1 or 0 | 1 | synthetic
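The sparse GT notes in Table 5 refer to ground truth that covers only a fraction of the image area, which is typical of LIDAR-based datasets such as KITTI and MS2. Assuming, for illustration, that ground-truth density is simply the share of pixels carrying a valid disparity value, it can be computed as in the short sketch below; the convention that missing pixels are stored as zeros follows KITTI-style disparity files, and the function name is an assumption made for this example.

```python
import numpy as np

def ground_truth_density(disparity, invalid_value=0.0):
    """Fraction of pixels that carry a valid ground-truth disparity.
    Pixels equal to invalid_value (zero in KITTI-style files) count as missing."""
    valid = disparity != invalid_value
    return np.count_nonzero(valid) / disparity.size

# Toy example: 4 of the 16 pixels have no ground truth, so the density is 0.75.
gt = np.array([[0, 12, 15, 0],
               [10, 11, 0, 14],
               [9, 0, 13, 12],
               [8, 9, 10, 11]], dtype=np.float32)
print(ground_truth_density(gt))  # 0.75
```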