1. Summary
3D reconstruction from a photogrammetry standpoint is the process of inferring geometric features of a scene from a set of pictures. These geometric features include depth information from the original scene, which is unavoidably lost when capturing images. There is a broadly known and well-established process to perform this task called the photogrammetric pipeline, ref. [1]; nevertheless, several stages of this process require handcrafted tasks to be performed. We introduce salient object detection (SOD) as a key topic for 3D reconstruction, since this technique enables the filtering of non-relevant information from images by identifying the regions that belong to the objects of interest, which are known as proto-objects. Formally, the SOD problem is defined as follows: given a set of images and their corresponding proto-objects, the SOD model establishes a relationship between each image and its proto-object. This information filtering technique becomes convenient for feeding only useful 2D information to the photogrammetric pipeline, thus avoiding manual postprocessing in the 3D domain. Therefore, the SOD3D dataset allows researchers to evaluate the impact of an SOD stage applied to 3D reconstruction and contrast the results between multiple visual attention algorithms. In this work, we provide as examples one symbolic approach and three sub-symbolic approaches based on deep learning (DL) techniques.
The SOD3D dataset is designed to test the application of SOD to enhance the photogrammetric 3D reconstruction; thus, we explored the state of the art in datasets related to these two problems: SOD and photogrammetric 3D reconstruction. For salient object detection, the benchmark datasets were developed in the previous decade. Ref. [2] was presented in 2015 with the main goal of addressing the limitations of earlier datasets, which were considered to be simple and not representative of real-world conditions. Later that year, ref. [3] was introduced, seeking to add complexity regarding foreground–background contrast and multiple salient objects. One of the largest and most widely used benchmarks is ref. [4], developed in 2017, claiming to be more challenging in terms of image complexity. Later, in 2022, ref. [5] was presented, introducing new challenges like blurry images, object occlusion, and background clutter, with the remarkable new consideration of having no salient objects in some of the presented images. In addition to the fact that these SOD datasets lack high-resolution images, none of them is focused on the problem of 3D reconstruction. However, more recently introduced datasets do address the problem of 3D reconstruction. Ref. [6] was presented by Meta in 2021, including over 1.5 million images from 50 different object categories, making it the most extensive dataset designed for the 3D reconstruction problem. Ref. [7] was later introduced in 2023, focused on a building with challenging textures. Introduced in 2025 for the medical field, ref. [8] addresses the problem of non-invasive monitoring of chronic wounds. Besides the fact that the camera network for acquiring the images for these 3D datasets is not standardized, none of them includes manually segmented versions of the images where the object of interest is isolated from the background, since they do not consider SOD as part of the 3D reconstruction problem. From the state of the art, we can see that the uniqueness of the proposed dataset comes from identifying an overlap and addressing critical concerns from both the SOD and 3D reconstruction problems: high-resolution imagery, standardized acquisition, and manually segmented ground truths.
In summary, the public release of this dataset offers several key advantages for researchers working in the field of photogrammetric 3D reconstruction:
To the best of our knowledge, our proposal establishes a benchmark through a high-resolution dataset of original and processed images specifically designed for studying visual attention algorithms applied to photogrammetric 3D reconstruction.
The dataset was developed by acquiring original images in a controlled and standardized manner, allowing researchers to fairly assess the performance of different SOD algorithms from a 3D reconstruction perspective.
The release of this data enables researchers to measure the impact of an SOD stage on 3D reconstruction and compare multiple visual attention algorithms: one based on a symbolic approach and three additional sub-symbolic approaches based on DL techniques.
By including both automatically produced saliency maps and manually generated ground truth images, the dataset provides the necessary baselines to fairly evaluate the performance of different SOD algorithms against a defined metric.
2. Data Description
SOD3D is a testing dataset built to evaluate the impact of an SOD stage applied to 3D reconstruction. It consists of images of 28 objects of different sizes, shapes, and textures, as well as their processed versions. For each object, 36 original photographs were taken from various points of view, and a manually segmented ground truth is included for each original image as a baseline for SOD algorithm evaluation. Saliency maps generated with four SOD algorithms are also included: one symbolic, based on the brain programming methodology, and three sub-symbolic, based on DL techniques. Additionally, binarized versions of each saliency map, referred to as proto-objects, are included to evaluate the performance of the SOD algorithms. Overlays obtained by masking the original images with the proto-objects are included as an additional image set. Finally, a set of 3D reconstructions is provided in the form of point clouds. The repository’s root contains a directory for each object. Inside these directories are folders for the original images and their processed versions. The generic folder structure of the repository is depicted in Figure 1.
Inside each object’s folder, a total of six folders contain the original images, the manually segmented ground truth, the saliency maps generated by each SOD algorithm, the proto-objects derived from the saliency maps, the overlays generated from masking the original images with the proto-objects, and the final reconstructions resulting from each object after applying each proposed technique. As shown in Figure 1, the Original and GroundTruth folders contain the raw images and their manually segmented versions. Then, the SaliencyMap and ProtoObject folders share the same structure, where an independent directory is reserved for each SOD algorithm. Similarly, Overlay includes a folder for the data derived from each SOD algorithm, with the addition of a folder called GroundTruth, which contains the data derived from processing the original images using the manually segmented ground truth, to serve as a reference. Finally, the Reconstruction folder contains the available 3D reconstructions in the form of point clouds for each applied technique. Each file name includes a prefix that contains identifiers that encode the related object and the corresponding processing stage.
Table 1 shows the naming conventions and file formats for each stage in the pipeline.
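As a rough illustration of the layout described above, the following sketch lists the folders one would expect inside a single object directory. The six top-level folder names come from the description; the per-algorithm subfolder names and the object name are hypothetical placeholders, since the actual names are defined by Figure 1 and Table 1.

```python
from pathlib import PurePosixPath

# Hypothetical subfolder names for the four SOD algorithms (assumption).
ALGORITHMS = ["GBVSBP", "LeNo", "EDN", "InSPyReNet"]


def expected_folders(object_name, algorithms=ALGORITHMS):
    """List the folders expected inside one object directory."""
    root = PurePosixPath(object_name)
    folders = [root / "Original", root / "GroundTruth"]
    # SaliencyMap, ProtoObject, and Overlay hold one subfolder per algorithm.
    for stage in ("SaliencyMap", "ProtoObject", "Overlay"):
        folders += [root / stage / name for name in algorithms]
    # Overlay additionally holds the reference derived from the ground truth.
    folders.append(root / "Overlay" / "GroundTruth")
    folders.append(root / "Reconstruction")
    return folders


for folder in expected_folders("Object01"):  # "Object01" is a placeholder
    print(folder)
```

The function only builds path names and does not touch the file system, so it can be adapted once the real identifiers are known.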
3. Methods
This section, divided into two parts, outlines the methodology proposed to acquire and process the data. It covers the initial setup for collecting raw data and the subsequent processing required to generate the final products included in the dataset.
3.1. Initial Setup
This section, divided into four parts, details how the images that constitute the core of the dataset were acquired. We start by describing the camera used for the acquisition and the optical parameters selected for this purpose, then explain the camera network configuration, continue with the objects selected for the dataset, and finally describe the environmental conditions.
3.1.1. Optical Parameters
3D reconstruction in photogrammetry heavily relies on acquiring images where objects of interest appear crisp enough for the features on their surface to be properly detected, described, and identified. For this purpose, it is crucial that the whole surface of the object of interest appears sharp and in focus in each image. To meet this requirement, an appropriate camera and optical parameter configuration must be selected.
The camera setup selected for the image acquisition task was a Canon EOS 5D Mark III with a Canon EF 24–105 mm f/4 IS USM zoom lens. This camera has a full-frame sensor delivering a maximum spatial resolution of 5760 × 3840 pixels. The zoom lens can be adjusted to any focal length between 24 and 105 mm.
The depth of field is the distance between the closest and farthest objects that appear acceptably sharp in an image. To acquire images where the object of interest appears entirely focused, the depth of field must be deep enough to cover the entire object of interest, at the very least, regardless of camera position and attitude during the acquisition. Since the focus distance was set to be constant and relatively short (1 m) due to space limitations, the aperture and focal length were the parameters to be adjusted. The focal length was kept at relatively small values considering the short focus distance; however, to minimize perspective distortion, the shortest available focal length was avoided as much as possible, depending on the dimensions of the object of interest. Thus, focal length values between 24 and 35 mm were set during the acquisition. The only remaining optical parameter related to focus was the aperture, which is crucial for defining the image’s exposure.
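The reasoning above can be checked with a short calculation. The sketch below uses the standard thin-lens depth-of-field formulas based on the hyperfocal distance; the circle of confusion of 0.03 mm is an assumed value often quoted for full-frame sensors, and the f/8 aperture is the value reported later in this section. It is an approximation for illustration, not the procedure used during acquisition.

```python
def depth_of_field_mm(focal_mm, f_number, focus_mm, coc_mm=0.03):
    """Return the (near, far) limits of acceptable sharpness in millimetres.

    Thin-lens approximation; coc_mm is an assumed circle of confusion.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = focus_mm * (hyperfocal - focal_mm) / (hyperfocal + focus_mm - 2 * focal_mm)
    if focus_mm >= hyperfocal:
        # Beyond the hyperfocal distance everything up to infinity is sharp.
        return near, float("inf")
    far = focus_mm * (hyperfocal - focal_mm) / (hyperfocal - focus_mm)
    return near, far


# Focal lengths at both ends of the range used, focused at 1 m with f/8.
for focal in (24, 35):
    near, far = depth_of_field_mm(focal, f_number=8, focus_mm=1000)
    print(f"{focal} mm: sharp from {near:.0f} mm to {far:.0f} mm "
          f"(depth of field {far - near:.0f} mm)")
```

Under these assumptions the depth of field shrinks from roughly 1 m at 24 mm to roughly 0.4 m at 35 mm, which shows why shorter focal lengths leave more margin around larger objects.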
The so-called exposure triangle comprises three key camera settings that work together to determine how light is captured and directly impact the final image exposure. The three parameters are the aperture, sensitivity, and shutter speed. The aperture consists of the actual diaphragm opening that controls the amount of light impacting the sensor. The sensitivity is the level of amplification of the amount of light detected by the sensor. The shutter speed is the period the sensor remains exposed to light during the capture. There is a trade-off between these three parameters since they are highly interrelated, and adjusting each one directly impacts the final exposure in the image.
To avoid known noise effects (e.g., grainy patterns in low-contrast areas) in the final image, sensitivity (ISO) was set to a low value of 200, which is also recommended for artificially illuminated building interiors. To ensure a suitable depth of field, an aperture value of f/8 was set. Low light levels resulting from the selected small aperture had to be compensated through longer exposure times, which were variable and automatically selected on each capture depending on the scene light conditions. For this purpose, the aperture priority setting was used during the whole acquisition. Finally, the high dynamic range (HDR) setting was active for each capture to ensure a homogeneous level of detail over the entire image. In this mode, three images with different exposure levels are acquired for each capture and automatically mixed to obtain an HDR image. Starting from an image with an ideal exposure value of 0, an under-exposed image (exposure value of −3) is acquired to retain details in bright areas, and an over-exposed image (exposure value of +3) is also acquired to keep details in dark areas. Then, the three images are automatically merged into a single capture with evenly spread details across the whole image. This process is performed internally by the Canon EOS 5D Mark III as part of its HDR capabilities.
3.1.2. Camera Network Configuration
Photographs from different camera positions were acquired for each object with the purpose of recovering its three-dimensional features. Figure 2 depicts the camera network configuration used for the captures from side and top views. The camera positions were defined to maximize the object’s surface coverage while keeping them standard across the whole dataset despite variability in object shapes. For this purpose, 36 camera positions were defined over a 1 m radius spherical surface, where three imaginary rings were proposed: 0°, 30°, and 60° from the object’s vertical center. In the lower ring, which is 0° from the object’s vertical center (blue in Figure 2), 16 camera positions were established, 22.5° from one another. For the middle ring, which is 30° from the object’s vertical center (purple in Figure 2), 12 camera positions were established, 30° from one another. Finally, for the top ring, which is 60° from the object’s vertical center (red in Figure 2), eight camera positions were established, 45° from one another. This camera network configuration maintains a constant 1 m distance from the object’s vertical center for each photograph. Camera positions and orientations for the proposed camera network configuration were fixed manually, aided by markings located on the floor and a camera tripod with an adjustable height feature.
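The camera network described above can be sketched numerically as follows. The coordinate frame is an assumption made for illustration: the object’s vertical center sits at the origin, the z axis points up, ring angles are treated as elevation angles, and each ring starts at an azimuth of 0° (the actual starting azimuth is not specified).

```python
import math

# (elevation in degrees, number of cameras) for the lower, middle, and top rings.
RINGS = [(0.0, 16), (30.0, 12), (60.0, 8)]


def camera_positions(radius_m=1.0, rings=RINGS):
    """Return camera centers (x, y, z) in metres on a sphere around the origin."""
    positions = []
    for elevation_deg, count in rings:
        elevation = math.radians(elevation_deg)
        for k in range(count):
            azimuth = 2.0 * math.pi * k / count  # evenly spaced around the ring
            x = radius_m * math.cos(elevation) * math.cos(azimuth)
            y = radius_m * math.cos(elevation) * math.sin(azimuth)
            z = radius_m * math.sin(elevation)
            positions.append((x, y, z))
    return positions


positions = camera_positions()
print(len(positions), "camera positions")
```

Every generated position lies at the same 1 m distance from the origin, which mirrors the constant camera-to-object distance of the configuration.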
3.1.3. Object Selection
The dataset includes pictures of 28 different objects. The object list was carefully crafted, including inanimate objects of various sizes, shapes, and surface textures. Object sizes were defined considering space availability and optical constraints. Object heights ranging from 15 to 60 cm were selected to meet the aforementioned conditions. The shapes of the selected objects vary from low complexity for those whose shapes are similar to basic primitives (cubes, spheres, prisms, etc.) to medium to high complexity for those whose shapes are more organic (human-like, articulated, and so on). For the textures, objects made of different materials were selected to evaluate the impact of those features on the final 3D reconstruction while keeping reflective surfaces to a minimum, considering that this could cause problems during reconstruction due to poor feature extraction.
Figure 3 shows example captures for the selected objects.
3.1.4. Environmental Conditions
3D reconstruction in photogrammetry heavily relies on acquiring images where the objects’ surface features are visible so they can be properly detected, described, and identified. For this purpose, a suitable environment for object photography was selected, aiming to have standard light conditions across the whole image acquisition stage. The place selected for this task was a building interior space with artificial cold white light, where the background remains static for each of the objects’ image sets. Slight changes in background configuration (e.g., background objects’ location and position) between different object sets were allowed. To ensure a variable background setting, the objects were placed on 28 paper wraps of different colors, textures, and patterns, intentionally causing consistent changes in the background between different objects. Hence, we associate each object with a unique paper wrap.
3.2. Data Processing
This section provides a description of the raw data processing procedures required to generate the final products included in the dataset. First, the background removal methodologies are presented, encompassing both manual processing for ground truth generation and automated approaches based on SOD algorithms. Next, the procedures for image binarization and overlay generation are described. Finally, the 3D reconstruction process is outlined.
3.2.1. Background Removal
This subsection is divided into three parts: ground truth, SOD algorithms, and binarization. The database includes manually segmented ground truth for each of the original images to serve the purpose of evaluating SOD algorithms. Furthermore, automatically segmented versions of the original images are included. Each set of images was processed with three sub-symbolic SOD algorithms based on DL and a symbolic SOD algorithm based on the brain programming paradigm.
Ground Truth
Image ground truth is obtained through a manual process where a contour is selected to segregate the group of pixels that belong to the object of interest (foreground) from those that do not (background). This action is performed in Adobe Photoshop using the Pen Tool. To have clear visibility between pixels, a zoom of 400% on a 24-inch Full HD monitor is used for the contour. After contouring the object, the loop between the initial and final points is closed. The contoured closed loop groups the pixels that belong to the foreground, while the excluded pixels comprise the background. Finally, a layer mask is applied to the loop so that the regions that constitute the object of interest appear in white, while the remaining areas appear in black in the final image. For some objects, multiple contours might be needed depending on their shape. This process can take up to 15 min per image, requiring up to 9 h to process each object’s set of 36 images, depending on the object’s complexity.
Figure 4 depicts the ground truth creation process, which essentially lies in defining which pixels belong to the object of interest and distinguishing them from those that constitute the background. As shown in Figure 4, there is a trade-off between pixel detail and context, which is essential for properly selecting regions. As the zoom increases, independent pixels can be accurately identified, but the context of the image regions is lost.
SOD Algorithms
This section compares a symbolic algorithm under the brain programming paradigm with three different sub-symbolic algorithms based on DL techniques.
GBVSBP is an evolutionary computation paradigm that mimics the inner workings of the visual cortex, ref. [9]. Specifically, to solve the SOD task, the dorsal stream functions are replicated to create computational models able to segregate regions of the image that belong to the object of interest from those that belong to the background. The strategy adheres to a goal-oriented paradigm wherein learning is conceptualized as a symbolic optimization, whereby an individual is characterized by a template that delineates the visual attention (VA) model while simultaneously identifying essential components of the algorithm via artificial evolution. We selected this method because it represents a symbolic approach to SOD.
Introduced in 2022, adversarial robust SOD networks with a learnable noise (LeNo) module consist of a shallow noise inspired by the VA mechanism embedded in the encoder, initialized with a cross-shaped Gaussian distribution, and a noise estimation affecting only a single channel of the decoder rather than adding more network elements for postprocessing, ref. [10]. This technique increases the robustness of SOD, outperforming earlier research while also improving inference speed. We selected this network due to the authors’ claim that LeNo is robust to noise.
Proposed in 2022, SOD via the Extremely Downsampled Network (EDN) applies a strategy to learn a global view of the complete image, resulting in precise salient object localization, ref. [11]. A scale-correlated pyramid convolution is constructed to improve multi-level feature fusion and recover object details from the extreme downsampling, achieving state-of-the-art performance and real-time speed. We selected this network to test its repeatability against high-resolution images.
The Inverse Saliency Pyramid Reconstruction Network (InSPyReNet) was added to our work due to its continued competitive performance against more recent SOD models, offering an effective balance between state-of-the-art accuracy and computational efficiency under the resource constraints and testing methodology of this study. It was proposed in 2022 and introduced an image pyramid-based framework for high-resolution (HR) SOD without requiring HR training datasets, ref. [12]. The network produces a strict image pyramid structure of a saliency map that enables pyramid-based image blending, achieved with a dedicated method that synthesizes low-resolution (LR) and HR image pyramids to mitigate effective receptive field discrepancies during HR prediction. Multiple SOD metrics and boundary accuracy measure evaluations performed on public LR and HR SOD benchmarks demonstrated that InSPyReNet surpassed state-of-the-art methods.
To establish a baseline, all of the aforementioned models were trained on the same image database, known as FT (frequency tuned). This database was initially introduced by Achanta et al. and is still used to study the SOD problem. FT’s ground truth is obtained by performing a manual object-contour segmentation over images of animate and inanimate objects, ref. [13]. We selected this database due to its similarity to our dataset, although we use high-resolution images of inanimate objects captured from different perspectives. This aspect is relevant since the information registered in our dataset contains projections of a three-dimensional scene, while all other datasets studied within SOD portray an image processing problem.
Binarization
This section explains the process needed to convert the standard output of an SOD algorithm, which is a grayscale image, to a binary output. Depending on the type of algorithm, the resulting image can have different characteristics. For the case of algorithms based on DL, which are unable to work with images at their original size, the input is first scaled to be fed to the neural networks. This unavoidably causes the aspect ratio to be lost and spatial distortion to be introduced. Naturally, the output is also affected by these transformations, resulting in the original salient object image being a square-scaled version of the expected output. At the end of the process, the algorithms based on DL rescale the output image to the size of the original input image, recovering the aspect ratio but introducing additional distortion as a result of this operation. Furthermore, when stretching the original output, some spatial properties of the image are lost since object edges change from regions where a radical transition between white and black pixels is observed to gradients with multiple shades between black and white.
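The loss of crisp edges caused by rescaling can be reproduced on a single row of pixels. The sketch below is a simplified stand-in that uses plain linear interpolation on a one-dimensional binary profile; the actual resampling filters used by the DL implementations are not specified here, so the function `upsample_linear` is purely illustrative.

```python
def upsample_linear(row, factor):
    """Upsample a 1D row of pixel values by linear interpolation."""
    last = len(row) - 1
    n_out = len(row) * factor
    out = []
    for i in range(n_out):
        pos = i * last / (n_out - 1)   # position in input coordinates
        left = int(pos)
        right = min(left + 1, last)
        weight = pos - left
        out.append(round((1 - weight) * row[left] + weight * row[right]))
    return out


binary_row = [0, 0, 0, 255, 255, 255]      # sharp black-to-white edge
stretched = upsample_linear(binary_row, 3)
print(stretched)
print("intermediate gray levels:", sorted(set(stretched) - {0, 255}))
```

The stretched row contains values strictly between 0 and 255 around the edge, which is the gradient effect that makes a subsequent binarization step necessary.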
The symbolic SOD algorithm based on the brain programming paradigm is able to work with images at their original size, and its output is a saliency map, which is a grayscale image where multiple shades of gray denote different attention levels. Thus, regions marked in black indicate the lowest attention level, while those marked in white indicate the highest attention level. Any value in between indicates an intermediate attention level, with brighter values representing higher attention and darker values representing lower attention.
Considering the nature of the output images for each SOD algorithm (saliency maps), an image binarization process is necessary to generate the corresponding proto-objects. Hence, the regions in the image that belong to the salient object appear in white, while those that belong to the background appear in black. To this end, a binarization level is selected to establish the threshold for setting a pixel to white or black in the final image. To establish a fair binarization level for each algorithm output, an optimal threshold is selected by finding the value that maximizes the overlap between the binarized saliency map and the manually segmented ground truth. Algorithm 1 explains how the optimal threshold is defined and used to generate the corresponding proto-object from an input saliency map.
| Algorithm 1: Generate proto-object |
| Purpose: Create a proto-object by binarizing a saliency map with the optimal threshold. |
| Input: saliency_map (grayscale saliency map), groundtruth (manually segmented binary ground truth). |
| Functions: |
| Binarize (image, threshold): Returns the input image binarized by the input threshold. |
| FMeasure (image1, image2): Returns the F-Measure value from the two input binary images. |
| Variables: |
| threshold: Stores a value used for binarization. |
| best_threshold: Stores the best threshold. |
| f_measure: Stores an F-Measure result. |
| max_f_measure: Stores the maximum F-Measure. |
| binarized_image: Stores a binarized image. |
| Output: proto_object (binarized saliency map). |
| f_measure ← 0 |
| max_f_measure ← 0 |
| best_threshold ← 0 |
| for (threshold ← 1 to 255) do |
| binarized_image ← Binarize (saliency_map, threshold) |
| f_measure ← FMeasure (binarized_image, groundtruth) |
| if f_measure > max_f_measure then |
| max_f_measure ← f_measure |
| best_threshold ← threshold |
| end if |
| end for |
| proto_object ← Binarize (saliency_map, best_threshold) |
| return proto_object |
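More formally, consider the ground truth image $G$, where $G(x, y) \in \{0, 1\}$, and a proto-object $P_t$, where $P_t(x, y) \in \{0, 1\}$ is obtained by binarizing the saliency map with threshold $t$. It is necessary to find the value of $t$ that maximizes the overlap between $G$ and $P_t$. For this, we use the $F_\beta$ measure defined as follows:

$$F_\beta = \frac{(1 + \beta^2)\, Pr \cdot Re}{\beta^2\, Pr + Re},$$

where $Pr$ is precision, $Pr = |G \cap P_t| / |P_t|$, $Re$ is recall, $Re = |G \cap P_t| / |G|$, and $\beta^2$ is the parameter that controls the balance between $Pr$ and $Re$. $F_\beta$ measures the effectiveness of the overlap considering $\beta^2 = 0.3$ to emphasize precision following the standard protocol [9]. Then, the overlap between $G$ and $P_t$ is evaluated for every threshold between $t = 1$ and $t = 255$, and the optimal threshold is given by the following relation:

$$t^{*} = \arg\max_{t \in \{1, \dots, 255\}} F_\beta(G, P_t).$$

The argument that maximizes the function is then calculated. We repeat this process for the ground truth corresponding to each image.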
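The threshold search of Algorithm 1 can be sketched in a few lines. This is a minimal illustration on nested lists, not the code used to build the dataset: the weighting of 0.3 for the squared beta parameter is assumed from common SOD practice, and treating pixels greater than or equal to the threshold as foreground is likewise an assumption about the `Binarize` function.

```python
BETA_SQ = 0.3  # assumed weighting that favors precision over recall


def binarize(image, threshold):
    """Return a binary image: 1 where the pixel value is >= threshold, else 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in image]


def f_measure(prediction, ground_truth, beta_sq=BETA_SQ):
    """Weighted F-measure between two binary images of equal size."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(prediction, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def optimal_threshold(saliency_map, ground_truth):
    """Scan thresholds 1..255 and keep the first one with the highest F-measure."""
    best_threshold, best_score = 1, -1.0
    for threshold in range(1, 256):
        score = f_measure(binarize(saliency_map, threshold), ground_truth)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy 3x3 example with made-up values.
saliency_map = [[10, 40, 200], [20, 180, 220], [5, 30, 90]]
ground_truth = [[0, 0, 1], [0, 1, 1], [0, 0, 0]]
threshold, score = optimal_threshold(saliency_map, ground_truth)
proto_object = binarize(saliency_map, threshold)
print(threshold, score, proto_object)
```

Because the search needs the ground truth, the resulting threshold is an upper bound on what each algorithm can achieve with a single global threshold, which is what makes the comparison between algorithms fair.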
Figure 5 shows an example of an image processed with the four different SOD algorithms. There, a clear performance comparison between automatic segmentation techniques can be observed.
3.2.2. Overlay Generation
The primary purpose of the SOD3D dataset is to evaluate the impact of an SOD stage applied to 3D reconstruction. To achieve this, the proto-objects are then used to mask the original images acquired by the camera. As a result, we obtain an image in which all the regions belonging to the background are masked in black, and those that belong to the object of interest retain their original pixel values. These images are called overlays and are the ones to be fed to the photogrammetric pipeline for 3D reconstruction.
Figure 6 shows an example of an overlay generated by masking an original image with a proto-object.
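The masking operation described above amounts to a per-pixel selection. The following sketch illustrates it on nested lists of RGB tuples; the function name `make_overlay` and the tiny 2 × 2 example are hypothetical and only meant to show the operation.

```python
def make_overlay(image, proto_object):
    """Keep pixels where the proto-object is white; set the rest to black."""
    return [
        [pixel if mask else (0, 0, 0) for pixel, mask in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, proto_object)
    ]


# Toy 2x2 RGB image and its binary proto-object (255 = object, 0 = background).
image = [[(12, 80, 200), (250, 250, 250)],
         [(90, 30, 10), (5, 5, 5)]]
proto_object = [[255, 0],
                [0, 255]]
overlay = make_overlay(image, proto_object)
print(overlay)
```

Any nonzero mask value is treated as foreground, so the same function works whether proto-objects are stored as 0/1 or 0/255 images.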
3.2.3. 3D Reconstruction
The SOD3D dataset was developed to measure the impact of an SOD stage applied to 3D reconstruction. To this aim, the database includes 3D point clouds that represent the reconstructed objects using the traditional photogrammetry pipeline from each applicable set of images.
The fundamental principle of the photogrammetry pipeline relies on obtaining multiple views of the scene and then triangulating the locations between matching points to estimate their 3D coordinates. This process is incrementally extended to additional views with the aim of constructing a 3D point cloud. The default workflow in Meshroom was used to apply the photogrammetric pipeline over the input images to perform the 3D reconstruction. Based on this workflow, this pipeline can be divided into some critical stages, which are outlined below:
Feature extraction: This stage establishes the foundation for finding the relative pose of cameras in 3D space. Its primary goal is to extract distinctive groups of pixels that remain robust to viewpoint variations during image acquisition. Consequently, a scene feature should produce consistent descriptors across all captured images. The SIFT (Scale-Invariant Feature Transform) algorithm, ref. [14], is employed to extract keypoints and generate a set of image feature descriptors. Texture complexity may vary significantly both across different images and within local regions of the same image, causing the number of extracted features to fluctuate considerably. To address this issue, a post-filtering step is applied to limit the total number of features to a reasonable amount. Finally, grid filtering is used to ensure a homogeneous distribution of feature points throughout the image.
Image matching: The objective of this stage is to identify visual overlap between images, based on their shared feature content. To achieve this, image retrieval techniques are employed to identify images with overlapping content, avoiding the high computational cost of exhaustive feature matching. This is accomplished by representing each image as a compact descriptor, allowing for highly efficient distance evaluations between images. A widely used method for generating global image descriptors is the vocabulary tree approach, ref. [15]. Extracted image feature descriptors are propagated down the tree, hierarchically classified at each node, and assigned to a specific leaf node. Each descriptor is thus reduced to a simple leaf index, and the final image descriptor is then formed by collecting these leaf indices. Finally, the shared feature content between images can be evaluated by comparing these descriptors.
Feature matching: In this stage, features are matched between image pairs, which will subsequently provide the geometric foundation to reconstruct an initial 3D structure from the underlying 2D data. First, photometric matching between the descriptors of the two input images is performed. Each feature in image A is mapped to a list of candidate features from image B. This list is then refined under the assumption that only one valid match exists between the images. Specifically, for each descriptor in the first image, its two closest neighbors in the second image are identified, and a relative distance ratio threshold, ref. [16], is applied. This process yields a list of candidate matches based purely on photometric criteria. Since finding the two closest descriptors for each feature is computationally expensive using a brute-force approach, optimized algorithms are typically employed; while Approximate Nearest Neighbor is the most common, alternatives like Cascading Hashing are also widely used. Geometric filtering is then performed using epipolar geometry within the random sample consensus (RANSAC) framework for outlier rejection, ref. [17]. In each iteration, the fundamental matrix is computed from a small, randomly selected set of feature correspondences, and the number of correspondences that conform to this geometric model is counted. The procedure is repeated within the RANSAC loop to find the optimal consensus set.
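The photometric part of feature matching can be illustrated with a brute-force version of the distance ratio test. This is a sketch for small inputs only: the ratio of 0.8 is the value suggested in the original SIFT paper and is assumed here, the two-dimensional descriptors are toy stand-ins for real SIFT descriptors, and a production pipeline would replace the exhaustive search with an approximate nearest-neighbor index.

```python
import math


def ratio_test_matches(descriptors_a, descriptors_b, ratio=0.8):
    """Return (index_a, index_b) pairs that pass the distance ratio test."""
    matches = []
    for index_a, desc_a in enumerate(descriptors_a):
        # Distances from this descriptor to every descriptor in image B.
        distances = sorted(
            (math.dist(desc_a, desc_b), index_b)
            for index_b, desc_b in enumerate(descriptors_b)
        )
        if len(distances) < 2:
            continue  # the test needs a second neighbor to compare against
        (best, index_b), (second, _) = distances[0], distances[1]
        # Keep the match only if the best neighbor is clearly closer.
        if best < ratio * second:
            matches.append((index_a, index_b))
    return matches


descriptors_a = [(0.0, 0.0), (5.0, 5.0)]
descriptors_b = [(0.1, 0.0), (10.0, 10.0), (5.0, 5.1)]
print(ratio_test_matches(descriptors_a, descriptors_b))
```

Matches that survive this test are still only photometric candidates; the geometric filtering with RANSAC described above is what removes the remaining outliers.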
Structure from motion: This stage represents the core of the whole photogrammetry pipeline. It analyzes the geometric relationships between 2D input images to reconstruct a 3D model of the scene and simultaneously determine the pose and internal calibration of all cameras. The incremental pipeline is an iterative reconstruction process that begins with an initial two-view reconstruction and sequentially extends it by integrating new views. To initiate this, feature matches across image pairs are fused into tracks, where each track ideally represents a unique 3D point in space observed by multiple cameras. However, at this stage of the pipeline, the tracks still contain a significant number of outliers. To mitigate this, inconsistent tracks are filtered out during the fusion process. Next, the incremental pipeline must select the optimal initial image pair, which is crucial for a high-quality final reconstruction. This baseline pair must provide robust feature matches and guarantee reliable geometric constraints. Consequently, the selection process prioritizes pairs that maximize both the total match count and the spatial distribution of these features across the image planes. Simultaneously, the baseline must maintain a sufficiently wide triangulation angle between the camera views to ensure robust 3D triangulation. Then, the fundamental matrix between these two images is computed, setting the first camera as the origin of the 3D coordinate system. With the relative pose of the first two cameras established, the corresponding 2D feature matches are triangulated to generate the initial set of 3D points. Subsequently, the pipeline selects new images that share a sufficient number of correspondences with the existing 3D point cloud. This process is known as Next-Best-View selection. Utilizing these 2D–3D associations, the algorithm performs camera resectioning to estimate the pose of each new view. 
This resectioning step employs a Perspective-n-Point algorithm embedded within a RANSAC framework to robustly determine the camera pose that yields the highest consensus of feature matches. A final non-linear minimization step is then applied to refine each camera pose. Following the estimation of these new camera poses, tracks that become visible across two or more resected views are triangulated into 3D points. Next, a global Bundle Adjustment is executed to optimize all parameters together, including the camera intrinsics, extrinsics, and 3D point positions. To maintain reconstruction accuracy, the Bundle Adjustment results are filtered by removing observations that exhibit a high reprojection error or an insufficient triangulation angle. The triangulation of these new points expands the available candidate images for subsequent Next-Best-View selection. This process is executed iteratively: integrating new camera views, triangulating newly observed 2D features into 3D points, and filtering out any 3D points that become geometrically invalidated. This optimization loop continues until no remaining camera views can be localized, ref. [18].
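To make the triangulation and filtering steps concrete, the sketch below triangulates one point from two views with the linear (DLT) method and then applies the two checks mentioned above. It is an illustration under stated assumptions, not the AliceVision code: the function names are hypothetical, and the default thresholds reuse the "Max Reprojection Error" (4) and "Min Angle For Triangulation" (3) values from Appendix B, assuming they are expressed in pixels and degrees.

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one 3D point from two 3x4 projection matrices."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]               # homogeneous solution of A X = 0
    return X[:3] / X[3]

def reprojection_error(P, X, x):
    """Pixel distance between observation x and the projection of X."""
    proj = P @ np.append(X, 1.0)
    return np.linalg.norm(proj[:2] / proj[2] - x)

def triangulation_angle_deg(C1, C2, X):
    """Angle between the two viewing rays that meet at X."""
    r1, r2 = X - C1, X - C2
    cos = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def keep_point(P1, P2, C1, C2, x1, x2, max_error=4.0, min_angle=3.0):
    """Triangulate and report whether the point passes both filters."""
    X = triangulate_dlt(P1, P2, x1, x2)
    err = max(reprojection_error(P1, X, x1), reprojection_error(P2, X, x2))
    angle = triangulation_angle_deg(C1, C2, X)
    return X, bool(err <= max_error and angle >= min_angle)
```

A small angle means the two rays are nearly parallel, so a slight pixel error translates into a large depth error; this is why both the initial pair selection and the later filtering look at the triangulation angle.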
Depth map estimation: In this stage, a depth value for each input pixel is estimated. This is achieved by analyzing the similarities between neighboring cameras in volumetric regions around the intersection of their optical axes in the 3D space. For each reference image, the N nearest neighboring cameras are selected. The fronto-parallel planes are selected based on the intersection of the reference optical axis with the pixels of these neighboring cameras. This plane-sweeping approach constructs a matching volume of dimensions W, H, Z, representing multiple depth candidates per pixel. The matching similarity for all depth candidates is then evaluated using zero-mean normalized cross-correlation computed over a small patch from the reference image reprojected onto the neighboring views. This process generates a raw similarity volume, where photometric costs from each neighboring view are accumulated. Because this volume is inherently noisy, a spatial filtering step is applied along the X and Y axes to aggregate local costs. This filtering effectively suppresses isolated high-similarity outliers. From this regularized volume, the optimal depth is determined by selecting the local minima, mapping the chosen plane indices to their corresponding continuous depth values to populate an initial depth map. Since this depth map is restricted to the discrete intervals of the sampled planes, it exhibits banding artifacts. To resolve this, a sub-pixel refinement step is applied to achieve continuous and precise depth estimations. Finally, a filtering step is applied to ensure consistency between multiple cameras, ref. [19].
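The two core operations of this stage, patch similarity and per-pixel depth selection, can be illustrated with a short sketch. It assumes grayscale patches that have already been reprojected into a common frame and replaces the regularized selection of the actual pipeline with a plain per-pixel minimum; the function names are hypothetical.

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = np.asarray(patch_a, dtype=np.float64).ravel()
    b = np.asarray(patch_b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Textureless (constant) patches carry no information: report 0.
    return float(a @ b / denom) if denom > eps else 0.0

def winner_take_all(cost_volume, depth_candidates):
    """Pick, per pixel, the depth candidate with the lowest cost.

    cost_volume has shape (H, W, Z); depth_candidates has shape (Z,).
    """
    best_plane = np.argmin(cost_volume, axis=2)
    return np.asarray(depth_candidates)[best_plane]
```

Because the mean is subtracted and the result is normalized, the score is unaffected by a gain or offset change between views, which is why it tolerates moderate exposure differences. A cost can be derived as `1 - zncc(...)`, so that the best match corresponds to the minimum of the volume.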
Meshing: This stage merges all depth maps into a single dense point cloud and then establishes a relationship between the points to create a surface. First, the individual depth maps are fused into a global octree structure, where geometrically compatible depth values are merged into unified octree cells. Then a 3D Delaunay tetrahedralization is performed over the resulting point cloud. Next, a visibility-based voting procedure is executed to compute weights for both the tetrahedral cells and the facets connecting them, ref. [20]. A Graph Cut Max-Flow algorithm, ref. [21], is subsequently applied to solve for the optimal volumetric segmentation, where the minimum cut defines the extracted manifold mesh surface. After discarding noisy or isolated cells along the boundary, Laplacian filtering is applied to suppress localized mesh artifacts. Finally, a mesh simplification step is performed to drastically reduce the vertex count while preserving structural features.
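As an illustration of the Laplacian filtering step only, the sketch below moves every vertex toward the centroid of its edge-connected neighbors. It is a simplified uniform-weight version written for clarity; the defaults echo the "Smoothing Iterations" (5) and "Smoothing Lambda" (1) entries of Appendix B, but whether the pipeline interprets lambda in exactly this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def laplacian_smooth(vertices, faces, iterations=5, lam=1.0):
    """Uniform Laplacian smoothing of a triangle mesh.

    vertices: (N, 3) array of positions; faces: iterable of index triples.
    """
    n = len(vertices)
    # Build the undirected vertex adjacency from the triangle edges.
    neighbors = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))

    v = np.asarray(vertices, dtype=np.float64).copy()
    for _ in range(iterations):
        # Centroid of each vertex's neighbors; isolated vertices stay put.
        centroids = np.array([
            v[list(nb)].mean(axis=0) if nb else v[i]
            for i, nb in enumerate(neighbors)
        ])
        v += lam * (centroids - v)
    return v
```

Uniform smoothing of this kind removes high-frequency noise but also shrinks the surface, so the number of iterations is normally kept small.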
The 3D point clouds included in the SOD3D dataset are generated from the original images and the overlays obtained by masking them with the ground truth and the different proto-objects derived from the SOD algorithms. In this manner, six different 3D reconstructions for each object are expected to be generated from each set of images: the original images, the ground truth overlay, the GBVSBP overlay, the LeNo overlay, the EDN overlay, and the InSPyReNet overlay.
Appendix B provides details on the parameter configuration used at each stage of the photogrammetry pipeline in Meshroom to generate the corresponding point clouds.
Note that certain image sets lack sufficient identifiable features for full reconstruction; consequently, there is no one-to-one correspondence between image sets and 3D point clouds. Despite these limitations, the dataset includes 153 reconstructed 3D models that represent the full diversity of the objects.
Figure 7 illustrates examples of 3D models derived from the dataset’s image sets via a traditional photogrammetry pipeline.
4. Evaluation Metrics
This section outlines the proposed framework used to evaluate the quality of the 3D reconstructions generated by each method. It details the entire procedure, beginning with point cloud preprocessing to ensure a fair comparison and concluding with the specific metrics used for the assessment.
The 3D reconstructions generated by the methods described in the previous sections consist of points positioned and interconnected in 3D space. Assessing the quality of these point clouds requires a direct comparison against a known reference. For this purpose, we generated a 3D ground-truth point cloud by reconstructing a specialized set of image overlays, which were generated by masking the original images with the manually segmented ground truth.
To evaluate the geometric similarity between the reconstructed and reference point clouds, we measure the alignment of their 3D coordinates. The metric selected for this purpose is the Chamfer Distance (CD), which computes the average bidirectional nearest-neighbor distance between the two point sets. Due to its point-wise nature, CD is able to operate despite differences in point quantities between the clouds. Evaluating our 3D reconstructions via CD requires a preliminary alignment and registration step. Because CD does not inherently account for differences in pose—a condition present in our 3D reconstructions—the point clouds must share the same coordinate space and orientation to maximize point-to-point correspondence and guarantee a fair comparison. To achieve this, an initial orientation is calculated using Principal Component Analysis (PCA), a dimensionality reduction technique employed to identify the principal axes of the reconstructed point cloud for global coarse alignment. The alignment is then locally refined using the Iterative Closest Point (ICP) algorithm, which transforms the reconstructed point cloud to maximize correspondence with the target reference. Both PCA and ICP algorithms were implemented in Python 3.10.16, aided by the NumPy 1.26.3, Pandas 2.1.4, and Open3D 0.19.0 open-source libraries. It is important to note that a key limitation of the ICP algorithm is its high sensitivity to initial conditions; poor initial positioning of the source relative to the target point cloud can easily cause the algorithm to converge to a local minimum rather than the global optimum. To overcome this restriction and improve registration accuracy, we manually reoriented specific reconstructed point clouds prior to the automatic alignment steps. This manual operation was performed using CloudCompare 2.13.2, an open-source application. 
Once both the source 3D reconstructed and target reference point clouds achieve their final registered configuration, CD is finally computed to assess their geometric similarity.
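The coarse alignment and the metric itself can be sketched in a few lines of NumPy. This is an illustrative stand-in rather than the code used for Table 2: the function names are hypothetical, the Chamfer Distance is taken here as the sum of the two mean nearest-neighbor distances (other conventions average the two terms or use squared distances), and the brute-force distance matrix is only practical for small clouds. The ICP refinement is not reproduced.

```python
import numpy as np

def pca_axes(points):
    """Principal axes (rows), sorted by decreasing variance."""
    centered = points - points.mean(axis=0)
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    return axes

def pca_coarse_align(source, target):
    """Rotate and translate source so its principal axes match the target's."""
    axes_src, axes_tgt = pca_axes(source), pca_axes(target)
    R = axes_tgt.T @ axes_src          # maps each source axis to a target axis
    if np.linalg.det(R) < 0:
        # Flip the least significant axis so that R is a proper rotation.
        axes_src = axes_src.copy()
        axes_src[2] *= -1
        R = axes_tgt.T @ axes_src
    return (source - source.mean(axis=0)) @ R.T + target.mean(axis=0)

def chamfer_distance(a, b):
    """Sum of mean nearest-neighbor distances a->b and b->a.

    a: (N, 3) and b: (M, 3); N and M may differ.
    """
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Principal axes are only defined up to sign, so the PCA step can leave the source flipped by 180 degrees about an axis. This is one reason a local refinement, and in some cases a manual reorientation, is still needed before the distance is meaningful.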
Table 2 compares the proposed 3D ground truth with the reconstructed point clouds across all objects and techniques.
The comparison results between the proposed 3D ground truth and the reconstructed point clouds are summarized in Table 2. The CD metric measures the average distance between the two point clouds; therefore, a lower value indicates higher similarity, while a higher value denotes greater discrepancy. Note that, for some SOD algorithms, the CD metric is marked as “N/A” (not available) for certain objects. These gaps arise from the lack of identifiable features in certain image sets, primarily due to poor segmentation, which prevents successful 3D reconstruction; consequently, no metric can be reported.
5. User Notes
This section addresses the key limitations of our dataset, including the effects of arbitrarily defining the camera positions, the controlled environment, and the SOD model selection and training.
This dataset is focused on photogrammetric 3D reconstruction, whose primary goal is to recover 3D geometric features from a set of scene images. For this purpose, 36 images acquired from different camera positions are included in the dataset for each object to be reconstructed in 3D. The camera positions were arbitrarily defined, aiming to maximize the object’s surface coverage while keeping them standard across the whole dataset, despite the variety of object shapes. This represents a drawback for accurate 3D reconstruction since, according to ref. [22], carefully selecting these camera positions and attitudes minimizes errors in 3D geometric features, thereby enhancing 3D reconstruction quality. Future work can explore the correct camera network design according to each object’s morphology. Nevertheless, the creation of this dataset aims to study the impact of SOD on the 3D reconstruction; hence, this limitation does not void its value.
The original images were acquired under a controlled environment, where the illumination, distance, and camera positions were carefully selected. The dataset can be expanded to include pictures acquired under different environmental conditions (e.g., camera distances and positions, exterior environments, natural light, and so on) to measure their impact on the final 3D reconstruction.
In addition to the manually segmented ground truth for each original image, this dataset includes automatically segmented versions. These automatically segmented versions of the images were created using symbolic and sub-symbolic SOD algorithms. The FT image database was used to train each SOD approach applied in our work to establish a baseline. Further testing can be performed by diversifying the SOD algorithms applied to create this dataset and by training them with different databases.
Author Contributions
Conceptualization, A.B.R., G.O. and E.C.; methodology, A.B.R., G.O. and E.C.; software, A.B.R. and M.O.; validation, A.B.R., G.O., E.C. and M.O.; formal analysis, A.B.R., G.O. and E.C.; investigation, A.B.R.; resources, G.O. and E.C.; data curation, A.B.R. and M.O.; writing—original draft preparation, A.B.R.; writing—review and editing, A.B.R., G.O., E.C. and M.O.; visualization, A.B.R.; supervision, G.O. and E.C.; project administration, G.O. and E.C.; funding acquisition, G.O. and E.C. All authors have read and agreed to the published version of the manuscript.
Funding
The APC was funded by Centro de Investigación Científica y Educación Superior de Ensenada (CICESE), No. 31830 and Tecnológico Nacional de México/CENIDET, No. 24345.26-P.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
The authors dedicate this work to the memory of our friend, colleague and mentor, Gustavo Olague. His scientific vision, guidance, generosity, and passion for research profoundly influenced our professional and personal development. His legacy continues to inspire our work and will remain an enduring source of motivation for future generations of researchers.
Conflicts of Interest
Author Matthieu Olague was employed by the company IBM Technology, Campus Guadalajara. However, there is no financial or any other type of relationship between the company and this research work. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| SOD | Salient Object Detection |
| DL | Deep Learning |
| HDR | High Dynamic Range |
| VA | Visual Attention |
| HR | High Resolution |
| LR | Low Resolution |
| CD | Chamfer Distance |
| PCA | Principal Component Analysis |
| ICP | Iterative Closest Point |
| SIFT | Scale-Invariant Feature Transform |
| RANSAC | Random Sample Consensus |
Appendix A. Dataset Overview
This appendix provides an overview of the dataset.
Table A1 includes the subject area, data format, how and where the data was collected, and how it can be accessed.
Table A1. Dataset specifications table.
| Subject | Computer Sciences |
| Specific subject area | 3D Reconstruction, Photogrammetry, Image Segmentation, Computer Vision and Pattern Recognition, Applied Machine Learning, Computer Graphics, and Computer-Aided Design. |
| Type of data | Image (RGB JPG, Grayscale PNG, Binary PNG), 3D point cloud (CSV) |
| Data collection | This repository contains a total of 15,120 images and 153 reconstructed 3D models. The dataset is divided into 28 sets of high-resolution images of different everyday objects. Each set comprises 36 images of each object, acquired from various camera positions with the primary goal of reconstructing the object through photogrammetric techniques. The 1008 original images were taken with a Canon EOS 5D Mark III camera and a 24–105 mm variable lens, keeping optical parameters as fixed as possible to standardize image acquisition. To evaluate salient object detection (SOD) algorithms, the database includes manually segmented ground truth for each original image (Adobe Photoshop CS6 13.0). Furthermore, the repository contains automatically segmented versions of the images processed with three state-of-the-art deep learning (DL)-based SOD algorithms (Python 3.10.16) and an additional set of images processed with a symbolic SOD algorithm based on the brain programming paradigm (Matlab R2017b). Finally, for 3D reconstructions, the database includes 3D point clouds for each object (Meshroom-2021.1.0). |
| Data source location | The FT database was recovered from [13]. The SOD3D database was collected from an office building located in Mexicali, Baja California, Mexico. |
| Data accessibility | Repository name: SOD3D: A salient object detection dataset for photogrammetric 3D reconstruction Data identification number: 10.17632/695w3dws5f.2 Direct URL to data: https://data.mendeley.com/datasets/695w3dws5f/2 (accessed on 26 May 2026) Instructions for accessing these data: The dataset is available at Mendeley Data. |
Appendix B. 3D Reconstruction Parameters
This appendix provides details on the parameter configuration used for generating the 3D point clouds from each set of images in Meshroom-2021.1.0.
Table A2 includes the name of each stage in shaded cells and the name and value for each parameter.
Table A2. 3D reconstruction parameter configuration.
| Camera Init |
|---|
| Sensor Database | cameraSensors.db |
| Default Field Of View | 45 |
| Group Camera Fallback | Folder |
| Allowed Camera Models | Pinhole, radial1, radial3, brown, fisheye4, fisheye1 |
| Apply internal white balance | ☑ 1 |
| ViewId Method | metadata |
| Verbose Level | info |
| Feature Extraction |
| Describer Types | sift |
| Describer Density | ultra |
| Describer Quality | ultra |
| Contrast Filtering | GridSort |
| Grid Filtering | ☑ 1 |
| Force CPU Extraction | ☑ 1 |
| Max Nb Threads | 0 |
| Verbose Level | info |
| Image Matching |
| Method | VocabularyTree |
| Voc Tree: Tree | vlfeat_K80L3.SIFT.tree |
| Voc Tree: Minimal Number of Images | 200 |
| Voc Tree: Max Descriptors | 500 |
| Voc Tree: Nb Matches | 50 |
| Max Nb Threads | 0 |
| Verbose Level | info |
| Feature Matching |
| Photometric Matching Method | ANN_L2 |
| Geometric Estimator | acransac |
| Geometric Filter Type | Fundamental_matrix |
| Distance Ratio | 0.8 |
| Max Iteration | 2048 |
| Geometric Validation Error | 0 |
| Known Poses Geometric Error Max | 5 |
| Max Matches | 0 |
| Save Putative Matches | ☐ 2 |
| Cross Matching | ☐ 2 |
| Guided Matching | ☐ 2 |
| Match From Known Camera Poses | ☐ 2 |
| Export Debug Files | ☐ 2 |
| Verbose Level | info |
| Structure From Motion |
| Localizer Estimator | acransac |
| Observation Constraint | Basic |
| Localizer Max Ransac Iterations | 4096 |
| Localizer Max Ransac Error | 0 |
| Lock Scene Previously Reconstructed | ☐ 2 |
| Local Bundle Adjustment | ☑ 1 |
| LocalBA Graph Distance | 1 |
| Maximum Number of Matches | 0 |
| Minimum Number of Matches | 0 |
| Min Input Track Length | 2 |
| Min Observation For Triangulation | 2 |
| Min Angle For Triangulation | 3 |
| Min Angle For Landmark | 2 |
| Max Reprojection Error | 4 |
| Min Angle Initial Pair | 5 |
| Max Angle Initial Pair | 40 |
| Use Only Matches From Input Folder | ☐ 2 |
| Use Rig Constraint | ☑ 1 |
| Force Lock of All Intrinsic Camera Params. | ☐ 2 |
| Filter Track Forks | ☐ 2 |
| Inter File Extension | .abc |
| Verbose Level | info |
| Prepare Dense Scene |
| Output File Type | exr |
| Save Metadata | ☑ 1 |
| Save Matrices Text Files | ☐ 2 |
| Correct images exposure | ☐ 2 |
| Verbose Level | info |
| Depth Map |
| Downscale | 2 |
| Min View Angle | 2 |
| Max View Angle | 70 |
| SGM: Nb Neighbor Cameras | 10 |
| SGM: WSH | 4 |
| SGM: GammaC | 5.5 |
| SGM: GammaP | 8 |
| Refine: Nb Neighbor Cameras | 6 |
| Refine: Number of Samples | 150 |
| Refine: Number of Depths | 31 |
| Refine: Number of Iterations | 100 |
| Refine: WSH | 3 |
| Refine: Sigma | 15 |
| Refine: GammaC | 15.5 |
| Refine: GammaP | 8 |
| Refine: Tc or Rc pixel size | ☐ 2 |
| Export Intermediate Results | ☐ 2 |
| Number of GPUs | 0 |
| Verbose Level | info |
| Depth Map Filter |
| Min View Angle | 2 |
| Max View Angle | 70 |
| Number of Nearest Cameras | 10 |
| Min Consistent Cameras | 3 |
| Min Consistent Cameras Bad Similarity | 4 |
| Filtering Size in Pixels | 0 |
| Filtering Size in Pixels Bad Similarity | 0 |
| Compute Normal Maps | ☐ 2 |
| Verbose Level | info |
| Meshing |
| Custom Bounding Box | ☐ 2 |
| Estimate Space From SfM | ☑ 1 |
| Min Observations For SfM Space Est. | 3 |
| Min Observations Angle For SfM Space Est. | 10 |
| Max Input Points | 50,000,000 |
| Max Points | 5,000,000 |
| Max Points Per Voxel | 1,000,000 |
| Min Step | 2 |
| Partitioning | singleBlock |
| Repartition | multiResolution |
| angleFactor | 15 |
| simFactor | 15 |
| pixSizeMarginInitCoef | 2 |
| pixSizeMarginFinalCoef | 4 |
| voteMarginFactor | 4 |
| contributeMarginFactor | 2 |
| simGaussianSizeInit | 10 |
| simGaussianSize | 10 |
| minAngleThreshold | 1 |
| Refine Fuse | ☑ 1 |
| Helper Points Grid Size | 10 |
| Densify | ☐ 2 |
| Nb Pixel Size Behind | 4 |
| Full Weight | 1 |
| Weakly Supported Surface Support | ☑ 1 |
| Add Landmarks To The Dense Point Cloud | ☐ 2 |
| Tretrahedron Neighbors Coherency Nb It. | 10 |
| minSolidAngleRatio | 0.2 |
| Nb Solid Angle Filtering Iterations | 2 |
| Colorize Output | ☐ 2 |
| Add Mask Helper Points | ☐ 2 |
| Helper Points: Mask Segment Size | 50 |
| Save Raw Dense Point Cloud | ☐ 2 |
| Export DEBUG Tetrahedralization | ☐ 2 |
| Seed | 0 |
| Verbose Level | info |
| Mesh Filtering |
| Keep Only the Largest Mesh | ☐ 2 |
| Smoothing Subset | all |
| Smoothing Boundaries Neighbors | 0 |
| Smoothing Iterations | 5 |
| Smoothing Lambda | 1 |
| Filtering Subset | all |
| Filtering Iterations | 1 |
| Filter Large Triangles Factor | 60 |
| Filter Triangles Ratio | 0 |
| Verbose Level | info |
| Texturing |
| Texture Side | 8192 |
| Texture Downscale | 2 |
| Texture File Type | png |
| Unwrap Method | Basic |
| Use UDIM | ☑ 1 |
| Fill Holes | ☐ 2 |
| Padding | 5 |
| MultiBand Downscale | 4 |
| MultiBand contributions High Freq | 1 |
| MultiBand contributions Mid-High Freq | 5 |
| MultiBand contributions Mid-Low Freq | 10 |
| MultiBand contributions Low Freq | 0 |
| Use Score | ☑ 1 |
| Best Score Threshold | 0.1 |
| Angle Hard Threshold | 90 |
| Process Colorspace | sRGB |
| Correct Exposure | ☐ 2 |
| Force Visible By All Vertices | ☐ 2 |
| Flip Normals | ☐ 2 |
| Visibility Remapping Method | PullPush |
| Subdivision Target Ratio | 0.8 |
| Verbose Level | info |
References
- Griwodz, C.; Gasparini, S.; Calvet, L.; Gurdjos, P.; Castan, F.; Maujean, B.; De Lillo, G.; Lanthony, Y. AliceVision Meshroom: An open-source 3D reconstruction pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference, Istanbul, Turkey, 28 September–1 October 2021.
- Yan, Q.; Xu, L.; Shi, J.; Jia, J. Hierarchical Saliency Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
- Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
- Wang, L.; Huchuan, L.; Wang, Y.; Mengyang, F.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-Level Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017.
- Fan, P.; Zhang, J.; Xu, G.; Cheng, M.; Shao, L. Salient Objects in Clutter. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2344–2366.
- Reizenstein, J.; Shapovalov, R.; Henzler, P.; Sbordone, L.; Labatut, P.; Novotny, D.; Ruan, X. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
- Gabara, G.; Sawicki, P. CRBeDaSet: A Benchmark Dataset for High Accuracy Close Range 3D Object Reconstruction. Remote Sens. 2023, 15, 1116.
- Chierchia, R.; Lebrat, L.; Ahmedt-Aristizabal, D.; Salvado, O.; Fookes, C.; Cruz, R. SALVE: A 3D Reconstruction Benchmark of Wounds from Consumer-Grade Videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025.
- Olague, G.; Menendez-Clavijo, J.A.; Olague, M.; Ocampo, A.; Ibarra-Vazquez, G.; Ochoa, R.; Pineda, R. Automated design of salient object detection algorithms with brain programming. Appl. Sci. 2022, 12, 10686.
- Wang, H.; Wan, L.; Tang, H. LeNo: Adversarial robust salient object detection networks with learnable noise. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington D.C., USA, 7–14 February 2023.
- Wu, Y.; Liu, Y.; Zhang, L.; Cheng, M.; Ren, B. EDN: Salient object detection via extremely-downsampled network. IEEE Trans. Image Process. 2022, 31, 3125–3136.
- Taehun, K.; Kunhee, K.; Joonyeong, L.; Dongmin, C.; Jiho, L.; Daijin, K. Revisiting Image Pyramid Structure for High Resolution Salient Object Detection. In Proceedings of the Sixteenth Asian Conference on Computer Vision, Macao, China, 4–8 December 2022.
- Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
- Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999.
- Nister, D.; Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006.
- Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Fischler, M.; Bolles, R. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
- Hartley, R.; Zisserman, A. 3D Reconstruction of Cameras and Structure. In Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2003; pp. 262–278.
- Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
- Jancosek, M.; Pajdla, T. Multi-view reconstruction preserving weakly-supported surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
- Boykov, Y.; Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1124–1137.
- Olague, G.; Mohr, R. Optimal camera placement for accurate reconstruction. Pattern Recognit. 2002, 35, 927–944.