SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction

Barrera Román, Aarón; Olague, Gustavo; Clemente, Eddie; Olague, Matthieu

doi:10.3390/data11070157

Open Access Data Descriptor

¹ Department of Computer Science, Centro de Investigación Científica y Educación Superior de Ensenada, Ensenada 22860, Mexico

² Department of Computer Science, Tecnológico Nacional de México/CENIDET, Cuernavaca 62490, Mexico

³ IBM Technology Campus Guadalajara, El Salto 45680, Mexico

* Author to whom correspondence should be addressed.

Data 2026, 11(7), 157; https://doi.org/10.3390/data11070157 (registering DOI)

Submission received: 14 April 2026 / Revised: 28 May 2026 / Accepted: 29 May 2026 / Published: 25 June 2026

(This article belongs to the Section Information Systems and Data Management)

Abstract

Three-dimensional (3D) reconstruction from a photogrammetric perspective aims to infer the geometric structure of a scene from a set of images, including the recovery of depth information inherently lost during image acquisition. Conventional photogrammetric pipelines rely on multiple handcrafted processing stages, often requiring manual intervention. This work introduces a dataset designed to support the study of background removal techniques in photogrammetric workflows through salient object detection (SOD). The dataset comprises 15,120 images divided into sets of 28 distinct objects, each set including 36 high-resolution RGB images captured from multiple viewpoints. Additionally, each set provides 36 manually segmented images, as well as automatically segmented versions obtained using four different SOD algorithms. To facilitate evaluation and reproducibility, 153 reconstructed 3D models are provided across all object categories, and a 3D reconstruction evaluation methodology based on the Chamfer Distance metric is proposed, enabling the analysis of the impact of different segmentation strategies on 3D reconstruction. The dataset offers a benchmark resource for the development, comparison, and validation of methods aimed at improving photogrammetric pipelines through automated information filtering.

Dataset: https://doi.org/10.17632/695w3dws5f.2

Dataset License: CC-BY 4.0

Keywords:

photogrammetry; salient object detection; visual attention; image segmentation; background removal; brain programming

1. Summary

3D reconstruction from a photogrammetry standpoint is the process of inferring geometric features of a scene from a set of pictures. These geometric features include depth information from the original scene, which is unavoidably lost when capturing images. There is a broadly known and well-established process to perform this task called the photogrammetric pipeline, ref. [1]; nevertheless, several stages of this process require handcrafted tasks to be performed. We introduce salient object detection (SOD) as a key topic for 3D reconstruction, since this technique enables the filtering of non-relevant information from images by identifying the regions that belong to the objects of interest, which are known as the proto-object. Formally, the SOD problem is defined as follows: Given a set of images $I = \{I_{1}, I_{2}, \dots, I_{n}\}$ and their corresponding proto-objects $P = \{P_{1}, P_{2}, \dots, P_{n}\}$, the SOD model establishes a relationship between an image ($I_{i}$) and the proto-object ($P_{i}$). This information filtering technique becomes convenient for feeding only useful 2D information to the photogrammetric pipeline, thus avoiding manual postprocessing in the 3D domain. Therefore, the SOD3D dataset allows researchers to evaluate the impact of an SOD stage applied to 3D reconstruction and contrast the results between multiple visual attention algorithms. In this work, we provide as examples one symbolic approach and three other sub-symbolic approaches based on deep learning (DL) techniques.

The SOD3D dataset is designed to test the application of SOD to enhance the photogrammetric 3D reconstruction; thus, we explored the state of the art in datasets related to these two problems: SOD and photogrammetric 3D reconstruction. For salient object detection, the benchmark datasets were developed in the previous decade. Ref. [2] was presented in 2015 with the main goal of addressing the limitations of earlier datasets, which were considered to be simple and not representative of real-world conditions. Later that year, ref. [3] was introduced, seeking to add complexity regarding foreground–background contrast and multiple salient objects. One of the largest and most widely used benchmarks is ref. [4], developed in 2017, claiming to be more challenging in terms of image complexity. Later, in 2022, ref. [5] was presented, introducing new challenges like blurry images, object occlusion, and background clutter, with the remarkable new consideration of having no salient objects in some of the presented images. In addition to the fact that these SOD datasets lack high-resolution images, none of them is focused on the problem of 3D reconstruction. However, more recently introduced datasets do address the problem of 3D reconstruction. Ref. [6] was presented by Meta in 2021, including over 1.5 million images from 50 different object categories, making it the most extensive dataset designed for the 3D reconstruction problem. Ref. [7] was later introduced in 2023, focused on a building with challenging textures. Introduced in 2025 for the medical field, ref. [8] addresses the problem of non-invasive monitoring of chronic wounds. Besides the fact that the camera network for acquiring the images for these 3D datasets is not standardized, none of them includes manually segmented versions of the images where the object of interest is isolated from the background, since they do not consider SOD as part of the 3D reconstruction problem. 
From the state of the art, we can see that the uniqueness of the proposed dataset comes from identifying an overlap and addressing critical concerns from both the SOD and 3D reconstruction problems: high-resolution imagery, standardized acquisition, and manually segmented ground truths.

In summary, the public release of this dataset offers several key advantages for researchers working in the field of photogrammetric 3D reconstruction:

To the best of our knowledge, our proposal establishes a benchmark through a high-resolution dataset of original and processed images specifically designed for studying visual attention algorithms applied to photogrammetric 3D reconstruction.
The dataset was developed by acquiring original images in a controlled and standardized manner, allowing researchers to fairly assess the performance of different SOD algorithms from a 3D reconstruction perspective.
The release of this data enables researchers to measure the impact of an SOD stage on 3D reconstruction and compare multiple visual attention algorithms: one based on a symbolic approach and three additional sub-symbolic approaches based on DL techniques.
By including both automatically produced saliency maps and manually generated ground truth images, the dataset provides the necessary baselines to fairly evaluate the performance of different SOD algorithms against a defined metric.

2. Data Description

SOD3D is a testing dataset built to evaluate the impact of an SOD stage applied to 3D reconstruction. It consists of images of 28 objects of different sizes, shapes, and textures, as well as their processed versions. Starting with 36 original photographs taken for each object from various points of view, a manually segmented ground truth is included for each original image as a baseline for SOD algorithm evaluation. Furthermore, saliency maps generated with four SOD algorithms, one symbolic, based on the brain programming methodology, and three sub-symbolic, based on DL techniques, are included. Additionally, binarized versions of each saliency map, referred to as proto-objects, are included to evaluate the performance of the SOD algorithms. Furthermore, overlays obtained by masking the original images with the proto-objects are included as an additional image set. Finally, a set of 3D reconstructions is provided in the form of point clouds. The repository’s root contains a directory for each object. Inside these directories are folders for the original images and their processed versions. The generic folder structure of the repository is depicted in Figure 1.

Inside each object’s folder, a total of six folders contain the original images, the manually segmented ground truth, the saliency maps generated by each SOD algorithm, the proto-objects derived from the saliency maps, the overlays generated from masking the original images with the proto-objects, and the final reconstructions resulting from each object after applying each proposed technique. As shown in Figure 1, the Original and GroundTruth folders contain the raw images and their manually segmented versions. Then, the SaliencyMap and ProtoObject folders share the same structure, where an independent directory is reserved for each SOD algorithm. Similarly, Overlay includes a folder for the data derived from each SOD algorithm, with the addition of a folder called GroundTruth, which contains the data derived from processing the original images using the manually segmented ground truth, to serve as a reference. Finally, the Reconstruction folder contains the available 3D reconstructions in the form of point clouds for each applied technique. Each file name includes a prefix that contains identifiers that encode the related object and the corresponding processing stage. Table 1 shows the naming conventions and file formats for each stage in the pipeline.
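As a quick consistency check, the folder contents described above can be tallied. The sketch below assumes that every image-bearing folder holds exactly one file per viewpoint and per variant, which is our reading of the description rather than an official manifest; under that assumption the total reproduces the 15,120 images reported in the abstract. Point-cloud reconstructions are deliberately left out of the count.

```python
# Tally of image files implied by the folder description (an assumed reading,
# not an official manifest). Point-cloud reconstructions are not counted.
OBJECTS = 28          # object directories in the repository root
VIEWS = 36            # photographs per object
SOD_ALGORITHMS = 4    # one symbolic method plus three deep-learning methods

# Number of image variants stored per viewpoint in each folder.
variants_per_folder = {
    "Original": 1,
    "GroundTruth": 1,
    "SaliencyMap": SOD_ALGORITHMS,
    "ProtoObject": SOD_ALGORITHMS,
    "Overlay": SOD_ALGORITHMS + 1,  # four algorithms plus the GroundTruth reference
}

images_per_object = VIEWS * sum(variants_per_folder.values())
total_images = OBJECTS * images_per_object

print(images_per_object)  # 540
print(total_images)       # 15120
```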

3. Methods

This section, divided into two parts, outlines the methodology proposed to acquire and process the data. It covers the initial setup for collecting raw data and the subsequent processing required to generate the final products included in the dataset.

3.1. Initial Setup

This section, divided into four parts, details how the images that constitute the core of the dataset were acquired. We start by describing the camera used for the acquisition and the optical parameters selected for this purpose, then explain the camera network configuration, describe the objects selected for the dataset, and finally detail the environmental conditions.

3.1.1. Optical Parameters

3D reconstruction in photogrammetry heavily relies on acquiring images where objects of interest appear crisp enough for the features on their surface to be properly detected, described, and identified. For this purpose, it is crucial that the whole surface of the object of interest appears sharp and in focus in each image. To meet this requirement, an appropriate camera and optical parameter configuration must be selected.

The camera setup selected for the image acquisition task was a Canon EOS 5D Mark III with a Canon EF 24–105 mm f/4 IS USM zoom lens. This camera has a full-frame sensor delivering a maximum spatial resolution of 5760 × 3840 pixels. The zoom lens can be adjusted to any focal length between 24 and 105 mm.

The depth of field is the distance between the closest and farthest objects that appear acceptably sharp in an image. To acquire images where the object of interest appears entirely focused, the depth of field must be deep enough to cover the entire object of interest, at the very least, regardless of camera position and attitude during the acquisition. Since the focus distance was set to be constant and relatively short (1 m) due to space limitations, the aperture and focal length were the parameters to be adjusted. The focal length was kept at relatively small values considering the short focus distance; however, to minimize perspective distortion, the shortest available focal length was avoided as much as possible, depending on the dimensions of the object of interest. Thus, focal length values between 24 and 35 mm were set during the acquisition. The only remaining optical parameter related to focus was the aperture, which is crucial for defining the image’s exposure.

The so-called exposure triangle comprises three key camera settings that work together to determine how light is captured and directly impact the final image exposure. The three parameters are the aperture, sensitivity, and shutter speed. The aperture consists of the actual diaphragm opening that controls the amount of light impacting the sensor. The sensitivity is the level of amplification of the amount of light detected by the sensor. The shutter speed is the period the sensor remains exposed to light during the capture. There is a trade-off between these three parameters since they are highly interrelated, and adjusting each one directly impacts the final exposure in the image.

To avoid known noise effects (e.g., grainy patterns in low-contrast areas) in the final image, sensitivity was set to a low value of ISO 200, which is also recommended for artificially illuminated building interiors. To ensure a suitable depth of field, an aperture value of f/8 was set. Low light levels resulting from the selected small aperture had to be compensated through longer exposure times, which were variable and automatically selected on each capture depending on the scene light conditions. For this purpose, the aperture priority setting was used during the whole acquisition. Finally, the high dynamic range (HDR) setting was active for each capture to ensure a homogeneous level of detail over the entire image. In this mode, three images with different exposure levels are acquired for each capture and automatically mixed to obtain an HDR image. Starting from an image with an ideal exposure value of 0, an under-exposed image (exposure value of −3) is acquired to retain details in bright areas, and an over-exposed image (exposure value of +3) is also acquired to keep details in dark areas. Then, the three images are automatically merged into a single capture with evenly spread details across the whole image. This process is performed internally by the Canon EOS 5D Mark III as part of its HDR capabilities.
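To give a feel for the depth of field that these settings imply, the following sketch applies the standard thin-lens depth-of-field approximation to the stated focus distance (1 m), aperture (f/8), and focal length range (24 to 35 mm). The circle of confusion of 0.03 mm is a common full-frame convention and is an assumption here, as is the function name; the source does not report the values it worked with.

```python
def depth_of_field_mm(focal_mm, f_number, focus_mm, coc_mm=0.03):
    """Return (near, far) limits of acceptable sharpness in millimetres.

    Standard thin-lens approximation. coc_mm is the circle of confusion;
    0.03 mm is a common full-frame convention and an assumption here.
    """
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = focus_mm * (hyperfocal - focal_mm) / (hyperfocal + focus_mm - 2 * focal_mm)
    if focus_mm >= hyperfocal:
        # Everything beyond the near limit is acceptably sharp.
        return near, float("inf")
    far = focus_mm * (hyperfocal - focal_mm) / (hyperfocal - focus_mm)
    return near, far


dof_by_focal = {}
for focal in (24, 35):
    near, far = depth_of_field_mm(focal, f_number=8, focus_mm=1000)
    dof_by_focal[focal] = (near, far)
    print(f"{focal} mm: sharp from {near:.0f} mm to {far:.0f} mm "
          f"(depth {far - near:.0f} mm)")
```

Under these assumptions the sharp zone is roughly 0.97 m deep at 24 mm and roughly 0.39 m deep at 35 mm, which illustrates why shorter focal lengths give more margin for larger objects at a fixed 1 m focus distance.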

3.1.2. Camera Network Configuration

Photographs from different camera positions were acquired for each object with the purpose of recovering its three-dimensional features. Figure 2 depicts the camera network configuration used for the captures from side and top views. The camera positions were arbitrarily defined to maximize the object’s surface coverage while keeping them standard across the whole dataset despite variability in object shapes. For this purpose, 36 camera positions were defined over a 1 m radius spherical surface, where three imaginary rings were proposed: 0°, 30°, and 60° from the object’s vertical center. In the lower ring, which is 0° from the object’s vertical center (blue in Figure 2), 16 camera positions were established, 22.5° from one another. For the middle ring, which is 30° from the object’s vertical center (purple in Figure 2), 12 camera positions were established, 30° from one another. Finally, for the top ring, which is 60° from the object’s vertical center (red in Figure 2), eight camera positions were established, 45° from one another. This camera network configuration maintains a constant 1 m distance from the object’s vertical center for each photograph. Camera positions and orientations for the proposed camera network configuration were fixed manually, aided by markings located on the floor and a camera tripod with an adjustable height feature.
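The ring layout described above can be sketched as a short script that enumerates the 36 camera centres. The coordinate frame (origin at the object's vertical centre, z pointing up), the reading of the ring angles as elevations above the horizontal, and the choice of azimuth 0 for the first camera of each ring are assumptions made for illustration; the source fixes only the ring angles, the camera counts, and the 1 m radius.

```python
import math

# (elevation in degrees, number of cameras) for the three rings in the text.
RINGS = [(0.0, 16), (30.0, 12), (60.0, 8)]
RADIUS_M = 1.0


def camera_positions(rings=RINGS, radius=RADIUS_M):
    """Return a list of (x, y, z) camera centres in metres.

    Assumed frame: origin at the object's vertical centre, z pointing up,
    first camera of every ring at azimuth 0 (not stated in the source).
    """
    positions = []
    for elevation_deg, count in rings:
        elevation = math.radians(elevation_deg)
        for k in range(count):
            azimuth = 2.0 * math.pi * k / count  # equal angular spacing
            positions.append((
                radius * math.cos(elevation) * math.cos(azimuth),
                radius * math.cos(elevation) * math.sin(azimuth),
                radius * math.sin(elevation),
            ))
    return positions


positions = camera_positions()
print(len(positions))  # 36
```

Every generated position lies at the same 1 m distance from the origin, matching the constant camera-to-object distance stated in the text.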

3.1.3. Object Selection

The dataset includes pictures of 28 different objects. The object list was carefully crafted, including inanimate objects of various sizes, shapes, and surface textures. Object sizes were defined considering space availability and optical constraints. Object heights ranging from 15 to 60 cm were selected to meet the aforementioned conditions. The shapes of the selected objects vary from low complexity for those whose shapes are similar to basic primitives (cubes, spheres, prisms, etc.) to medium to high complexity for those whose shapes are more organic (human-like, articulated, and so on). For the textures, objects made of different materials were selected to evaluate the impact of those features on the final 3D reconstruction while keeping reflective surfaces to a minimum, considering that this could cause problems during reconstruction due to poor feature extraction. Figure 3 shows example captures for the selected objects.

3.1.4. Environmental Conditions

3D reconstruction in photogrammetry heavily relies on acquiring images where the objects’ surface features are visible so they can be properly detected, described, and identified. For this purpose, a suitable environment for object photography was selected, aiming to have standard light conditions across the whole image acquisition stage. The place selected for this task was a building interior space with artificial cold white light, where the background remains static for each of the objects’ image sets. Slight changes in background configuration (e.g., background objects’ location and position) between different object sets were allowed. To ensure a variable background setting, the objects were placed on 28 paper wraps of different colors, textures, and patterns, intentionally causing consistent changes in the background between different objects. Hence, we associate each object with a unique paper wrap.

3.2. Data Processing

This section provides a description of the raw data processing procedures required to generate the final products included in the dataset. First, the background removal methodologies are presented, encompassing both manual processing for ground truth generation and automated approaches based on SOD algorithms. Next, the procedures for image binarization and overlay generation are described. Finally, the 3D reconstruction process is outlined.

3.2.1. Background Removal

This subsection is divided into three parts: ground truth, SOD algorithms, and binarization. The database includes manually segmented ground truth for each of the original images to serve the purpose of evaluating SOD algorithms. Furthermore, automatically segmented versions of the original images are included. Each set of images was processed with three sub-symbolic SOD algorithms based on DL and a symbolic SOD algorithm based on the brain programming paradigm.

Ground Truth

Image ground truth is obtained through a manual process where a contour is selected to segregate the group of pixels that belong to the object of interest (foreground) from those that do not (background). This action is performed in Adobe Photoshop using the Pen Tool. To have clear visibility between pixels, a zoom of 400% on a 24-inch Full HD monitor is used for the contour. After contouring the object, the loop between the initial and final points is closed. The contoured closed loop groups the pixels that belong to the foreground, while the excluded pixels comprise the background. Finally, a layer mask is applied to the loop so that the regions that constitute the object of interest appear in white, while the remaining areas appear in black in the final image. For some objects, multiple contours might be needed depending on their shape. This process can take up to 15 min per image, requiring up to 9 h to process each object’s set of 36 images, depending on the object’s complexity. Figure 4 depicts the ground truth creation process, which essentially lies in defining which pixels belong to the object of interest and distinguishing them from those that constitute the background. As shown in Figure 4, there is a trade-off between pixel detail and context, which is essential for properly selecting regions. As the zoom increases, independent pixels can be accurately identified, but the context of the image regions is lost.

SOD Algorithms

This section compares a symbolic algorithm under the brain programming paradigm with three different sub-symbolic algorithms based on DL techniques.

GBVSBP is an evolutionary computation paradigm that mimics the inner workings of the visual cortex, ref. [9]. Specifically, to solve the SOD task, the dorsal stream functions are replicated to create computational models able to segregate regions of the image that belong to the object of interest from those that belong to the background. The strategy adheres to a goal-oriented paradigm wherein learning is conceptualized as a symbolic optimization, whereby an individual is characterized by a template that delineates the visual attention (VA) model while simultaneously identifying essential components of the algorithm via artificial evolution. We selected this method because it represents a symbolic approach to SOD.

Introduced in 2022, adversarial robust SOD networks with a learnable noise (LeNo) module consist of a shallow noise inspired by the VA mechanism embedded in the encoder, initialized with a cross-shaped Gaussian distribution, and a noise estimation affecting only a single channel of the decoder rather than adding more network elements for postprocessing, ref. [10]. This technique contributes to the SOD’s increased robustness by outperforming earlier research while also improving inference speed. We selected this network due to the authors’ claim that LeNo is robust to noise.

Proposed in 2022, SOD via the Extremely Downsampled Network (EDN) applies a strategy to learn a global view of the complete image, resulting in precise salient object localization, ref. [11]. A scale-correlated pyramid convolution is constructed to improve multi-level feature fusion and recover object details from the extreme downsampling, achieving state-of-the-art performance and real-time speed. We selected this network to test its repeatability against high-resolution images.

The Inverse Saliency Pyramid Reconstruction Network (InSPyReNet) was added to our work due to its continued competitive performance against more recent SOD models, offering an effective balance between state-of-the-art accuracy and computational efficiency under the resource constraints and testing methodology of this study. It was proposed in 2022 and introduced an image pyramid-based framework for high-resolution (HR) SOD without requiring HR training datasets, ref. [12]. The network produces a strict image pyramid structure of a saliency map that enables pyramid-based image blending, achieved with a dedicated method that synthesizes low-resolution (LR) and HR image pyramids to mitigate effective receptive field discrepancies during HR prediction. Multiple SOD metrics and boundary accuracy measure evaluations performed on public LR and HR SOD benchmarks demonstrated that InSPyReNet surpassed state-of-the-art methods.

To establish a common baseline, the aforementioned models were trained on the image database known as FT (frequency tuned). This database was initially introduced by Achanta et al. and is still used to study the SOD problem. FT’s ground truth is obtained by performing a manual object-contour segmentation over images of animate and inanimate objects, ref. [13]. We selected this database due to its similarity to our dataset, although we use high-resolution images of inanimate objects captured from different perspectives. This aspect is relevant since the information registered in our dataset contains projections of a three-dimensional scene, while all other datasets studied within SOD portray an image processing problem.

Binarization

This section explains the process needed to convert the standard output of an SOD algorithm, which is a grayscale image, to a binary output. Depending on the type of algorithm, the resulting image can have different characteristics. For the case of algorithms based on DL, which are unable to work with images at their original size, the input is first scaled to be fed to the neural networks. This unavoidably causes the aspect ratio to be lost and spatial distortion to be introduced. Naturally, the output is also affected by these transformations, resulting in the original salient object image being a square-scaled version of the expected output. At the end of the process, the algorithms based on DL rescale the output image to the size of the original input image, recovering the aspect ratio but introducing additional distortion as a result of this operation. Furthermore, when stretching the original output, some spatial properties of the image are lost since object edges change from regions where a radical transition between white and black pixels is observed to gradients with multiple shades between black and white.

The symbolic SOD algorithm based on the brain programming paradigm is able to work with images at their original size, and its output is a saliency map, which is a grayscale image where multiple shades of gray denote different attention levels. Thus, regions marked in black indicate the lowest attention level, while those marked in white indicate the highest attention level. Any value in between indicates an intermediate attention level, with brighter values representing higher attention and darker values representing lower attention.

Considering the nature of the output images for each SOD algorithm (saliency maps), an image binarization process is necessary to generate the corresponding proto-objects. Hence, the regions in the image that belong to the salient object appear in white, while those that belong to the background appear in black. To this end, a binarization level is selected to establish the threshold for setting a pixel to white or black in the final image. To establish a fair binarization level for each algorithm output, an optimal threshold is selected by finding the value that maximizes the overlap between the binarized saliency map and the manually segmented ground truth. Algorithm 1 explains how the optimal threshold is defined and used to generate the corresponding proto-object from an input saliency map.

Algorithm 1: Generate proto-object

Purpose: Create a proto-object by binarizing a saliency map with the optimal threshold.
Input:

saliency_map: Grayscale image to be binarized.
groundtruth: Manually segmented binary ground truth.

Functions:

Binarize (image, threshold): Returns the input image binarized by the input threshold.
FMeasure (image1, image2): Returns the F-Measure value from the two input binary images.

Variables:

threshold: Stores a value used for binarization.
optimal_threshold: Stores the best threshold.
f_measure: Stores an F-Measure result.
max_f_measure: Stores the maximum F-Measure.
binarized_image: Stores a binarized image.

Output:

proto_object: Proto-object generated by binarizing an input image by the optimal threshold.

f_measure ← 0
max_f_measure ← 0
optimal_threshold ← 0
for (threshold ← 1 to 255) do
    binarized_image ← Binarize (saliency_map, threshold)
    f_measure ← FMeasure (binarized_image, groundtruth)
    if f_measure > max_f_measure then
        max_f_measure ← f_measure
        optimal_threshold ← threshold
    end if
end for
proto_object ← Binarize (saliency_map, optimal_threshold)
return proto_object
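A minimal Python sketch of Algorithm 1 follows, assuming NumPy arrays as images. The function names, the convention that pixels at or above the threshold become white, and the synthetic test image are illustrative assumptions; the F-measure uses the weighting of 0.3 for the squared beta parameter that the text specifies for this evaluation.

```python
import numpy as np


def binarize(image, threshold):
    """Return a 0/255 mask; pixels >= threshold become white (assumed convention)."""
    return np.where(image >= threshold, 255, 0).astype(np.uint8)


def f_measure(mask, groundtruth, beta_sq=0.3):
    """F-measure between two binary masks (nonzero = foreground)."""
    pred = mask > 0
    truth = groundtruth > 0
    true_pos = np.logical_and(pred, truth).sum()
    if true_pos == 0:
        return 0.0  # also covers empty predictions and empty ground truths
    precision = true_pos / pred.sum()
    recall = true_pos / truth.sum()
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall))


def generate_proto_object(saliency_map, groundtruth):
    """Binarize saliency_map with the threshold that maximizes the F-measure."""
    optimal_threshold, max_f = 0, 0.0
    for threshold in range(1, 256):
        score = f_measure(binarize(saliency_map, threshold), groundtruth)
        if score > max_f:
            max_f, optimal_threshold = score, threshold
    return binarize(saliency_map, optimal_threshold), optimal_threshold


# Tiny synthetic example: a bright 2x2 object on a dim, noisy background.
saliency = np.array([[10, 20, 30, 10],
                     [20, 200, 220, 90],
                     [10, 210, 230, 40],
                     [10, 20, 30, 10]], dtype=np.uint8)
truth = np.zeros((4, 4), dtype=np.uint8)
truth[1:3, 1:3] = 255
proto_object, best = generate_proto_object(saliency, truth)
```

Because the comparison is strict, ties are resolved in favour of the lowest threshold that reaches the maximum score, which mirrors the loop in Algorithm 1.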

More formally, considering the ground truth image $G_{n}$, where $\{n : 1 \leq n \leq 36\}$, and a proto-object $P_{m}$, where $\{m : 0 \leq m \leq 255\}$, it is necessary to find the value that maximizes the overlap between $G_{n}$ and $P_{m}$. For this, we use the $F_{\beta}$ measure defined as follows:

$$F_{\beta} = \frac{(1 + \beta^{2})\, p \cdot r}{\beta^{2} p + r}$$

where $p$ is precision $\{p : 0 \leq p \leq 1\}$, $r$ is recall $\{r : 0 \leq r \leq 1\}$, and $\beta$ is the parameter that controls the balance between $p$ and $r$, $\{\beta : 0 \leq \beta \leq \infty\}$. $F_{\beta}$ measures the effectiveness of the overlap considering $\beta^{2} = 0.3$ to emphasize precision following the standard protocol [9]. Then, the overlap between $G_{n}$ and $P_{m}$ is given by the following relation, considering the pair $(G_{n}, P_{m})$ and $m$ thresholds:

$$\operatorname*{arg\,max}_{P} \left( F_{\beta} (G_{n = 1}, P_{m}) \right)$$

The maximum argument $P_{m}$ that maximizes the function $F_{\beta}$ is then calculated. We repeat this process for the $n$ different ground truths corresponding to each image.
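As a purely illustrative calculation, with hypothetical precision and recall values that are not taken from the dataset, consider a binarized map with precision 0.9 and recall 0.6. With the squared beta parameter set to 0.3, the measure lands closer to the precision than the balanced F-measure does:

```latex
F_{\beta} = \frac{(1 + 0.3)\,(0.9)(0.6)}{(0.3)(0.9) + 0.6} = \frac{0.702}{0.87} \approx 0.807,
\qquad
F_{1} = \frac{2\,(0.9)(0.6)}{0.9 + 0.6} = 0.72
```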

Figure 5 shows an example of an image processed with the four different SOD algorithms. There, a clear performance comparison between automatic segmentation techniques can be observed.

3.2.2. Overlay Generation

The primary purpose of the SOD3D dataset is to evaluate the impact of an SOD stage applied to 3D reconstruction. To achieve this, the proto-objects are then used to mask the original images acquired by the camera. As a result, we obtain an image in which all the regions belonging to the background are masked in black, and those that belong to the object of interest retain their original pixel values. These images are called overlays and are the ones to be fed to the photogrammetric pipeline for 3D reconstruction. Figure 6 shows an example of an overlay generated by masking an original image with a proto-object.
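The masking step can be sketched in a few lines of NumPy. The function name and the array layout (height, width, channels) are assumptions for illustration, and the example uses a synthetic 2 × 2 image rather than dataset files.

```python
import numpy as np


def make_overlay(original_rgb, proto_object):
    """Keep pixels where the proto-object is white; set the rest to black.

    original_rgb: (H, W, 3) uint8 image; proto_object: (H, W) mask, nonzero = object.
    """
    if original_rgb.shape[:2] != proto_object.shape:
        raise ValueError("image and mask must have the same height and width")
    keep = proto_object > 0
    # Broadcast the 2D mask over the colour channels.
    return np.where(keep[..., None], original_rgb, 0).astype(original_rgb.dtype)


# Synthetic 2x2 example (values are arbitrary).
image = np.array([[[10, 20, 30], [40, 50, 60]],
                  [[70, 80, 90], [15, 25, 35]]], dtype=np.uint8)
mask = np.array([[255, 0],
                 [0, 255]], dtype=np.uint8)
overlay = make_overlay(image, mask)
```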

3.2.3. 3D Reconstruction

The SOD3D dataset was developed to measure the impact of an SOD stage applied to 3D reconstruction. To this aim, the database includes 3D point clouds that represent the reconstructed objects using the traditional photogrammetry pipeline from each applicable set of images.

The fundamental principle of the photogrammetry pipeline relies on obtaining multiple views of the scene and then triangulating the locations between matching points to estimate their 3D coordinates. This process is incrementally extended to additional views with the aim of constructing a 3D point cloud. The default workflow in Meshroom was used to apply the photogrammetric pipeline over the input images to perform the 3D reconstruction. Based on this workflow, this pipeline can be divided into some critical stages, which are outlined below:

Feature extraction: This stage establishes the foundation for finding the relative pose of cameras in 3D space. Its primary goal is to extract distinctive groups of pixels that remain robust to viewpoint variations during image acquisition. Consequently, a scene feature should produce consistent descriptors across all captured images. The SIFT (Scale-Invariant Feature Transform) algorithm, ref. [14], is employed to extract keypoints and generate a set of image feature descriptors. Texture complexity may vary significantly both across different images and within local regions of the same image, causing the number of extracted features to fluctuate considerably. To address this issue, a post-filtering step is applied to limit the total number of features to a reasonable amount. Finally, grid filtering is used to ensure a homogeneous distribution of feature points throughout the image.
Image matching: The objective of this stage is to identify visual overlap between images, based on their shared feature content. To achieve this, image retrieval techniques are employed to identify images with overlapping content, avoiding the high computational cost of exhaustive feature matching. This is accomplished by representing each image as a compact descriptor, allowing for highly efficient distance evaluations between images. A widely used method for generating global image descriptors is the vocabulary tree approach, ref. [15]. Extracted image feature descriptors are propagated down the tree, hierarchically classified at each node, and assigned to a specific leaf node. Each descriptor is thus reduced to a simple leaf index, and the final image descriptor is then formed by collecting these leaf indices. Finally, the shared feature content between images can be evaluated by comparing these descriptors.
Feature matching: In this stage, features are matched between image pairs, which will subsequently provide the geometric foundation to reconstruct an initial 3D structure from the underlying 2D data. First, photometric matching between the descriptors of the two input images is performed. Each feature in image A is mapped to a list of candidate features from image B. This list is then refined under the assumption that only one valid match exists between the images. Specifically, for each descriptor in the first image, its two closest neighbors in the second image are identified, and a relative distance ratio threshold, ref. [16], is applied. This process yields a list of candidate matches based purely on photometric criteria. Since finding the two closest descriptors for each feature is computationally expensive using a brute-force approach, optimized algorithms are typically employed; while Approximate Nearest Neighbor is the most common, alternatives like Cascading Hashing are also widely used. Geometric filtering is performed using epipolar geometry within the random sample consensus (RANSAC) framework for outlier rejection, ref. [17]. By randomly selecting a small set of feature correspondences, the fundamental matrix is computed. Then the number of features that conform to this geometric model is evaluated and iterated through the RANSAC loop to find the optimal consensus set.
Structure from motion: This stage represents the core of the whole photogrammetry pipeline. It analyzes the geometric relationships between 2D input images to reconstruct a 3D model of the scene and simultaneously determine the pose and internal calibration of all cameras. The incremental pipeline is an iterative reconstruction process that begins with an initial two-view reconstruction and sequentially extends it by integrating new views. To initiate this, feature matches across image pairs are fused into tracks, where each track ideally represents a unique 3D point in space observed by multiple cameras. However, at this stage of the pipeline, the tracks still contain a significant number of outliers. To mitigate this, inconsistent tracks are filtered out during the fusion process. Next, the incremental pipeline must select the optimal initial image pair, which is crucial for a high-quality final reconstruction. This baseline pair must provide robust feature matches and guarantee reliable geometric constraints. Consequently, the selection process prioritizes pairs that maximize both the total match count and the spatial distribution of these features across the image planes. Simultaneously, the baseline must maintain a sufficiently wide triangulation angle between the camera views to ensure robust 3D triangulation. Then, the fundamental matrix between these two images is computed, setting the first camera as the origin of the 3D coordinate system. With the relative pose of the first two cameras established, the corresponding 2D feature matches are triangulated to generate the initial set of 3D points. Subsequently, the pipeline selects new images that share a sufficient number of correspondences with the existing 3D point cloud. This process is known as Next-Best-View selection. Utilizing these 2D–3D associations, the algorithm performs camera resectioning to estimate the pose of each new view. 
This resectioning step employs a Perspective-n-Point algorithm embedded within a RANSAC framework to robustly determine the camera pose that yields the highest consensus of feature matches. A final non-linear minimization step is then applied to refine each camera pose. Following the estimation of these new camera poses, tracks that become visible across two or more resected views are triangulated into 3D points. Next, a global Bundle Adjustment is executed to optimize all parameters together, including the camera intrinsics, extrinsics, and 3D point positions. To maintain reconstruction accuracy, the Bundle Adjustment results are filtered by removing observations that exhibit a high reprojection error or an insufficient triangulation angle. The triangulation of these new points expands the available candidate images for subsequent Next-Best-View selection. This process is executed iteratively: integrating new camera views, triangulating newly observed 2D features into 3D points, and filtering out any 3D points that become geometrically invalidated. This optimization loop continues until no remaining camera views can be localized, ref. [18].
Depth map estimation: In this stage, a depth value for each input pixel is estimated. This is achieved by analyzing the similarities between neighboring cameras in volumetric regions around the intersection of their optical axes in the 3D space. For each reference image, the $N$ nearest neighboring cameras are selected. The fronto-parallel planes are selected based on the intersection of the reference optical axis with the pixels of these neighboring cameras. This plane-sweeping approach constructs a matching volume of dimensions $W$, $H$, $Z$, representing multiple depth candidates per pixel. The matching similarity for all depth candidates is then evaluated using zero-mean normalized cross-correlation computed over a small patch from the reference image reprojected onto the neighboring views. This process generates a raw similarity volume, where photometric costs from each neighboring view are accumulated. Because this volume is inherently noisy, a spatial filtering step is applied along the $X$ and $Y$ axes to aggregate local costs. This filtering effectively suppresses isolated high-similarity outliers. From this regularized volume, the optimal depth is determined by selecting the local minima, mapping the chosen plane indices to their corresponding continuous depth values to populate an initial depth map. Since this depth map is restricted to the discrete intervals of the sampled planes, it exhibits banding artifacts. To resolve this, a sub-pixel refinement step is applied to achieve continuous and precise depth estimations. Finally, a filtering step is applied to ensure consistency between multiple cameras, ref. [19].
Meshing: This stage merges all depth maps into a single dense point cloud and then establishes a relationship between the points to create a surface. First, the individual depth maps are fused into a global octree structure, where geometrically compatible depth values are merged into unified octree cells. Then a 3D Delaunay tetrahedralization is performed over the resulting point cloud. Next, a visibility-based voting procedure is executed to compute weights for both the tetrahedral cells and the facets connecting them, ref. [20]. A Graph Cut Max-Flow algorithm, ref. [21], is subsequently applied to solve for the optimal volumetric segmentation, where the minimum cut defines the extracted manifold mesh surface. After discarding noisy or isolated cells along the boundary, Laplacian filtering is applied to suppress localized mesh artifacts. Finally, a mesh simplification step is performed to drastically reduce the vertex count while preserving structural features.
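Two of the steps above lend themselves to a short illustration: the distance ratio test used in photometric feature matching and the zero-mean normalized cross-correlation (ZNCC) used to score depth candidates. The following is a minimal NumPy sketch, not the AliceVision implementation. The function names `ratio_test_matches` and `zncc` are hypothetical, the nearest-neighbor search is brute force rather than approximate, and the default ratio of 0.8 is taken from the Distance Ratio value listed in Appendix B.

```python
import numpy as np


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Return (i, j) index pairs from desc_a to desc_b that pass the ratio test.

    desc_a, desc_b: arrays of shape (n, d) and (m, d) with m >= 2.
    Hypothetical helper for illustration only.
    """
    # Brute-force L2 distance between every pair of descriptors: shape (n, m).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    # Indices of the two closest neighbors in desc_b for each row of desc_a.
    two_nearest = np.argsort(dists, axis=1)[:, :2]
    matches = []
    for i, (best, second) in enumerate(two_nearest):
        # Keep the match only if the best neighbor is clearly closer
        # than the second best; otherwise it is considered ambiguous.
        if dists[i, best] < ratio * dists[i, second]:
            matches.append((i, int(best)))
    return matches


def zncc(patch_a, patch_b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized patches.

    Returns a value in [-1, 1]; eps guards against division by zero
    for constant patches. Hypothetical helper for illustration only.
    """
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()  # remove the mean intensity of each patch
    b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


# Usage: the first descriptor has one clear neighbor; the second has two
# equally close candidates and is therefore rejected as ambiguous.
a = np.array([[0.0, 0.0], [10.0, 10.0]])
b = np.array([[0.0, 0.1], [5.0, 5.0], [10.0, 10.1], [10.1, 10.0]])
print(ratio_test_matches(a, b))
```

A descriptor with one clearly closest neighbor passes the ratio test, whereas one with two nearly equidistant neighbors is discarded. ZNCC is unaffected by gain and offset changes in patch intensity, which makes it suitable for comparing views with slightly different exposure.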

The 3D point clouds included in the SOD3D dataset are generated from the original images and the overlays obtained by masking them with the ground truth and the different proto-objects derived from the SOD algorithms. In this manner, six different 3D reconstructions for each object are expected to be generated from each set of images: the original images, the ground truth overlay, the GBVSBP overlay, the LeNo overlay, the EDN overlay, and the InSPyReNet overlay. Appendix B provides details on the parameter configuration used at each stage of the photogrammetry pipeline in Meshroom to generate the corresponding point clouds.

Note that certain image sets lack sufficient identifiable features for full reconstruction; consequently, there is no one-to-one correspondence between image sets and 3D point clouds. Despite these limitations, the dataset includes 153 reconstructed 3D models that represent the full diversity of the objects. Figure 7 illustrates examples of 3D models derived from the dataset’s image sets via a traditional photogrammetry pipeline.
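The figure of 153 models can be cross-checked against Table 2. The sketch below is a simple tally under two assumptions that the text does not state explicitly: that every object yields a reconstruction from both its original images and its ground-truth overlay, and that each "N/A" entry in Table 2 corresponds to exactly one missing point cloud.

```python
# Number of "N/A" entries per object, read off Table 2
# (objects without any "N/A" entry are omitted).
missing = {
    "Spidey": 1, "Clock": 3, "Kitty": 1, "Computer": 1, "Helmet": 1,
    "FlowerJar": 2, "CardboardBox": 2, "BackPack": 2, "GymBall": 2,
}

objects = 28
variants = 6  # original, ground truth, GBVSBP, LeNo, EDN, InSPyReNet

expected = objects * variants                    # one model per object and variant
available = expected - sum(missing.values())     # subtract failed reconstructions

print(expected, available)  # 168 153
```

Under these assumptions, the 15 failed reconstructions account exactly for the difference between the 168 expected models and the 153 provided.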

4. Evaluation Metrics

This section outlines the proposed framework used to evaluate the quality of the 3D reconstructions generated by each method. It details the entire procedure, beginning with point cloud preprocessing to ensure a fair comparison and concluding with the specific metrics used for the assessment.

The 3D reconstructions generated by the methods described in the previous sections consist of points positioned and interconnected in 3D space. Assessing the quality of these point clouds requires a direct comparison against a known reference. For this purpose, we generated a 3D ground-truth point cloud by reconstructing a specialized set of image overlays, which were generated by masking the original images with the manually segmented ground truth.

To evaluate the geometric similarity between the reconstructed and reference point clouds, we measure the alignment of their 3D coordinates. The metric selected for this purpose is the Chamfer Distance (CD), which computes the average bidirectional nearest-neighbor distance between the two point sets. Due to its point-wise nature, CD is able to operate despite differences in point quantities between the clouds. Evaluating our 3D reconstructions via CD requires a preliminary alignment and registration step. Because CD does not inherently account for differences in pose—a condition present in our 3D reconstructions—the point clouds must share the same coordinate space and orientation to maximize point-to-point correspondence and guarantee a fair comparison. To achieve this, an initial orientation is calculated using Principal Component Analysis (PCA), a dimensionality reduction technique employed to identify the principal axes of the reconstructed point cloud for global coarse alignment. The alignment is then locally refined using the Iterative Closest Point (ICP) algorithm, which transforms the reconstructed point cloud to maximize correspondence with the target reference. Both PCA and ICP algorithms were implemented in Python 3.10.16, aided by the NumPy 1.26.3, Pandas 2.1.4, and Open3D 0.19.0 open-source libraries. It is important to note that a key limitation of the ICP algorithm is its high sensitivity to initial conditions; poor initial positioning of the source relative to the target point cloud can easily cause the algorithm to converge to a local minimum rather than the global optimum. To overcome this restriction and improve registration accuracy, we manually reoriented specific reconstructed point clouds prior to the automatic alignment steps. This manual operation was performed using CloudCompare 2.13.2, an open-source application. 
Once both the source 3D reconstructed and target reference point clouds achieve their final registered configuration, CD is finally computed to assess their geometric similarity.
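The coarse alignment and the metric can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: `pca_align` and `chamfer_distance` are hypothetical names, the Chamfer Distance is written as the sum of the two directional mean nearest-neighbor distances (one common convention; the exact normalization used for Table 2 is not specified in the text), and the brute-force distance matrix is only practical for small clouds. The ICP refinement step is omitted.

```python
import numpy as np


def pca_align(points):
    """Center a cloud and rotate it so its principal axes follow x, y, z.

    points: array of shape (n, 3). Hypothetical helper for illustration.
    """
    centered = points - points.mean(axis=0)
    # Eigenvectors of the 3x3 covariance matrix are the principal axes.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]  # largest variance first
    if np.linalg.det(axes) < 0:
        axes[:, 2] *= -1  # keep a proper rotation (no mirroring)
    return centered @ axes


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between clouds a (n, 3) and b (m, 3).

    Assumed convention: mean nearest-neighbor distance from a to b
    plus the mean nearest-neighbor distance from b to a.
    """
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


# Usage: two clouds with different point counts can still be compared.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
reconstruction = reference[:150] + 0.01 * rng.normal(size=(150, 3))
print(chamfer_distance(pca_align(reconstruction), pca_align(reference)))
```

Because the sign of each principal axis is arbitrary, PCA alone can leave two clouds flipped by 180° about an axis. This is consistent with the need, noted above, for ICP refinement and occasional manual reorientation before CD is computed.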

Table 2 compares the proposed 3D ground truth with the reconstructed point clouds across all objects and techniques.

The comparison results between the proposed 3D ground truth and the reconstructed point clouds are summarized in Table 2. The CD metric measures the average distance between the two point clouds; therefore, a lower value indicates higher similarity, while a higher value denotes greater discrepancy. For some objects, the CD of certain SOD algorithms is marked as “N/A” (not available). These gaps arise from the lack of identifiable features in certain image sets, primarily due to poor segmentation, which prevents successful 3D reconstruction; consequently, no metric can be reported.

5. User Notes

This section addresses the key limitations of our dataset, including the effects of arbitrarily defining the camera positions, the controlled environment, and the SOD model selection and training.

This dataset is focused on photogrammetric 3D reconstruction, whose primary goal is to recover 3D geometric features from a set of scene images. For this purpose, 36 images acquired from different camera positions are included in the dataset for each object to be reconstructed in 3D. The camera positions were arbitrarily defined, aiming to maximize the object’s surface coverage while keeping them standard across the whole dataset, despite the variety of object shapes. This represents a drawback for accurate 3D reconstruction since, according to ref. [22], carefully selecting these camera positions and attitudes minimizes errors in 3D geometric features, thereby enhancing 3D reconstruction quality. Future work can explore an optimal camera network design tailored to each object’s morphology. Nevertheless, this dataset was created to study the impact of SOD on 3D reconstruction; hence, this limitation does not undermine its value.

The original images were acquired in a controlled environment, where the illumination, distance, and camera positions were carefully selected. The dataset can be expanded to include pictures acquired under different environmental conditions (e.g., other camera distances and positions, exterior environments, and natural light) to measure their impact on the final 3D reconstruction.

In addition to the manually segmented ground truth for each original image, this dataset includes automatically segmented versions. These automatically segmented versions of the images were created using symbolic and sub-symbolic SOD algorithms. The FT image database was used to train each SOD approach applied in our work to establish a baseline. Further testing can be performed by diversifying the SOD algorithms applied to create this dataset and by training them with different databases.

Author Contributions

Conceptualization, A.B.R., G.O. and E.C.; methodology, A.B.R., G.O. and E.C.; software, A.B.R. and M.O.; validation, A.B.R., G.O., E.C. and M.O.; formal analysis, A.B.R., G.O. and E.C.; investigation, A.B.R.; resources, G.O. and E.C.; data curation, A.B.R. and M.O.; writing—original draft preparation, A.B.R.; writing—review and editing, A.B.R., G.O., E.C. and M.O.; visualization, A.B.R.; supervision, G.O. and E.C.; project administration, G.O. and E.C.; funding acquisition, G.O. and E.C. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Centro de Investigación Científica y Educación Superior de Ensenada (CICESE), No. 31830 and Tecnológico Nacional de México/CENIDET, No. 24345.26-P.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this work is available at https://doi.org/10.17632/695w3dws5f.2 under a Creative Commons Attribution (CC BY 4.0) international license.

Acknowledgments

The authors dedicate this work to the memory of our friend, colleague and mentor, Gustavo Olague. His scientific vision, guidance, generosity, and passion for research profoundly influenced our professional and personal development. His legacy continues to inspire our work and will remain an enduring source of motivation for future generations of researchers.

Conflicts of Interest

Author Matthieu Olague was employed by the company IBM Technology, Campus Guadalajara. However, there is no financial or any other type of relationship between the company and this research work. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SOD	Salient Object Detection
DL	Deep Learning
HDR	High Dynamic Range
VA	Visual Attention
HR	High Resolution
LR	Low Resolution
CD	Chamfer Distance
PCA	Principal Component Analysis
ICP	Iterative Closest Point
SIFT	Scale-Invariant Feature Transform
RANSAC	Random Sample Consensus

Appendix A. Dataset Overview

This appendix provides an overview of the dataset. Table A1 includes the subject area, data format, how and where the data was collected, and how it can be accessed.

Table A1. Dataset specifications table.

Subject	Computer Sciences
Specific subject area	3D Reconstruction, Photogrammetry, Image Segmentation, Computer Vision and Pattern Recognition, Applied Machine Learning, Computer Graphics, and Computer-Aided Design.
Type of data	Image (RGB JPG, Grayscale PNG, Binary PNG), 3D point cloud (CSV)
Data collection	This repository contains a total of 15,120 images and 153 reconstructed 3D models. The dataset is divided into 28 sets of high-resolution images of different everyday objects. Each set comprises 36 images of each object, acquired from various camera positions with the primary goal of reconstructing the object through photogrammetric techniques. The 1008 original images were taken with a Canon EOS 5D Mark III camera and a 24–105 mm variable lens, keeping optical parameters as fixed as possible to standardize image acquisition. To evaluate salient object detection (SOD) algorithms, the database includes manually segmented ground truth for each original image (Adobe Photoshop CS6 13.0). Furthermore, the repository contains automatically segmented versions of the images processed with three state-of-the-art deep learning (DL)-based SOD algorithms (Python 3.10.16) and an additional set of images processed with a symbolic SOD algorithm based on the brain programming paradigm (MATLAB R2017b). Finally, for 3D reconstructions, the database includes 3D point clouds for each object (Meshroom-2021.1.0).
Data source location	The FT database was recovered from [13]. The SOD3D database was collected from an office building located in Mexicali, Baja California, Mexico.
Data accessibility	Repository name: SOD3D: A salient object detection dataset for photogrammetric 3D reconstruction Data identification number: 10.17632/695w3dws5f.2 Direct URL to data: https://data.mendeley.com/datasets/695w3dws5f/2 (accessed on 26 May 2026) Instructions for accessing these data: The dataset is available at Mendeley Data.

Appendix B. 3D Reconstruction Parameters

This appendix provides details on the parameter configuration used for generating the 3D point clouds from each set of images in Meshroom-2021.1.0. Table A2 includes the name of each stage in shaded cells and the name and value for each parameter.

Table A2. 3D reconstruction parameter configuration.

Camera Init
Sensor Database	cameraSensors.db
Default Field Of View	45
Group Camera Fallback	Folder
Allowed Camera Models	Pinhole, radial1, radial3, brown, fisheye4, fisheye1
Apply internal white balance	☑ ¹
ViewId Method	metadata
Verbose Level	info
Feature Extraction
Describer Types	sift
Describer Density	ultra
Describer Quality	ultra
Contrast Filtering	GridSort
Grid Filtering	☑ ¹
Force CPU Extraction	☑ ¹
Max Nb Threads	0
Verbose Level	info
Image Matching
Method	VocabularyTree
Voc Tree: Tree	vlfeat_K80L3.SIFT.tree
Voc Tree: Minimal Number of Images	200
Voc Tree: Max Descriptors	500
Voc Tree: Nb Matches	50
Max Nb Threads	0
Verbose Level	info
Feature Matching
Photometric Matching Method	ANN_L2
Geometric Estimator	acransac
Geometric Filter Type	Fundamental_matrix
Distance Ratio	0.8
Max Iteration	2048
Geometric Validation Error	0
Known Poses Geometric Error Max	5
Max Matches	0
Save Putative Matches	☐ ²
Cross Matching	☐ ²
Guided Matching	☐ ²
Match From Known Camera Poses	☐ ²
Export Debug Files	☐ ²
Verbose Level	info
Structure From Motion
Localizer Estimator	acransac
Observation Constraint	Basic
Localizer Max Ransac Iterations	4096
Localizer Max Ransac Error	0
Lock Scene Previously Reconstructed	☐ ²
Local Bundle Adjustment	☑ ¹
LocalBA Graph Distance	1
Maximum Number of Matches	0
Minimum Number of Matches	0
Min Input Track Length	2
Min Observation For Triangulation	2
Min Angle For Triangulation	3
Min Angle For Landmark	2
Max Reprojection Error	4
Min Angle Initial Pair	5
Max Angle Initial Pair	40
Use Only Matches From Input Folder	☐ ²
Use Rig Constraint	☑ ¹
Force Lock of All Intrinsic Camera Params.	☐ ²
Filter Track Forks	☐ ²
Inter File Extension	.abc
Verbose Level	info
Prepare Dense Scene
Output File Type	exr
Save Metadata	☑ ¹
Save Matrices Text Files	☐ ²
Correct images exposure	☐ ²
Verbose Level	info
Depth Map
Downscale	2
Min View Angle	2
Max View Angle	70
SGM: Nb Neighbor Cameras	10
SGM: WSH	4
SGM: GammaC	5.5
SGM: GammaP	8
Refine: Nb Neighbor Cameras	6
Refine: Number of Samples	150
Refine: Number of Depths	31
Refine: Number of Iterations	100
Refine: WSH	3
Refine: Sigma	15
Refine: GammaC	15.5
Refine: GammaP	8
Refine: Tc or Rc pixel size	☐ ²
Export Intermediate Results	☐ ²
Number of GPUs	0
Verbose Level	info
Depth Map Filter
Min View Angle	2
Max View Angle	70
Number of Nearest Cameras	10
Min Consistent Cameras	3
Min Consistent Cameras Bad Similarity	4
Filtering Size in Pixels	0
Filtering Size in Pixels Bad Similarity	0
Compute Normal Maps	☐ ²
Verbose Level	info
Meshing
Custom Bounding Box	☐ ²
Estimate Space From SfM	☑ ¹
Min Observations For SfM Space Est.	3
Min Observations Angle For SfM Space Est.	10
Max Input Points	50,000,000
Max Points	5,000,000
Max Points Per Voxel	1,000,000
Min Step	2
Partitioning	singleBlock
Repartition	multiResolution
angleFactor	15
simFactor	15
pixSizeMarginInitCoef	2
pixSizeMarginFinalCoef	4
voteMarginFactor	4
contributeMarginFactor	2
simGaussianSizeInit	10
simGaussianSize	10
minAngleThreshold	1
Refine Fuse	☑ ¹
Helper Points Grid Size	10
Densify	☐ ²
Nb Pixel Size Behind	4
Full Weight	1
Weakly Supported Surface Support	☑ ¹
Add Landmarks To The Dense Point Cloud	☐ ²
Tetrahedron Neighbors Coherency Nb It.	10
minSolidAngleRatio	0.2
Nb Solid Angle Filtering Iterations	2
Colorize Output	☐ ²
Add Mask Helper Points	☐ ²
Helper Points: Mask Segment Size	50
Save Raw Dense Point Cloud	☐ ²
Export DEBUG Tetrahedralization	☐ ²
Seed	0
Verbose Level	info
Mesh Filtering
Keep Only the Largest Mesh	☐ ²
Smoothing Subset	all
Smoothing Boundaries Neighbors	0
Smoothing Iterations	5
Smoothing Lambda	1
Filtering Subset	all
Filtering Iterations	1
Filter Large Triangles Factor	60
Filter Triangles Ratio	0
Verbose Level	info
Texturing
Texture Side	8192
Texture Downscale	2
Texture File Type	png
Unwrap Method	Basic
Use UDIM	☑ ¹
Fill Holes	☐ ²
Padding	5
MultiBand Downscale	4
MultiBand contributions High Freq	1
MultiBand contributions Mid-High Freq	5
MultiBand contributions Mid-Low Freq	10
MultiBand contributions Low Freq	0
Use Score	☑ ¹
Best Score Threshold	0.1
Angle Hard Threshold	90
Process Colorspace	sRGB
Correct Exposure	☐ ²
Force Visible By All Vertices	☐ ²
Flip Normals	☐ ²
Visibility Remapping Method	PullPush
Subdivision Target Ratio	0.8
Verbose Level	info

¹ ☑ Active parameter. ² ☐ Inactive parameter.

References

1. Griwodz, C.; Gasparini, S.; Calvet, L.; Gurdjos, P.; Castan, F.; Maujean, B.; De Lillo, G.; Lanthony, Y. AliceVision Meshroom: An open-source 3D reconstruction pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference, Istanbul, Turkey, 28 September–1 October 2021.
2. Yan, Q.; Xu, L.; Shi, J.; Jia, J. Hierarchical Saliency Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
3. Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
4. Wang, L.; Huchuan, L.; Wang, Y.; Mengyang, F.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-Level Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017.
5. Fan, P.; Zhang, J.; Xu, G.; Cheng, M.; Shao, L. Salient Objects in Clutter. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2344–2366.
6. Reizenstein, J.; Shapovalov, R.; Henzler, P.; Sbordone, L.; Labatut, P.; Novotny, D.; Ruan, X. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
7. Gabara, G.; Sawicki, P. CRBeDaSet: A Benchmark Dataset for High Accuracy Close Range 3D Object Reconstruction. Remote Sens. 2023, 15, 1116.
8. Chierchia, R.; Lebrat, L.; Ahmedt-Aristizabal, D.; Salvado, O.; Fookes, C.; Cruz, R. SALVE: A 3D Reconstruction Benchmark of Wounds from Consumer-Grade Videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025.
9. Olague, G.; Menendez-Clavijo, J.A.; Olague, M.; Ocampo, A.; Ibarra-Vazquez, G.; Ochoa, R.; Pineda, R. Automated design of salient object detection algorithms with brain programming. Appl. Sci. 2022, 12, 10686.
10. Wang, H.; Wan, L.; Tang, H. LeNo: Adversarial robust salient object detection networks with learnable noise. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington D.C., USA, 7–14 February 2023.
11. Wu, Y.; Liu, Y.; Zhang, L.; Cheng, M.; Ren, B. EDN: Salient object detection via extremely-downsampled network. IEEE Trans. Image Process. 2022, 31, 3125–3136.
12. Taehun, K.; Kunhee, K.; Joonyeong, L.; Dongmin, C.; Jiho, L.; Daijin, K. Revisiting Image Pyramid Structure for High Resolution Salient Object Detection. In Proceedings of the Sixteenth Asian Conference on Computer Vision, Macao, China, 4–8 December 2022.
13. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
14. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999.
15. Nister, D.; Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006.
16. Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
17. Fischler, M.; Bolles, R. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
18. Hartley, R.; Zisserman, A. 3D Reconstruction of Cameras and Structure. In Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2003; pp. 262–278.
19. Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
20. Jancosek, M.; Pajdla, T. Multi-view reconstruction preserving weakly-supported surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
21. Boykov, Y.; Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1124–1137.
22. Olague, G.; Mohr, R. Optimal camera placement for accurate reconstruction. Pattern Recognit. 2002, 35, 927–944.

Figure 1. Repository organization and file naming conventions for the SOD3D dataset.

Figure 2. Proposed camera network configuration: side (A) and top (B) views.

Figure 3. Example captures from the SOD3D dataset.

Figure 4. Manual ground truth generation example.

Figure 5. Comparison between saliency maps and proto-objects.

Figure 6. When an original image (A) is masked by its proto-object (B), an image overlay (C) is generated.

Figure 7. Examples of 3D reconstructed objects.

Table 1. File naming conventions and formats.

File	File Naming Convention	File Format
Original capture	Object_Orig_FileName.jpg	RGB JPG image (5760 × 3840 px)
Ground truth	Object_GT_FileName.png	Binary PNG image (5760 × 3840 px)
Saliency map	Object_SM_FileName.png	Grayscale PNG image (variable)
Proto-object	Object_PO_FileName.png	Grayscale PNG image (5760 × 3840 px)
Overlay	Object_OL_FileName.jpg	RGB JPG image (5760 × 3840 px)
Reconstruction	Object_3D_FileName.csv	3D coordinates CSV
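The naming scheme in Table 1 can be expressed as a small lookup. This is a hypothetical helper for illustration, not part of the dataset tooling: `KINDS` and `build_name` are assumed names, and the values passed for the object and file name are placeholders, since the concrete file names in the repository are described in Figure 1 rather than here.

```python
# Tag and extension for each file kind, as listed in Table 1.
KINDS = {
    "original": ("Orig", "jpg"),
    "ground_truth": ("GT", "png"),
    "saliency_map": ("SM", "png"),
    "proto_object": ("PO", "png"),
    "overlay": ("OL", "jpg"),
    "reconstruction": ("3D", "csv"),
}


def build_name(obj, kind, file_name):
    """Compose a file name following the Object_<Tag>_FileName.<ext> pattern."""
    tag, ext = KINDS[kind]
    return f"{obj}_{tag}_{file_name}.{ext}"


# Usage with the placeholders used in Table 1 itself.
print(build_name("Object", "overlay", "FileName"))  # Object_OL_FileName.jpg
```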

Table 2. Point cloud quality evaluation.

Object	Algorithm	Chamfer Distance
Spidey	EDN	0.12493001
	LeNo	0.10088636
	InSPyReNet	N/A *
	GBVSBP	0.21061425
Briefcase	EDN	0.1155459
	LeNo	0.08756472
	InSPyReNet	0.09247724
	GBVSBP	0.04844216
Clock	EDN	N/A *
	LeNo	N/A *
	InSPyReNet	N/A *
	GBVSBP	0.07456226
Kitty	EDN	0.25639119
	LeNo	0.25939732
	InSPyReNet	N/A *
	GBVSBP	0.24165641
LunchBox	EDN	0.03899479
	LeNo	0.40911003
	InSPyReNet	0.42182701
	GBVSBP	0.19837973
PowerSupply	EDN	0.02840842
	LeNo	0.08677601
	InSPyReNet	0.20573723
	GBVSBP	0.02240334
Computer	EDN	0.10303119
	LeNo	0.10168622
	InSPyReNet	N/A *
	GBVSBP	0.11541542
Scope	EDN	0.06051309
	LeNo	0.30093739
	InSPyReNet	0.28118946
	GBVSBP	0.08124034
SoapBox	EDN	0.53130419
	LeNo	0.48948671
	InSPyReNet	0.42890567
	GBVSBP	0.25665115
CoffeeCreamer	EDN	0.39874113
	LeNo	0.04602422
	InSPyReNet	0.08920528
	GBVSBP	0.05905232
WineBox	EDN	0.36381791
	LeNo	0.23733872
	InSPyReNet	0.22208681
	GBVSBP	0.01867313
Helmet	EDN	0.21555422
	LeNo	0.13328942
	InSPyReNet	N/A *
	GBVSBP	0.27745244
CookieBox	EDN	0.04638342
	LeNo	0.05956832
	InSPyReNet	0.04750483
	GBVSBP	0.00994897
FlowerJar	EDN	N/A *
	LeNo	N/A *
	InSPyReNet	0.14024730
	GBVSBP	0.18039248
GasTank	EDN	0.37510784
	LeNo	0.10899083
	InSPyReNet	0.22762135
	GBVSBP	0.04482456
GreenBook	EDN	0.12753811
	LeNo	0.04932267
	InSPyReNet	0.05293706
	GBVSBP	0.02629686
RinseBottle	EDN	0.06945133
	LeNo	0.26319021
	InSPyReNet	0.11594654
	GBVSBP	0.07172215
DasBoot	EDN	0.25387577
	LeNo	0.2114202
	InSPyReNet	0.25797751
	GBVSBP	0.03501514
Extinguisher	EDN	0.20903451
	LeNo	0.14175203
	InSPyReNet	0.17548323
	GBVSBP	0.04535888
CardboardBox	EDN	N/A *
	LeNo	0.32436333
	InSPyReNet	0.3766473
	GBVSBP	N/A *
WoodenBox	EDN	0.58540699
	LeNo	0.38827273
	InSPyReNet	0.49671172
	GBVSBP	0.08235115
BackPack	EDN	N/A *
	LeNo	0.15253459
	InSPyReNet	0.15465638
	GBVSBP	N/A *
PaperRoll	EDN	0.55478108
	LeNo	0.36925125
	InSPyReNet	0.45234128
	GBVSBP	0.35134591
WhiteDog	EDN	0.1720007
	LeNo	0.28966198
	InSPyReNet	0.18140550
	GBVSBP	0.40507388
Igloo	EDN	0.30896307
	LeNo	0.26223192
	InSPyReNet	0.25463686
	GBVSBP	0.22047119
Alcohol	EDN	0.17315696
	LeNo	0.17300348
	InSPyReNet	0.06156967
	GBVSBP	0.31946609
GymBall	EDN	0.23029317
	LeNo	N/A *
	InSPyReNet	N/A *
	GBVSBP	0.13283110
Skull	EDN	0.16169256
	LeNo	0.27576345
	InSPyReNet	0.16997893
	GBVSBP	0.14140444

* Point cloud not available due to unfeasible 3D reconstruction.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Barrera Román, A.; Olague, G.; Clemente, E.; Olague, M. SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction. Data 2026, 11, 157. https://doi.org/10.3390/data11070157

