Data Descriptor

SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction

1
Department of Computer Science, Centro de Investigación Científica y Educación Superior de Ensenada, Ensenada 22860, Mexico
2
Department of Computer Science, Tecnológico Nacional de México/CENIDET, Cuernavaca 62490, Mexico
3
IBM Technology Campus Guadalajara, El Salto 45680, Mexico
*
Author to whom correspondence should be addressed.
Data 2026, 11(7), 157; https://doi.org/10.3390/data11070157 (registering DOI)
Submission received: 14 April 2026 / Revised: 28 May 2026 / Accepted: 29 May 2026 / Published: 25 June 2026
(This article belongs to the Section Information Systems and Data Management)

Abstract

Three-dimensional (3D) reconstruction from a photogrammetric perspective aims to infer the geometric structure of a scene from a set of images, including the recovery of depth information inherently lost during image acquisition. Conventional photogrammetric pipelines rely on multiple handcrafted processing stages, often requiring manual intervention. This work introduces a dataset designed to support the study of background removal techniques in photogrammetric workflows through salient object detection (SOD). The dataset comprises 15,120 images divided into sets of 28 distinct objects, each set including 36 high-resolution RGB images captured from multiple viewpoints. Additionally, each set provides 36 manually segmented images, as well as automatically segmented versions obtained using four different SOD algorithms. To facilitate evaluation and reproducibility, 153 reconstructed 3D models are provided across all object categories, and a 3D reconstruction evaluation methodology based on the Chamfer Distance metric is proposed, enabling the analysis of the impact of different segmentation strategies on 3D reconstruction. The dataset offers a benchmark resource for the development, comparison, and validation of methods aimed at improving photogrammetric pipelines through automated information filtering.
Dataset License: CC-BY 4.0

1. Summary

3D reconstruction from a photogrammetry standpoint is the process of inferring geometric features of a scene from a set of pictures. These geometric features include depth information from the original scene, which is unavoidably lost when capturing images. There is a broadly known and well-established process to perform this task called the photogrammetric pipeline, ref. [1]; nevertheless, several stages of this process require handcrafted tasks to be performed. We introduce salient object detection (SOD) as a key topic for 3D reconstruction, since this technique enables the filtering of non-relevant information from images by identifying the regions that belong to the objects of interest, which are known as proto-objects. Formally, the SOD problem is defined as follows: Given a set of images $I = \{I_1, I_2, \ldots, I_n\}$ and their corresponding proto-objects $P = \{P_1, P_2, \ldots, P_n\}$, the SOD model establishes a relationship between an image ($I_i$) and the proto-object ($P_i$). This information filtering technique becomes convenient for feeding only useful 2D information to the photogrammetric pipeline, thus avoiding manual postprocessing in the 3D domain. Therefore, the SOD3D dataset allows researchers to evaluate the impact of an SOD stage applied to 3D reconstruction and contrast the results between multiple visual attention algorithms. In this work, we provide as examples one symbolic approach and three sub-symbolic approaches based on deep learning (DL) techniques.
The SOD3D dataset is designed to test the application of SOD to enhance the photogrammetric 3D reconstruction; thus, we explored the state of the art in datasets related to these two problems: SOD and photogrammetric 3D reconstruction. For salient object detection, the benchmark datasets were developed in the previous decade. Ref. [2] was presented in 2015 with the main goal of addressing the limitations of earlier datasets, which were considered to be simple and not representative of real-world conditions. Later that year, ref. [3] was introduced, seeking to add complexity regarding foreground–background contrast and multiple salient objects. One of the largest and most widely used benchmarks is ref. [4], developed in 2017, claiming to be more challenging in terms of image complexity. Later, in 2022, ref. [5] was presented, introducing new challenges like blurry images, object occlusion, and background clutter, with the remarkable new consideration of having no salient objects in some of the presented images. In addition to the fact that these SOD datasets lack high-resolution images, none of them is focused on the problem of 3D reconstruction. However, more recently introduced datasets do address the problem of 3D reconstruction. Ref. [6] was presented by Meta in 2021, including over 1.5 million images from 50 different object categories, making it the most extensive dataset designed for the 3D reconstruction problem. Ref. [7] was later introduced in 2023, focused on a building with challenging textures. Introduced in 2025 for the medical field, ref. [8] addresses the problem of non-invasive monitoring of chronic wounds. Besides the fact that the camera network for acquiring the images for these 3D datasets is not standardized, none of them includes manually segmented versions of the images where the object of interest is isolated from the background, since they do not consider SOD as part of the 3D reconstruction problem. 
From the state of the art, we can see that the uniqueness of the proposed dataset comes from identifying an overlap and addressing critical concerns from both the SOD and 3D reconstruction problems: high-resolution imagery, standardized acquisition, and manually segmented ground truths.
In summary, the public release of this dataset offers several key advantages for researchers working in the field of photogrammetric 3D reconstruction:
  • To the best of our knowledge, our proposal establishes a benchmark through a high-resolution dataset of original and processed images specifically designed for studying visual attention algorithms applied to photogrammetric 3D reconstruction.
  • The dataset was developed by acquiring original images in a controlled and standardized manner, allowing researchers to fairly assess the performance of different SOD algorithms from a 3D reconstruction perspective.
  • The release of this data enables researchers to measure the impact of an SOD stage on 3D reconstruction and compare multiple visual attention algorithms: one based on a symbolic approach and three additional sub-symbolic approaches based on DL techniques.
  • By including both automatically produced saliency maps and manually generated ground truth images, the dataset provides the necessary baselines to fairly evaluate the performance of different SOD algorithms against a defined metric.

2. Data Description

SOD3D is a testing dataset built to evaluate the impact of an SOD stage applied to 3D reconstruction. It consists of images of 28 objects of different sizes, shapes, and textures, as well as their processed versions. Starting with 36 original photographs taken for each object from various points of view, a manually segmented ground truth is included for each original image as a baseline for SOD algorithm evaluation. Furthermore, saliency maps generated with four SOD algorithms, one symbolic, based on the brain programming methodology, and three sub-symbolic, based on DL techniques, are included. Additionally, binarized versions of each saliency map, referred to as proto-objects, are included to evaluate the performance of the SOD algorithms. Furthermore, overlays obtained by masking the original images with the proto-objects are included as an additional image set. Finally, a set of 3D reconstructions is provided in the form of point clouds. The repository’s root contains a directory for each object. Inside these directories are folders for the original images and their processed versions. The generic folder structure of the repository is depicted in Figure 1.
Inside each object’s folder, a total of six folders contain the original images, the manually segmented ground truth, the saliency maps generated by each SOD algorithm, the proto-objects derived from the saliency maps, the overlays generated from masking the original images with the proto-objects, and the final reconstructions resulting from each object after applying each proposed technique. As shown in Figure 1, the Original and GroundTruth folders contain the raw images and their manually segmented versions. Then, the SaliencyMap and ProtoObject folders share the same structure, where an independent directory is reserved for each SOD algorithm. Similarly, Overlay includes a folder for the data derived from each SOD algorithm, with the addition of a folder called GroundTruth, which contains the data derived from processing the original images using the manually segmented ground truth, to serve as a reference. Finally, the Reconstruction folder contains the available 3D reconstructions in the form of point clouds for each applied technique. Each file name includes a prefix that contains identifiers that encode the related object and the corresponding processing stage. Table 1 shows the naming conventions and file formats for each stage in the pipeline.

3. Methods

This section, divided into two parts, outlines the methodology proposed to acquire and process the data. It covers the initial setup for collecting raw data and the subsequent processing required to generate the final products included in the dataset.

3.1. Initial Setup

This section, divided into four parts, details how the images that constitute the core of the dataset were acquired. We start by describing the camera used for the acquisition and the optical parameters selected for this purpose, then explain the camera network configuration, describe the objects selected for the dataset, and finally detail the environmental conditions.

3.1.1. Optical Parameters

3D reconstruction in photogrammetry heavily relies on acquiring images where objects of interest appear crisp enough for the features on their surface to be properly detected, described, and identified. For this purpose, it is crucial that the whole surface of the object of interest appears sharp and in focus in each image. To meet this requirement, an appropriate camera and optical parameter configuration must be selected.
The camera setup selected for the image acquisition task was a Canon EOS 5D Mark III with a Canon EF 24–105 mm f/4 IS USM zoom lens. This camera has a full-frame sensor delivering a maximum spatial resolution of 5760 × 3840 pixels. The zoom lens can be adjusted to any focal length between 24 and 105 mm.
The depth of field is the distance between the closest and farthest objects that appear acceptably sharp in an image. To acquire images where the object of interest appears entirely focused, the depth of field must be deep enough to cover the entire object of interest, at the very least, regardless of camera position and attitude during the acquisition. Since the focus distance was set to be constant and relatively short (1 m) due to space limitations, the aperture and focal length were the parameters to be adjusted. The focal length was kept at relatively small values considering the short focus distance; however, to minimize perspective distortion, the shortest available focal length was avoided as much as possible, depending on the dimensions of the object of interest. Thus, focal length values between 24 and 35 mm were set during the acquisition. The only remaining optical parameter related to focus was the aperture, which is crucial for defining the image’s exposure.
The so-called exposure triangle comprises three key camera settings that work together to determine how light is captured and directly impact the final image exposure. The three parameters are the aperture, sensitivity, and shutter speed. The aperture consists of the actual diaphragm opening that controls the amount of light impacting the sensor. The sensitivity is the level of amplification of the amount of light detected by the sensor. The shutter speed is the period the sensor remains exposed to light during the capture. There is a trade-off between these three parameters since they are highly interrelated, and adjusting each one directly impacts the final exposure in the image.
To avoid known noise effects (e.g., grainy patterns in low-contrast areas) in the final image, sensitivity was set to a low value of 200, which is also recommended for artificially illuminated building interiors. To ensure a suitable depth of field, an aperture value of f/8 was set. Low light levels resulting from the selected small aperture had to be compensated through longer exposure times, which were variable and automatically selected on each capture depending on the scene light conditions. For this purpose, the aperture priority setting was used during the whole acquisition. Finally, the high dynamic range (HDR) setting was active for each capture to ensure a homogeneous level of detail over the entire image. In this mode, three images with different exposure levels are acquired for each capture and automatically mixed to obtain an HDR image. Starting from an image with an ideal exposure value of 0, an under-exposed image (exposure value of −3) is acquired to retain details in bright areas, and an over-exposed image (exposure value of +3) is also acquired to keep details in dark areas. Then, the three images are automatically merged into a single capture with evenly spread details across the whole image. This process is performed internally by the Canon EOS 5D Mark III as part of its HDR capabilities.
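The interplay between focal length, aperture, and focus distance described above can be checked with the standard thin-lens depth-of-field approximation. The following is a minimal sketch, not the authors' procedure: the function name `depth_of_field_mm` is hypothetical, and the circle of confusion of 0.03 mm is a common full-frame convention assumed here rather than a value given in the text.

```python
import math

def depth_of_field_mm(focal_mm, f_number, focus_mm, coc_mm=0.03):
    """Return the (near, far) limits of acceptable sharpness in millimetres.

    coc_mm is the circle of confusion; 0.03 mm is an assumed full-frame
    convention, not a value taken from the text.
    """
    # Hyperfocal distance for the given focal length and aperture.
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = focus_mm * (hyperfocal - focal_mm) / (hyperfocal + focus_mm - 2 * focal_mm)
    if focus_mm >= hyperfocal:
        # Focusing at or beyond the hyperfocal distance keeps infinity sharp.
        return near, math.inf
    far = focus_mm * (hyperfocal - focal_mm) / (hyperfocal - focus_mm)
    return near, far

# Settings from the text: f/8, 1 m focus distance, 24 to 35 mm focal length.
for focal in (24, 35):
    near, far = depth_of_field_mm(focal, f_number=8, focus_mm=1000)
    print(f"{focal} mm at f/8: sharp from {near:.0f} mm to {far:.0f} mm")
```

Under these assumptions, the zone of acceptable sharpness spans roughly 0.71 m to 1.69 m at 24 mm and roughly 0.84 m to 1.23 m at 35 mm, which illustrates why shorter focal lengths were preferred for larger objects at a fixed 1 m focus distance.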

3.1.2. Camera Network Configuration

Photographs from different camera positions were acquired for each object with the purpose of recovering its three-dimensional features. Figure 2 depicts the camera network configuration used for the captures from side and top views. The camera positions were arbitrarily defined to maximize the object’s surface coverage while keeping them standard across the whole dataset despite variability in object shapes. For this purpose, 36 camera positions were defined over a 1 m radius spherical surface, where three imaginary rings were proposed: 0°, 30°, and 60° from the object’s vertical center. In the lower ring, which is 0° from the object’s vertical center (blue in Figure 2), 16 camera positions were established, 22.5° from one another. For the middle ring, which is 30° from the object’s vertical center (purple in Figure 2), 12 camera positions were established, 30° from one another. Finally, for the top ring, which is 60° from the object’s vertical center (red in Figure 2), eight camera positions were established, 45° from one another. This camera network configuration maintains a constant 1 m distance from the object’s vertical center for each photograph. Camera positions and orientations for the proposed camera network configuration were fixed manually, aided by markings located on the floor and a camera tripod with an adjustable height feature.
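The ring layout above can be reproduced numerically. The sketch below generates the 36 nominal camera centres on a 1 m sphere; the coordinate frame (object's vertical center at the origin, z pointing up), the starting azimuth of 0° on every ring, and the name `camera_positions` are assumptions made for illustration, since the text does not specify them.

```python
import math

# (elevation in degrees, number of cameras) for each ring, as described in the text.
RINGS = [(0.0, 16), (30.0, 12), (60.0, 8)]
RADIUS_M = 1.0

def camera_positions(rings=RINGS, radius=RADIUS_M):
    """Return a list of (x, y, z) camera centres in metres, object at the origin."""
    positions = []
    for elevation_deg, count in rings:
        elevation = math.radians(elevation_deg)
        for k in range(count):
            # Equal angular spacing on the ring; the 0-degree start is an assumption.
            azimuth = 2.0 * math.pi * k / count
            x = radius * math.cos(elevation) * math.cos(azimuth)
            y = radius * math.cos(elevation) * math.sin(azimuth)
            z = radius * math.sin(elevation)
            positions.append((x, y, z))
    return positions

positions = camera_positions()
print(len(positions))  # 16 + 12 + 8 camera positions
```

Each camera is assumed to look toward the origin; because the positions were fixed manually with floor markings and a tripod, the real poses will deviate slightly from these nominal values.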

3.1.3. Object Selection

The dataset includes pictures of 28 different objects. The object list was carefully crafted, including inanimate objects of various sizes, shapes, and surface textures. Object sizes were defined considering space availability and optical constraints. Object heights ranging from 15 to 60 cm were selected to meet the aforementioned conditions. The shapes of the selected objects vary from low complexity for those whose shapes are similar to basic primitives (cubes, spheres, prisms, etc.) to medium to high complexity for those whose shapes are more organic (human-like, articulated, and so on). For the textures, objects made of different materials were selected to evaluate the impact of those features on the final 3D reconstruction while keeping reflective surfaces to a minimum, considering that this could cause problems during reconstruction due to poor feature extraction. Figure 3 shows example captures for the selected objects.

3.1.4. Environmental Conditions

3D reconstruction in photogrammetry heavily relies on acquiring images where the objects’ surface features are visible so they can be properly detected, described, and identified. For this purpose, a suitable environment for object photography was selected, aiming to have standard light conditions across the whole image acquisition stage. The place selected for this task was a building interior space with artificial cold white light, where the background remains static for each of the objects’ image sets. Slight changes in background configuration (e.g., background objects’ location and position) between different object sets were allowed. To ensure a variable background setting, the objects were placed on 28 paper wraps of different colors, textures, and patterns, intentionally causing consistent changes in the background between different objects. Hence, we associate each object with a unique paper wrap.

3.2. Data Processing

This section provides a description of the raw data processing procedures required to generate the final products included in the dataset. First, the background removal methodologies are presented, encompassing both manual processing for ground truth generation and automated approaches based on SOD algorithms. Next, the procedures for image binarization and overlay generation are described. Finally, the 3D reconstruction process is outlined.

3.2.1. Background Removal

This subsection is divided into three parts: ground truth, SOD algorithms, and binarization. The database includes manually segmented ground truth for each of the original images to serve the purpose of evaluating SOD algorithms. Furthermore, automatically segmented versions of the original images are included. Each set of images was processed with three sub-symbolic SOD algorithms based on DL and a symbolic SOD algorithm based on the brain programming paradigm.
Ground Truth
Image ground truth is obtained through a manual process where a contour is selected to segregate the group of pixels that belong to the object of interest (foreground) from those that do not (background). This action is performed in Adobe Photoshop using the Pen Tool. To have clear visibility between pixels, a zoom of 400% on a 24-inch Full HD monitor is used for the contour. After contouring the object, the loop between the initial and final points is closed. The contoured closed loop groups the pixels that belong to the foreground, while the excluded pixels comprise the background. Finally, a layer mask is applied to the loop so that the regions that constitute the object of interest appear in white, while the remaining areas appear in black in the final image. For some objects, multiple contours might be needed depending on their shape. This process can take up to 15 min per image, requiring up to 9 h to process each object’s set of 36 images, depending on the object’s complexity. Figure 4 depicts the ground truth creation process, which essentially lies in defining which pixels belong to the object of interest and distinguishing them from those that constitute the background. As shown in Figure 4, there is a trade-off between pixel detail and context, which is essential for properly selecting regions. As the zoom increases, independent pixels can be accurately identified, but the context of the image regions is lost.
SOD Algorithms
This section compares a symbolic algorithm under the brain programming paradigm with three different sub-symbolic algorithms based on DL techniques.
GBVSBP is an evolutionary computation paradigm that mimics the inner workings of the visual cortex, ref. [9]. Specifically, to solve the SOD task, the dorsal stream functions are replicated to create computational models able to segregate regions of the image that belong to the object of interest from those that belong to the background. The strategy adheres to a goal-oriented paradigm wherein learning is conceptualized as a symbolic optimization, whereby an individual is characterized by a template that delineates the visual attention (VA) model while simultaneously identifying essential components of the algorithm via artificial evolution. We selected this method because it represents a symbolic approach to SOD.
Introduced in 2022, adversarial robust SOD networks with a learnable noise (LeNo) module consist of a shallow noise inspired by the VA mechanism embedded in the encoder, initialized with a cross-shaped Gaussian distribution, and a noise estimation affecting only a single channel of the decoder rather than adding more network elements for postprocessing, ref. [10]. This technique contributes to the SOD’s increased robustness by outperforming earlier research while also improving inference speed. We selected this network due to the authors’ claim that LeNo is robust to noise.
Proposed in 2022, SOD via the Extremely Downsampled Network (EDN) applies a strategy to learn a global view of the complete image, resulting in precise salient object localization, ref. [11]. A scale-correlated pyramid convolution is constructed to improve multi-level feature fusion and recover object details from the extreme downsampling, achieving state-of-the-art performance and real-time speed. We selected this network to test its repeatability against high-resolution images.
The Inverse Saliency Pyramid Reconstruction Network (InSPyReNet) was added to our work due to its continued competitive performance against more recent SOD models, offering an effective balance between state-of-the-art accuracy and computational efficiency under the resource constraints and testing methodology of this study. It was proposed in 2022 and introduced an image pyramid-based framework for high-resolution (HR) SOD without requiring HR training datasets, ref. [12]. The network produces a strict image pyramid structure of a saliency map that enables pyramid-based image blending, achieved with a dedicated method that synthesizes low-resolution (LR) and HR image pyramids to mitigate effective receptive field discrepancies during HR prediction. Multiple SOD metrics and boundary accuracy measure evaluations performed on public LR and HR SOD benchmarks demonstrated that InSPyReNet surpassed state-of-the-art methods.
To establish a common baseline, the aforementioned models were trained on the image database known as FT (frequency tuned). This database was initially introduced by Achanta et al. and is still used to study the SOD problem. FT’s ground truth is obtained by performing a manual object-contour segmentation over images of animate and inanimate objects, ref. [13]. We selected this database due to its similarity to our dataset, although we use high-resolution images of inanimate objects captured from different perspectives. This aspect is relevant since the information registered in our dataset contains projections of a three-dimensional scene, while all other datasets studied within SOD portray an image processing problem.
Binarization
This section explains the process needed to convert the standard output of an SOD algorithm, which is a grayscale image, to a binary output. Depending on the type of algorithm, the resulting image can have different characteristics. For the case of algorithms based on DL, which are unable to work with images at their original size, the input is first scaled to be fed to the neural networks. This unavoidably causes the aspect ratio to be lost and spatial distortion to be introduced. Naturally, the output is also affected by these transformations, resulting in the original salient object image being a square-scaled version of the expected output. At the end of the process, the algorithms based on DL rescale the output image to the size of the original input image, recovering the aspect ratio but introducing additional distortion as a result of this operation. Furthermore, when stretching the original output, some spatial properties of the image are lost since object edges change from regions where a radical transition between white and black pixels is observed to gradients with multiple shades between black and white.
The symbolic SOD algorithm based on the brain programming paradigm is able to work with images at their original size, and its output is a saliency map, which is a grayscale image where multiple shades of gray denote different attention levels. Thus, regions marked in black indicate the lowest attention level, while those marked in white indicate the highest attention level. Any value in between indicates an intermediate attention level, with brighter values representing higher attention and darker values representing lower attention.
Considering the nature of the output images for each SOD algorithm (saliency maps), an image binarization process is necessary to generate the corresponding proto-objects. Hence, the regions in the image that belong to the salient object appear in white, while those that belong to the background appear in black. To this end, a binarization level is selected to establish the threshold for setting a pixel to white or black in the final image. To establish a fair binarization level for each algorithm output, an optimal threshold is selected by finding the value that maximizes the overlap between the binarized saliency map and the manually segmented ground truth. Algorithm 1 explains how the optimal threshold is defined and used to generate the corresponding proto-object from an input saliency map.
Algorithm 1: Generate proto-object
Purpose: Create a proto-object by binarizing a saliency map with the optimal threshold.
Input:
  • saliency_map: Grayscale image to be binarized.
  • groundtruth: Manually segmented binary ground truth.
Functions:
  • Binarize (image, threshold): Returns the input image binarized by the input threshold.
  • FMeasure (image1, image2): Returns the F-Measure value from the two input binary images.
Variables:
  • threshold: Stores a value used for binarization.
  • optimal_threshold: Stores the best threshold.
  • f_measure: Stores an F-Measure result.
  • max_f_measure: Stores the maximum F-Measure.
  • binarized_image: Stores a binarized image.
Output:
  • proto_object: Proto-object generated by binarizing an input image by the optimal threshold.

f_measure ← 0
max_f_measure ← 0
optimal_threshold ← 0
for threshold ← 1 to 255 do
    binarized_image ← Binarize (saliency_map, threshold)
    f_measure ← FMeasure (binarized_image, groundtruth)
    if f_measure > max_f_measure then
        max_f_measure ← f_measure
        optimal_threshold ← threshold
    end if
end for
proto_object ← Binarize (saliency_map, optimal_threshold)
return proto_object
More formally, considering the ground truth image $G_n$, where $1 \le n \le 36$, and a proto-object $P_m$, where $0 \le m \le 255$, it is necessary to find the value that maximizes the overlap between $G_n$ and $P_m$. For this, we use the $F_\beta$ measure defined as follows:
$$F_\beta = \frac{(1 + \beta^2)\, p \cdot r}{\beta^2 p + r}$$
where $p$ is precision, $0 \le p \le 1$; $r$ is recall, $0 \le r \le 1$; and $\beta$ is the parameter that controls the balance between $p$ and $r$, $\beta \ge 0$. $F_\beta$ measures the effectiveness of the overlap considering $\beta^2 = 0.3$ to emphasize precision following the standard protocol [9]. Then, the overlap between $G_n$ and $P_m$ is given by the following relation, considering the pair $(G_n, P_m)$ and $m$ thresholds:
$$\underset{P}{\arg\max}\; F_\beta\left(G_{n=1}, P_m\right)$$
The argument $P_m$ that maximizes the function $F_\beta$ is then calculated. We repeat this process for the $n$ different ground truths corresponding to each image.
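The threshold search of Algorithm 1 and the $F_\beta$ definition can be sketched with NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the binarization convention (a pixel is white when its value is greater than or equal to the threshold) is an assumption, and the score is defined as 0 when there are no true positives.

```python
import numpy as np

BETA_SQ = 0.3  # beta^2 as stated in the text

def f_measure(binary, groundtruth, beta_sq=BETA_SQ):
    """F-beta between two boolean masks; returns 0.0 when there is no overlap."""
    tp = np.logical_and(binary, groundtruth).sum()
    if tp == 0:
        return 0.0
    precision = tp / binary.sum()
    recall = tp / groundtruth.sum()
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall))

def generate_proto_object(saliency_map, groundtruth):
    """Binarize an 8-bit saliency map at the threshold that maximizes F-beta."""
    gt = groundtruth > 0
    best_threshold, best_score = 0, -1.0
    for threshold in range(1, 256):
        # Assumed convention: pixels at or above the threshold become white.
        score = f_measure(saliency_map >= threshold, gt)
        if score > best_score:
            best_threshold, best_score = threshold, score
    proto_object = np.where(saliency_map >= best_threshold, 255, 0).astype(np.uint8)
    return proto_object, best_threshold, best_score

# Tiny synthetic example (not dataset values): two salient and two background pixels.
saliency = np.array([[10, 200], [220, 30]], dtype=np.uint8)
truth = np.array([[0, 255], [255, 0]], dtype=np.uint8)
proto, threshold, score = generate_proto_object(saliency, truth)
```

Because the comparison is strict, the lowest threshold that reaches the maximum score is kept when several thresholds tie.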
Figure 5 shows an example of an image processed with the four different SOD algorithms. There, a clear performance comparison between automatic segmentation techniques can be observed.

3.2.2. Overlay Generation

The primary purpose of the SOD3D dataset is to evaluate the impact of an SOD stage applied to 3D reconstruction. To achieve this, the proto-objects are then used to mask the original images acquired by the camera. As a result, we obtain an image in which all the regions belonging to the background are masked in black, and those that belong to the object of interest retain their original pixel values. These images are called overlays and are the ones to be fed to the photogrammetric pipeline for 3D reconstruction. Figure 6 shows an example of an overlay generated by masking an original image with a proto-object.
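The masking step described above amounts to an element-wise selection. A minimal NumPy sketch follows, assuming the original image is an H × W × 3 array and the proto-object is an H × W array in which non-zero pixels mark the object; the function name `make_overlay` is hypothetical.

```python
import numpy as np

def make_overlay(image, proto_object):
    """Keep pixels where the proto-object is white; set the rest to black.

    image: H x W x 3 uint8 array; proto_object: H x W array, non-zero = object.
    """
    if image.shape[:2] != proto_object.shape:
        raise ValueError("image and proto-object must have the same height and width")
    mask = proto_object > 0
    # Broadcast the 2D mask over the colour channels.
    return np.where(mask[..., None], image, 0).astype(image.dtype)

# Synthetic 2 x 2 example (not dataset values).
image = np.full((2, 2, 3), 100, dtype=np.uint8)
proto = np.array([[255, 0], [0, 255]], dtype=np.uint8)
overlay = make_overlay(image, proto)
```

Applying the same function with the manually segmented ground truth in place of the proto-object would yield the reference overlays stored in the GroundTruth folder under Overlay.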

3.2.3. 3D Reconstruction

The SOD3D dataset was developed to measure the impact of an SOD stage applied to 3D reconstruction. To this aim, the database includes 3D point clouds that represent the reconstructed objects using the traditional photogrammetry pipeline from each applicable set of images.
The fundamental principle of the photogrammetry pipeline relies on obtaining multiple views of the scene and then triangulating the locations between matching points to estimate their 3D coordinates. This process is incrementally extended to additional views with the aim of constructing a 3D point cloud. The default workflow in Meshroom was used to apply the photogrammetric pipeline over the input images to perform the 3D reconstruction. Based on this workflow, this pipeline can be divided into some critical stages, which are outlined below:
  • Feature extraction: This stage establishes the foundation for finding the relative pose of cameras in 3D space. Its primary goal is to extract distinctive groups of pixels that remain robust to viewpoint variations during image acquisition. Consequently, a scene feature should produce consistent descriptors across all captured images. The SIFT (Scale-Invariant Feature Transform) algorithm, ref. [14], is employed to extract keypoints and generate a set of image feature descriptors. Texture complexity may vary significantly both across different images and within local regions of the same image, causing the number of extracted features to fluctuate considerably. To address this issue, a post-filtering step is applied to limit the total number of features to a reasonable amount. Finally, grid filtering is used to ensure a homogeneous distribution of feature points throughout the image.
  • Image matching: The objective of this stage is to identify visual overlap between images, based on their shared feature content. To achieve this, image retrieval techniques are employed to identify images with overlapping content, avoiding the high computational cost of exhaustive feature matching. This is accomplished by representing each image as a compact descriptor, allowing for highly efficient distance evaluations between images. A widely used method for generating global image descriptors is the vocabulary tree approach, ref. [15]. Extracted image feature descriptors are propagated down the tree, hierarchically classified at each node, and assigned to a specific leaf node. Each descriptor is thus reduced to a simple leaf index, and the final image descriptor is then formed by collecting these leaf indices. Finally, the shared feature content between images can be evaluated by comparing these descriptors.
  • Feature matching: In this stage, features are matched between image pairs, which will subsequently provide the geometric foundation to reconstruct an initial 3D structure from the underlying 2D data. First, photometric matching between the descriptors of the two input images is performed. Each feature in image A is mapped to a list of candidate features from image B. This list is then refined under the assumption that only one valid match exists between the images. Specifically, for each descriptor in the first image, its two closest neighbors in the second image are identified, and a relative distance ratio threshold, ref. [16], is applied. This process yields a list of candidate matches based purely on photometric criteria. Since finding the two closest descriptors for each feature is computationally expensive using a brute-force approach, optimized algorithms are typically employed; while Approximate Nearest Neighbor is the most common, alternatives like Cascading Hashing are also widely used. Geometric filtering is performed using epipolar geometry within the random sample consensus (RANSAC) framework for outlier rejection, ref. [17]. By randomly selecting a small set of feature correspondences, the fundamental matrix is computed. Then the number of features that conform to this geometric model is evaluated and iterated through the RANSAC loop to find the optimal consensus set.
  • Structure from motion: This stage is the core of the photogrammetry pipeline. It analyzes the geometric relationships between the 2D input images to reconstruct a 3D model of the scene and simultaneously determine the pose and internal calibration of all cameras. The incremental pipeline is an iterative reconstruction process that begins with an initial two-view reconstruction and sequentially extends it by integrating new views. To initiate this, feature matches across image pairs are fused into tracks, where each track ideally represents a unique 3D point in space observed by multiple cameras. At this stage of the pipeline, the tracks still contain a significant number of outliers, so inconsistent tracks are filtered out during the fusion process. Next, the incremental pipeline must select the optimal initial image pair, which is crucial for a high-quality final reconstruction. This baseline pair must provide robust feature matches and guarantee reliable geometric constraints. Consequently, the selection process prioritizes pairs that maximize both the total match count and the spatial distribution of these features across the image planes. At the same time, the baseline must maintain a sufficiently wide angle between the camera views to ensure robust 3D triangulation. Then, the fundamental matrix between these two images is computed, setting the first camera as the origin of the 3D coordinate system. With the relative pose of the first two cameras established, the corresponding 2D feature matches are triangulated to generate the initial set of 3D points. Subsequently, the pipeline selects new images that share a sufficient number of correspondences with the existing 3D point cloud. This process is known as Next-Best-View selection. Using these 2D–3D associations, the algorithm performs camera resectioning to estimate the pose of each new view. This resectioning step employs a Perspective-n-Point algorithm embedded within a RANSAC framework to robustly determine the camera pose that yields the highest consensus of feature matches. A final non-linear minimization step is then applied to refine each camera pose. Following the estimation of these new camera poses, tracks that become visible across two or more resected views are triangulated into 3D points. Next, a global Bundle Adjustment is executed to optimize all parameters together, including the camera intrinsics, extrinsics, and 3D point positions. To maintain reconstruction accuracy, the Bundle Adjustment results are filtered by removing observations that exhibit a high reprojection error or an insufficient triangulation angle. The triangulation of these new points expands the set of candidate images available for subsequent Next-Best-View selection. This process is executed iteratively: integrating new camera views, triangulating newly observed 2D features into 3D points, and filtering out any 3D points that become geometrically invalidated. The loop continues until no remaining camera views can be localized [18].
  • Depth map estimation: In this stage, a depth value is estimated for each pixel of the input images. This is achieved by analyzing the similarities between neighboring cameras in volumetric regions around the intersection of their optical axes in 3D space. For each reference image, the N nearest neighboring cameras are selected. The fronto-parallel planes are selected based on the intersection of the reference optical axis with the pixels of these neighboring cameras. This plane-sweeping approach constructs a matching volume of dimensions W × H × Z, representing multiple depth candidates per pixel. The matching similarity for all depth candidates is then evaluated using zero-mean normalized cross-correlation, computed over a small patch from the reference image reprojected onto the neighboring views. This process generates a raw similarity volume, where photometric costs from each neighboring view are accumulated. Because this volume is inherently noisy, a spatial filtering step is applied along the X and Y axes to aggregate local costs, which effectively suppresses isolated high-similarity outliers. From this regularized volume, the optimal depth is determined by selecting the local minima and mapping the chosen plane indices to their corresponding continuous depth values to populate an initial depth map. Since this depth map is restricted to the discrete intervals of the sampled planes, it exhibits banding artifacts. To resolve this, a sub-pixel refinement step is applied to achieve continuous and precise depth estimates. Finally, a filtering step is applied to ensure consistency between multiple cameras [19].
  • Meshing: This stage merges all depth maps into a single dense point cloud and then establishes relationships between the points to create a surface. First, the individual depth maps are fused into a global octree structure, where geometrically compatible depth values are merged into unified octree cells. Then, a 3D Delaunay tetrahedralization is performed over the resulting point cloud. Next, a visibility-based voting procedure is executed to compute weights for both the tetrahedral cells and the facets connecting them [20]. A Graph Cut Max-Flow algorithm [21] is subsequently applied to solve for the optimal volumetric segmentation, where the minimum cut defines the extracted manifold mesh surface. After discarding noisy or isolated cells along the boundary, Laplacian filtering is applied to suppress localized mesh artifacts. Finally, a mesh simplification step is performed to drastically reduce the vertex count while preserving structural features.
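The photometric part of the feature matching stage can be illustrated with a minimal NumPy sketch of the distance ratio test. This is not Meshroom's implementation: the function name `ratio_test_matches` is hypothetical, the search is brute force rather than an Approximate Nearest Neighbor structure, and the default ratio of 0.8 is taken from the Distance Ratio value listed in Appendix B.

```python
import numpy as np


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Return (index_a, index_b) pairs that pass the distance ratio test.

    desc_a, desc_b: float arrays of shape (n_a, d) and (n_b, d), n_b >= 2.
    Hypothetical helper; brute-force search is used only for clarity.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    if len(desc_b) < 2:
        raise ValueError("the ratio test needs at least two candidates")

    # Pairwise Euclidean distances, shape (n_a, n_b).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    # Indices of the two closest descriptors in B for each descriptor in A.
    order = np.argsort(dist, axis=1)[:, :2]
    rows = np.arange(len(desc_a))
    best = dist[rows, order[:, 0]]
    second = dist[rows, order[:, 1]]

    # Keep a match only if the best candidate is clearly closer than the
    # second best; ambiguous features are discarded.
    keep = best < ratio * second
    return [(int(i), int(order[i, 0])) for i in np.flatnonzero(keep)]
```

The surviving pairs are only photometric candidates; in the pipeline described above they would still be passed to the RANSAC-based geometric filtering.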
The 3D point clouds included in the SOD3D dataset are generated from the original images and the overlays obtained by masking them with the ground truth and the different proto-objects derived from the SOD algorithms. In this manner, six different 3D reconstructions for each object are expected to be generated from each set of images: the original images, the ground truth overlay, the GBVSBP overlay, the LeNo overlay, the EDN overlay, and the InSPyReNet overlay. Appendix B provides details on the parameter configuration used at each stage of the photogrammetry pipeline in Meshroom to generate the corresponding point clouds.
Note that certain image sets lack sufficient identifiable features for full reconstruction; consequently, there is no one-to-one correspondence between image sets and 3D point clouds. Despite these limitations, the dataset includes 153 reconstructed 3D models that represent the full diversity of the objects. Figure 7 illustrates examples of 3D models derived from the dataset’s image sets via a traditional photogrammetry pipeline.
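The overlay generation mentioned above, in which an original image is masked by a ground truth or proto-object image, can be sketched as follows. This is an illustrative NumPy example under stated assumptions rather than the procedure used to build the dataset: the function name `make_overlay` is hypothetical, the mask is assumed to be a single-channel 8-bit image of the same size as the RGB image, it is binarized at an assumed threshold of 127, and background pixels are assumed to be set to black.

```python
import numpy as np


def make_overlay(image, mask, threshold=127):
    """Keep the pixels of `image` where `mask` is salient; zero the rest.

    image: uint8 array of shape (H, W, 3).
    mask:  uint8 array of shape (H, W); values above `threshold` are
           treated as foreground (assumed convention).
    """
    image = np.asarray(image)
    mask = np.asarray(mask)
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must have the same height and width")

    # Boolean foreground map, broadcast over the three color channels.
    keep = (mask > threshold)[:, :, None]
    return image * keep.astype(image.dtype)
```

Applying such a function to each of the 36 images of an object, once per mask source, would yield the image sets from which the different reconstructions are computed.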

4. Evaluation Metrics

This section outlines the proposed framework used to evaluate the quality of the 3D reconstructions generated by each method. It details the entire procedure, beginning with point cloud preprocessing to ensure a fair comparison and concluding with the specific metrics used for the assessment.
The 3D reconstructions generated by the methods described in the previous sections consist of points positioned and interconnected in 3D space. Assessing the quality of these point clouds requires a direct comparison against a known reference. For this purpose, we generated a 3D ground-truth point cloud by reconstructing a specialized set of image overlays, which were generated by masking the original images with the manually segmented ground truth.
To evaluate the geometric similarity between the reconstructed and reference point clouds, we measure the alignment of their 3D coordinates. The metric selected for this purpose is the Chamfer Distance (CD), which computes the average bidirectional nearest-neighbor distance between the two point sets. Due to its point-wise nature, CD is able to operate despite differences in point quantities between the clouds. Evaluating our 3D reconstructions via CD requires a preliminary alignment and registration step. Because CD does not inherently account for differences in pose—a condition present in our 3D reconstructions—the point clouds must share the same coordinate space and orientation to maximize point-to-point correspondence and guarantee a fair comparison. To achieve this, an initial orientation is calculated using Principal Component Analysis (PCA), a dimensionality reduction technique employed to identify the principal axes of the reconstructed point cloud for global coarse alignment. The alignment is then locally refined using the Iterative Closest Point (ICP) algorithm, which transforms the reconstructed point cloud to maximize correspondence with the target reference. Both PCA and ICP algorithms were implemented in Python 3.10.16, aided by the NumPy 1.26.3, Pandas 2.1.4, and Open3D 0.19.0 open-source libraries. It is important to note that a key limitation of the ICP algorithm is its high sensitivity to initial conditions; poor initial positioning of the source relative to the target point cloud can easily cause the algorithm to converge to a local minimum rather than the global optimum. To overcome this restriction and improve registration accuracy, we manually reoriented specific reconstructed point clouds prior to the automatic alignment steps. This manual operation was performed using CloudCompare 2.13.2, an open-source application. 
Once the reconstructed (source) and reference (target) point clouds reach their final registered configuration, CD is computed to assess their geometric similarity.
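The PCA-based coarse alignment can be sketched with NumPy alone. The code below is a minimal illustration and not the authors' implementation; the function names `principal_axes` and `pca_coarse_align` are hypothetical. It moves the source centroid onto the target centroid and rotates the source so that its principal axes coincide with those of the target.

```python
import numpy as np


def principal_axes(points):
    """Return the centroid and a right-handed 3x3 matrix whose columns are
    the principal axes, ordered by decreasing variance."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    # eigh returns eigenvalues in ascending order; reverse the columns.
    _, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, ::-1].copy()
    # Enforce a proper rotation (determinant +1) instead of a reflection.
    if np.linalg.det(axes) < 0:
        axes[:, 2] *= -1
    return centroid, axes


def pca_coarse_align(source, target):
    """Rigidly move `source` so its centroid and principal axes match `target`."""
    c_src, a_src = principal_axes(source)
    c_tgt, a_tgt = principal_axes(target)
    rotation = a_tgt @ a_src.T
    return (np.asarray(source, dtype=float) - c_src) @ rotation.T + c_tgt
```

Principal axes are defined only up to sign, so the result of such a step may still be flipped by 180° about one of the axes. This is consistent with the use of ICP for local refinement and with the manual reorientation of some point clouds described above.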
Table 2 summarizes the comparison between the proposed 3D ground truth and the reconstructed point clouds across all objects and techniques. The CD metric measures the average distance between the two point clouds; therefore, a lower value indicates higher similarity, while a higher value denotes greater discrepancy. Note that, for some objects, the CD of certain SOD algorithms is marked as “N/A” (not available). These gaps arise from the lack of identifiable features in certain image sets, primarily due to poor segmentation, which prevents successful 3D reconstruction; consequently, no metric can be reported.
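A brute-force NumPy sketch of CD is given below for illustration. It assumes one common variant of the metric: the mean nearest-neighbor distance from each cloud to the other, using unsquared Euclidean distances, with the two directional terms summed. Other conventions (squared distances, or averaging the two directions) change the scale of the values, and this section does not state which convention underlies Table 2. The function name `chamfer_distance` is hypothetical.

```python
import numpy as np


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer Distance between two point sets.

    cloud_a: array of shape (n, 3); cloud_b: array of shape (m, 3).
    The clouds may contain different numbers of points.
    Assumed variant: sum of the two directional mean nearest-neighbor
    (unsquared Euclidean) distances.
    """
    cloud_a = np.asarray(cloud_a, dtype=float)
    cloud_b = np.asarray(cloud_b, dtype=float)

    # Full pairwise distance matrix, shape (n, m). This needs O(n * m)
    # memory, so it is only suitable for small or subsampled clouds.
    diff = cloud_a[:, None, :] - cloud_b[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    a_to_b = dist.min(axis=1).mean()  # each point of A to its nearest in B
    b_to_a = dist.min(axis=0).mean()  # each point of B to its nearest in A
    return float(a_to_b + b_to_a)
```

For dense reconstructions, a spatial index such as a k-d tree would normally replace the full distance matrix. Identical clouds give a value of zero, and the value grows as the registered clouds diverge.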

5. User Notes

This section addresses the key limitations of our dataset, including the effects of arbitrarily defining the camera positions, the controlled environment, and the SOD model selection and training.
This dataset is focused on photogrammetric 3D reconstruction, whose primary goal is to recover 3D geometric features from a set of scene images. For this purpose, 36 images acquired from different camera positions are included in the dataset for each object to be reconstructed in 3D. The camera positions were arbitrarily defined, aiming to maximize the object’s surface coverage while keeping them standard across the whole dataset, despite the variety of object shapes. This represents a drawback for accurate 3D reconstruction since, according to ref. [22], carefully selecting these camera positions and attitudes minimizes errors in 3D geometric features, thereby enhancing 3D reconstruction quality. Future work can explore the correct camera network design according to each object’s morphology. Nevertheless, the creation of this dataset aims to study the impact of SOD on the 3D reconstruction; hence, this limitation does not void its value.
The original images were acquired in a controlled environment, where the illumination, distance, and camera positions were carefully selected. The dataset can be expanded to include pictures acquired under different conditions (e.g., other camera distances and positions, exterior environments, and natural light) to measure their impact on the final 3D reconstruction.
In addition to the manually segmented ground truth for each original image, this dataset includes automatically segmented versions. These automatically segmented versions of the images were created using symbolic and sub-symbolic SOD algorithms. The FT image database was used to train each SOD approach applied in our work to establish a baseline. Further testing can be performed by diversifying the SOD algorithms applied to create this dataset and by training them with different databases.

Author Contributions

Conceptualization, A.B.R., G.O. and E.C.; methodology, A.B.R., G.O. and E.C.; software, A.B.R. and M.O.; validation, A.B.R., G.O., E.C. and M.O.; formal analysis, A.B.R., G.O. and E.C.; investigation, A.B.R.; resources, G.O. and E.C.; data curation, A.B.R. and M.O.; writing—original draft preparation, A.B.R.; writing—review and editing, A.B.R., G.O., E.C. and M.O.; visualization, A.B.R.; supervision, G.O. and E.C.; project administration, G.O. and E.C.; funding acquisition, G.O. and E.C. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Centro de Investigación Científica y Educación Superior de Ensenada (CICESE), No. 31830 and Tecnológico Nacional de México/CENIDET, No. 24345.26-P.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this work is available at https://doi.org/10.17632/695w3dws5f.2 under a Creative Commons Attribution (CC BY 4.0) international license.

Acknowledgments

The authors dedicate this work to the memory of our friend, colleague and mentor, Gustavo Olague. His scientific vision, guidance, generosity, and passion for research profoundly influenced our professional and personal development. His legacy continues to inspire our work and will remain an enduring source of motivation for future generations of researchers.

Conflicts of Interest

Author Matthieu Olague was employed by the company IBM Technology, Campus Guadalajara. However, there is no financial or any other type of relationship between the company and this research work. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SOD | Salient Object Detection
DL | Deep Learning
HDR | High Dynamic Range
VA | Visual Attention
HR | High Resolution
LR | Low Resolution
CD | Chamfer Distance
PCA | Principal Component Analysis
ICP | Iterative Closest Point
SIFT | Scale-Invariant Feature Transform
RANSAC | Random Sample Consensus

Appendix A. Dataset Overview

This appendix provides an overview of the dataset. Table A1 includes the subject area, data format, how and where the data was collected, and how it can be accessed.
Table A1. Dataset specifications table.
Subject | Computer Sciences
Specific subject area | 3D Reconstruction, Photogrammetry, Image Segmentation. Computer Vision and Pattern Recognition, Applied Machine Learning, Computer Graphics, and Computer-Aided Design.
Type of data | Image (RGB JPG, Grayscale PNG, Binary PNG), 3D point cloud (CSV)
Data collection | This repository contains a total of 15,120 images and 153 reconstructed 3D models. The dataset is divided into 28 sets of high-resolution images of different everyday objects. Each set comprises 36 images of each object, acquired from various camera positions with the primary goal of reconstructing the object through photogrammetric techniques. The 1008 original images were taken with a Canon EOS 5D Mark III camera and a 24–105 mm variable lens, keeping optical parameters as fixed as possible to standardize image acquisition. To evaluate salient object detection (SOD) algorithms, the database includes manually segmented ground truth for each original image (Adobe Photoshop CS6 13.0). Furthermore, the repository contains automatically segmented versions of the images processed with three state-of-the-art deep learning (DL)-based SOD algorithms (Python 3.10.16) and an additional set of images processed with a symbolic SOD algorithm based on the brain programming paradigm (MATLAB R2017b). Finally, for 3D reconstructions, the database includes 3D point clouds for each object (Meshroom-2021.1.0).
Data source location | The FT database was recovered from [13]. The SOD3D database was collected from an office building located in Mexicali, Baja California, Mexico.
Data accessibility | Repository name: SOD3D: A salient object detection dataset for photogrammetric 3D reconstruction. Data identification number: 10.17632/695w3dws5f.2. Direct URL to data: https://data.mendeley.com/datasets/695w3dws5f/2 (accessed on 26 May 2026). Instructions for accessing these data: The dataset is available at Mendeley Data.

Appendix B. 3D Reconstruction Parameters

This appendix provides details on the parameter configuration used for generating the 3D point clouds from each set of images in Meshroom-2021.1.0. Table A2 includes the name of each stage in shaded cells and the name and value for each parameter.
Table A2. 3D reconstruction parameter configuration.
Camera Init
Sensor Database | cameraSensors.db
Default Field Of View | 45
Group Camera Fallback | Folder
Allowed Camera Models | Pinhole, radial1, radial3, brown, fisheye4, fisheye1
Apply internal white balance | ☑
ViewId Method | metadata
Verbose Level | info
Feature Extraction
Describer Types | sift
Describer Density | ultra
Describer Quality | ultra
Contrast Filtering | GridSort
Grid Filtering | ☑
Force CPU Extraction | ☑
Max Nb Threads | 0
Verbose Level | info
Image Matching
Method | VocabularyTree
Voc Tree: Tree | vlfeat_K80L3.SIFT.tree
Voc Tree: Minimal Number of Images | 200
Voc Tree: Max Descriptors | 500
Voc Tree: Nb Matches | 50
Max Nb Threads | 0
Verbose Level | info
Feature Matching
Photometric Matching Method | ANN_L2
Geometric Estimator | acransac
Geometric Filter Type | Fundamental_matrix
Distance Ratio | 0.8
Max Iteration | 2048
Geometric Validation Error | 0
Known Poses Geometric Error Max | 5
Max Matches | 0
Save Putative Matches | ☐
Cross Matching | ☐
Guided Matching | ☐
Match From Known Camera Poses | ☐
Export Debug Files | ☐
Verbose Level | info
Structure From Motion
Localizer Estimator | acransac
Observation Constraint | Basic
Localizer Max Ransac Iterations | 4096
Localizer Max Ransac Error | 0
Lock Scene Previously Reconstructed | ☐
Local Bundle Adjustment | ☑
LocalBA Graph Distance | 1
Maximum Number of Matches | 0
Minimum Number of Matches | 0
Min Input Track Length | 2
Min Observation For Triangulation | 2
Min Angle For Triangulation | 3
Min Angle For Landmark | 2
Max Reprojection Error | 4
Min Angle Initial Pair | 5
Max Angle Initial Pair | 40
Use Only Matches From Input Folder | ☐
Use Rig Constraint | ☑
Force Lock of All Intrinsic Camera Params. | ☐
Filter Track Forks | ☐
Inter File Extension | .abc
Verbose Level | info
Prepare Dense Scene
Output File Type | exr
Save Metadata | ☑
Save Matrices Text Files | ☐
Correct images exposure | ☐
Verbose Level | info
Depth Map
Downscale | 2
Min View Angle | 2
Max View Angle | 70
SGM: Nb Neighbor Cameras | 10
SGM: WSH | 4
SGM: GammaC | 5.5
SGM: GammaP | 8
Refine: Nb Neighbor Cameras | 6
Refine: Number of Samples | 150
Refine: Number of Depths | 31
Refine: Number of Iterations | 100
Refine: WSH | 3
Refine: Sigma | 15
Refine: GammaC | 15.5
Refine: GammaP | 8
Refine: Tc or Rc pixel size | ☐
Export Intermediate Results | ☐
Number of GPUs | 0
Verbose Level | info
Depth Map Filter
Min View Angle | 2
Max View Angle | 70
Number of Nearest Cameras | 10
Min Consistent Cameras | 3
Min Consistent Cameras Bad Similarity | 4
Filtering Size in Pixels | 0
Filtering Size in Pixels Bad Similarity | 0
Compute Normal Maps | ☐
Verbose Level | info
Meshing
Custom Bounding Box | ☐
Estimate Space From SfM | ☑
Min Observations For SfM Space Est. | 3
Min Observations Angle For SfM Space Est. | 10
Max Input Points | 50,000,000
Max Points | 5,000,000
Max Points Per Voxel | 1,000,000
Min Step | 2
Partitioning | singleBlock
Repartition | multiResolution
angleFactor | 15
simFactor | 15
pixSizeMarginInitCoef | 2
pixSizeMarginFinalCoef | 4
voteMarginFactor | 4
contributeMarginFactor | 2
simGaussianSizeInit | 10
simGaussianSize | 10
minAngleThreshold | 1
Refine Fuse | ☑
Helper Points Grid Size | 10
Densify | ☐
Nb Pixel Size Behind | 4
Full Weight | 1
Weakly Supported Surface Support | ☑
Add Landmarks To The Dense Point Cloud | ☐
Tetrahedron Neighbors Coherency Nb It. | 10
minSolidAngleRatio | 0.2
Nb Solid Angle Filtering Iterations | 2
Colorize Output | ☐
Add Mask Helper Points | ☐
Helper Points: Mask Segment Size | 50
Save Raw Dense Point Cloud | ☐
Export DEBUG Tetrahedralization | ☐
Seed | 0
Verbose Level | info
Mesh Filtering
Keep Only the Largest Mesh | ☐
Smoothing Subset | all
Smoothing Boundaries Neighbors | 0
Smoothing Iterations | 5
Smoothing Lambda | 1
Filtering Subset | all
Filtering Iterations | 1
Filter Large Triangles Factor | 60
Filter Triangles Ratio | 0
Verbose Level | info
Texturing
Texture Side | 8192
Texture Downscale | 2
Texture File Type | png
Unwrap Method | Basic
Use UDIM | ☑
Fill Holes | ☐
Padding | 5
MultiBand Downscale | 4
MultiBand contributions High Freq | 1
MultiBand contributions Mid-High Freq | 5
MultiBand contributions Mid-Low Freq | 10
MultiBand contributions Low Freq | 0
Use Score | ☑
Best Score Threshold | 0.1
Angle Hard Threshold | 90
Process Colorspace | sRGB
Correct Exposure | ☐
Force Visible By All Vertices | ☐
Flip Normals | ☐
Visibility Remapping Method | PullPush
Subdivision Target Ratio | 0.8
Verbose Level | info
1 ☑ Active parameter. 2 ☐ Inactive parameter.

References

  1. Griwodz, C.; Gasparini, S.; Calvet, L.; Gurdjos, P.; Castan, F.; Maujean, B.; De Lillo, G.; Lanthony, Y. AliceVision Meshroom: An open-source 3D reconstruction pipeline. In Proceedings of the 12th ACM Multimedia Systems Conference, Istanbul, Turkey, 28 September–1 October 2021.
  2. Yan, Q.; Xu, L.; Shi, J.; Jia, J. Hierarchical Saliency Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
  3. Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
  4. Wang, L.; Huchuan, L.; Wang, Y.; Mengyang, F.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-Level Supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017.
  5. Fan, P.; Zhang, J.; Xu, G.; Cheng, M.; Shao, L. Salient Objects in Clutter. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2344–2366.
  6. Reizenstein, J.; Shapovalov, R.; Henzler, P.; Sbordone, L.; Labatut, P.; Novotny, D.; Ruan, X. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
  7. Gabara, G.; Sawicki, P. CRBeDaSet: A Benchmark Dataset for High Accuracy Close Range 3D Object Reconstruction. Remote Sens. 2023, 15, 1116.
  8. Chierchia, R.; Lebrat, L.; Ahmedt-Aristizabal, D.; Salvado, O.; Fookes, C.; Cruz, R. SALVE: A 3D Reconstruction Benchmark of Wounds from Consumer-Grade Videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025.
  9. Olague, G.; Menendez-Clavijo, J.A.; Olague, M.; Ocampo, A.; Ibarra-Vazquez, G.; Ochoa, R.; Pineda, R. Automated design of salient object detection algorithms with brain programming. Appl. Sci. 2022, 12, 10686.
  10. Wang, H.; Wan, L.; Tang, H. LeNo: Adversarial robust salient object detection networks with learnable noise. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
  11. Wu, Y.; Liu, Y.; Zhang, L.; Cheng, M.; Ren, B. EDN: Salient object detection via extremely-downsampled network. IEEE Trans. Image Process. 2022, 31, 3125–3136.
  12. Taehun, K.; Kunhee, K.; Joonyeong, L.; Dongmin, C.; Jiho, L.; Daijin, K. Revisiting Image Pyramid Structure for High Resolution Salient Object Detection. In Proceedings of the Sixteenth Asian Conference on Computer Vision, Macao, China, 4–8 December 2022.
  13. Achanta, R.; Hemami, S.; Estrada, F.; Susstrunk, S. Frequency-tuned salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
  14. Lowe, D. Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999.
  15. Nister, D.; Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006.
  16. Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  17. Fischler, M.; Bolles, R. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
  18. Hartley, R.; Zisserman, A. 3D Reconstruction of Cameras and Structure. In Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2003; pp. 262–278.
  19. Hirschmuller, H. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005.
  20. Jancosek, M.; Pajdla, T. Multi-view reconstruction preserving weakly-supported surfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011.
  21. Boykov, Y.; Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1124–1137.
  22. Olague, G.; Mohr, R. Optimal camera placement for accurate reconstruction. Pattern Recognit. 2002, 35, 927–944.
Figure 1. Repository organization and file naming conventions for the SOD3D dataset.
Figure 2. Proposed camera network configuration: side (A) and top (B) views.
Figure 3. Example captures from the SOD3D dataset.
Figure 4. Manual ground truth generation example.
Figure 5. Comparison between saliency maps and proto-objects.
Figure 6. When an original image (A) is masked by its proto-object (B), an image overlay (C) is generated.
Figure 7. Examples of 3D reconstructed objects.
Table 1. File naming conventions and formats.
File | File Naming Convention | File Format
Original capture | Object_Orig_FileName.jpg | RGB JPG image (5760 × 3840 px)
Ground truth | Object_GT_FileName.png | Binary PNG image (5760 × 3840 px)
Saliency map | Object_SM_FileName.png | Grayscale PNG image (variable)
Proto-object | Object_PO_FileName.png | Grayscale PNG image (5760 × 3840 px)
Overlay | Object_OL_FileName.jpg | RGB JPG image (5760 × 3840 px)
Reconstruction | Object_3D_FileName.csv | 3D coordinates CSV
Table 2. Point cloud quality evaluation.
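Object | Algorithm | Chamfer Distance
Spidey | EDN | 0.12493001
Spidey | LeNo | 0.10088636
Spidey | InSPyReNet | N/A *
Spidey | GBVSBP | 0.21061425
Briefcase | EDN | 0.1155459
Briefcase | LeNo | 0.08756472
Briefcase | InSPyReNet | 0.09247724
Briefcase | GBVSBP | 0.04844216
Clock | EDN | N/A *
Clock | LeNo | N/A *
Clock | InSPyReNet | N/A *
Clock | GBVSBP | 0.07456226
Kitty | EDN | 0.25639119
Kitty | LeNo | 0.25939732
Kitty | InSPyReNet | N/A *
Kitty | GBVSBP | 0.24165641
LunchBox | EDN | 0.03899479
LunchBox | LeNo | 0.40911003
LunchBox | InSPyReNet | 0.42182701
LunchBox | GBVSBP | 0.19837973
PowerSupply | EDN | 0.02840842
PowerSupply | LeNo | 0.08677601
PowerSupply | InSPyReNet | 0.20573723
PowerSupply | GBVSBP | 0.02240334
Computer | EDN | 0.10303119
Computer | LeNo | 0.10168622
Computer | InSPyReNet | N/A *
Computer | GBVSBP | 0.11541542
Scope | EDN | 0.06051309
Scope | LeNo | 0.30093739
Scope | InSPyReNet | 0.28118946
Scope | GBVSBP | 0.08124034
SoapBox | EDN | 0.53130419
SoapBox | LeNo | 0.48948671
SoapBox | InSPyReNet | 0.42890567
SoapBox | GBVSBP | 0.25665115
CoffeeCreamer | EDN | 0.39874113
CoffeeCreamer | LeNo | 0.04602422
CoffeeCreamer | InSPyReNet | 0.08920528
CoffeeCreamer | GBVSBP | 0.05905232
WineBox | EDN | 0.36381791
WineBox | LeNo | 0.23733872
WineBox | InSPyReNet | 0.22208681
WineBox | GBVSBP | 0.01867313
Helmet | EDN | 0.21555422
Helmet | LeNo | 0.13328942
Helmet | InSPyReNet | N/A *
Helmet | GBVSBP | 0.27745244
CookieBox | EDN | 0.04638342
CookieBox | LeNo | 0.05956832
CookieBox | InSPyReNet | 0.04750483
CookieBox | GBVSBP | 0.00994897
FlowerJar | EDN | N/A *
FlowerJar | LeNo | N/A *
FlowerJar | InSPyReNet | 0.14024730
FlowerJar | GBVSBP | 0.18039248
GasTank | EDN | 0.37510784
GasTank | LeNo | 0.10899083
GasTank | InSPyReNet | 0.22762135
GasTank | GBVSBP | 0.04482456
GreenBook | EDN | 0.12753811
GreenBook | LeNo | 0.04932267
GreenBook | InSPyReNet | 0.05293706
GreenBook | GBVSBP | 0.02629686
RinseBottle | EDN | 0.06945133
RinseBottle | LeNo | 0.26319021
RinseBottle | InSPyReNet | 0.11594654
RinseBottle | GBVSBP | 0.07172215
DasBoot | EDN | 0.25387577
DasBoot | LeNo | 0.2114202
DasBoot | InSPyReNet | 0.25797751
DasBoot | GBVSBP | 0.03501514
Extinguisher | EDN | 0.20903451
Extinguisher | LeNo | 0.14175203
Extinguisher | InSPyReNet | 0.17548323
Extinguisher | GBVSBP | 0.04535888
CardboardBox | EDN | N/A *
CardboardBox | LeNo | 0.32436333
CardboardBox | InSPyReNet | 0.3766473
CardboardBox | GBVSBP | N/A *
WoodenBox | EDN | 0.58540699
WoodenBox | LeNo | 0.38827273
WoodenBox | InSPyReNet | 0.49671172
WoodenBox | GBVSBP | 0.08235115
BackPack | EDN | N/A *
BackPack | LeNo | 0.15253459
BackPack | InSPyReNet | 0.15465638
BackPack | GBVSBP | N/A *
PaperRoll | EDN | 0.55478108
PaperRoll | LeNo | 0.36925125
PaperRoll | InSPyReNet | 0.45234128
PaperRoll | GBVSBP | 0.35134591
WhiteDog | EDN | 0.1720007
WhiteDog | LeNo | 0.28966198
WhiteDog | InSPyReNet | 0.18140550
WhiteDog | GBVSBP | 0.40507388
Igloo | EDN | 0.30896307
Igloo | LeNo | 0.26223192
Igloo | InSPyReNet | 0.25463686
Igloo | GBVSBP | 0.22047119
Alcohol | EDN | 0.17315696
Alcohol | LeNo | 0.17300348
Alcohol | InSPyReNet | 0.06156967
Alcohol | GBVSBP | 0.31946609
GymBall | EDN | 0.23029317
GymBall | LeNo | N/A *
GymBall | InSPyReNet | N/A *
GymBall | GBVSBP | 0.13283110
Skull | EDN | 0.16169256
Skull | LeNo | 0.27576345
Skull | InSPyReNet | 0.16997893
Skull | GBVSBP | 0.14140444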
* Point cloud not available due to unfeasible 3D reconstruction.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Barrera Román, A.; Olague, G.; Clemente, E.; Olague, M. SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction. Data 2026, 11, 157. https://doi.org/10.3390/data11070157

AMA Style

Barrera Román A, Olague G, Clemente E, Olague M. SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction. Data. 2026; 11(7):157. https://doi.org/10.3390/data11070157

Chicago/Turabian Style

Barrera Román, Aarón, Gustavo Olague, Eddie Clemente, and Matthieu Olague. 2026. "SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction" Data 11, no. 7: 157. https://doi.org/10.3390/data11070157

APA Style

Barrera Román, A., Olague, G., Clemente, E., & Olague, M. (2026). SOD3D: A Salient Object Detection Dataset for Photogrammetric 3D Reconstruction. Data, 11(7), 157. https://doi.org/10.3390/data11070157
