Article

Effects of Biases in Geometric and Physics-Based Imaging Attributes on Classification Performance

Department of Electrical Engineering and Computer Science, York University, Toronto, ON M3J 1P3, Canada
*
Author to whom correspondence should be addressed.
J. Imaging 2025, 11(10), 333; https://doi.org/10.3390/jimaging11100333
Submission received: 13 August 2025 / Revised: 8 September 2025 / Accepted: 18 September 2025 / Published: 25 September 2025
(This article belongs to the Section Computer Vision and Pattern Recognition)

Abstract

Learned systems in the domain of visual recognition and cognition impress in part because, even though they are trained with datasets many orders of magnitude smaller than the full population of possible images, they exhibit sufficient generalization to be applicable to new and previously unseen data. Since training datasets typically represent such a small sampling of any domain, the possibility of bias in their composition is very real. But what are the limits of generalization given such bias, and up to what point might it be sufficient for a real problem task? There are many types of bias, as will be seen, but we focus on only one: selection bias. In vision, image contents are dependent on the physics and geometry of the imaging process and not only on scene contents. How do biases in these factors—that is, non-uniform sample collection across the spectrum of imaging possibilities—affect learning? We address this in two ways. The first is theoretical, in the tradition of the Thought Experiment. The point is to use a simple theoretical tool to probe the bias of data collection and to highlight deficiencies that might then deserve extra attention either in data collection or system development. Those theoretical results are then used to motivate practical tests on a new dataset using several existing top classifiers. We report that, both theoretically and empirically, there are some selection biases rooted in the physics and imaging geometry of vision that challenge current methods of classification.

1. Introduction

Two observations motivate this presentation. First, there seems to be insufficient effort in the machine learning field to routinely validate training data, in the sense of ensuring that all important data variations are sufficiently captured statistically and without bias. This deficiency has famously manifested itself in a variety of learned systems that make spectacular errors widely publicized in international media. Second, there is the belief that a system’s generalization properties can be ameliorated via manipulations of the given training data, of which there is a huge variety, from simple to very sophisticated [1]. Many claim that these suffice to remedy any problems arising from data biases, yet those same problems seem to persist, because such generalizations cannot deal even with tiny adversarial manipulations (e.g., Xu et al. [2]). We have two goals for this paper. First, we wish to begin a discussion of these points and their implications, especially for practical systems where amelioration on its own is not a sufficient goal because there are engineering specifications to be met. Second, we will suggest a simple idea for determining which gaps in training data sampling might have an important impact on system performance.
It is self-evident that how an artificial agent of any kind perceives its world is critical to how it reacts to events in its world. Within perception, we focus on vision, the primary perceptual source for humans and a capability that requires roughly half of the cerebral cortex to accomplish, thus being a major contributor to an agent’s overall intelligence. Vision is also a major perceptual source for robots and other artificial agents. We are also concerned that the explosive development of training sets for learned systems is insufficiently rooted in sound statistical science, particularly with respect to bias, which, for datasets that are tiny relative to the full population, is likely a much larger problem than is being acknowledged. In other words, the current empirical methodologies seen in research and applications might benefit from improvement of their data sampling components.
It is certainly true that in vision, no current training set fully captures all dimensions of the representation of visual information. This may lead to selection bias, which is observed when the selection of data samples leads to a result that is different from what would have been achieved had the entire target population been considered. We conduct a thought experiment on a real domain of visual objects that we can fully characterize and look at specific gaps in training data and their impact on qualitative performance, as well as their impact on the engineering specifications required for acceptable performance in the domain. Our thought experiment points to three conclusions: first, generalization behaviour is determined by how well the particular dimensions of the domain are represented during training; second, the utility of generalization is dependent on the acceptable system error; and third, specific visual features of objects, such as pose orientations out of the imaging plane or colours, may not be recoverable if not represented sufficiently during training. This latter point emphasizes that the basic geometry of the imaging viewpoint and the physics of optical imaging cannot be overcome if not sufficiently represented in training. Any currently observed generalization in modern deep learning networks may be more the result of coincidental alignments than a true property of the network or its training regime, and its utility needs to be confirmed with respect to a system’s performance specification. Whereas empirical confirmation may be the best method for such verification, it is a time- and resource-expensive activity. Our Thought Experiment Probe approach can be very informative as a first step towards understanding the impact of biases.
This work is presented in two parts. The first states the problem and poses a thought experiment that leads to a particular conclusion about generalization and data bias. The second part tests the theoretical predictions and shows how the theoretical and empirical results agree. The paper concludes with a general discussion on the nature of data bias and suggests some simple measures to deal with the problem. We begin with Part I.
  • PART I

1.1. Current Thinking About Generalization in Learned Systems

Generalizability refers to the performance difference of a model when evaluated on previously seen data (training data) versus data it has never seen before (testing data). Models with poor generalizability have overfitted to the training data. A deep learning image classifier is a mathematical function that maps images to classes. Many such classifiers have been applied productively in a variety of domains. A major reason for their success has been that they do not need to be trained with the full domain of the function in order to exhibit strong classification performance for previously unseen input. The reasons underlying this have been studied for some time yet are not well understood. Bartlett & Maass [3] write that the VC-dimension of a neural net with binary output measures its expressiveness. The derivation of bounds for the VC-dimension for both binary and real-valued neural nets has been challenging. Bounds for the VC-dimension of a neural net provide estimates for the number of random examples that are needed to train a network so that it has good generalization properties (i.e., so that the error of the network on new examples from the same distribution is at most ε, with probability 1 − σ). For a single application these bounds tend to be too large, since they provide a generalization guarantee for any probability distribution on the examples and for any training algorithm that minimizes disagreement on the training examples. This might seem to be what we seek, but this is not the case. As will become apparent, we are interested in the cases where the training and test distributions differ in their specific detail.
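To give a sense of the orders of magnitude involved, the following is a minimal sketch of a distribution-free PAC/VC-style sample-size estimate of the form m ≈ c(d + ln(1/δ))/ε; the functional form is the textbook one, but the constant c and the chosen values of d, ε, and δ are purely illustrative assumptions, not figures from Bartlett & Maass [3]:

```python
import math

def vc_sample_estimate(d, eps, delta, c=8.0):
    """Rough PAC-style sample-size estimate m ~ c * (d + ln(1/delta)) / eps.
    The constant c and the functional form are illustrative only."""
    return math.ceil(c * (d + math.log(1.0 / delta)) / eps)

# Hypothetical numbers: a network with VC-dimension 10^6, target error 1%,
# confidence 95%.  The estimate is distribution-free and therefore very loose.
print(vc_sample_estimate(d=1_000_000, eps=0.01, delta=0.05))  # ~8 * 10^8 samples
```

Even with these generous choices, the estimate runs to hundreds of millions of samples, which is one reason such bounds are considered too large for any single application.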
There are many who have written about network generalization, and some recent examples follow. Among the most cited are works by Zhang et al. [4], Keskar et al. [5], Neyshabur et al. [6], Kawaguchi et al. [7], Wu & Zhu [8], Novak et al. [9], and most recently Fetaya et al. [10] and Wang et al. [11]. Zhang et al. [12] presented a simple experimental framework for defining and understanding a notion of effective capacity of machine learning models and showed that the effective capacity of several successful architectures is large enough to memorize the training data. They also distinguish optimization from generalization, arguing that formal measures to quantify generalization are still missing. Neyshabur et al. [6] discuss different candidate complexity measures that might explain generalization in neural networks. They reach no specific conclusions, however, showing how much is unresolved. Fetaya et al. [10] reveal that it is impossible to guarantee detectability of adversarially perturbed inputs even for near-optimal generative classifiers. Experimentally, they were able to train robust models for MNIST, but robustness completely breaks down on CIFAR10. They suggest that likelihood-based conditional generative models are ineffective for robust classification. Wang et al. [11] conduct experiments showing that CNNs leverage the high-frequency components of images, which results in trading robustness for accuracy.
Wang & Ma [13] and Galanti et al. [14] explore the theoretical bounds of generalization in neural networks. Wang & Ma consider the generalization error bounds of a range of networks including CNNs trained by the Stochastic Gradient Descent (SGD) method and derive a bound dependent on the depth of the network, the number of samples, and the cumulative loss. Galanti et al.’s results imply that acceptable generalization bounds can be achieved on sparse networks without using weight sharing. In both studies, the dataset is assumed to contain i.i.d. samples drawn from a distribution, and no bias is assumed.
Zhao et al. [15] examine generalization in deep reinforcement learning and focus on three issues. First, they look at generalization from a deterministic training environment to a noisy and uncertain testing environment. Second, assuming one correctly models environmental variability in a training simulation, there is the question of whether an agent learns to generalize to future conditions drawn from the same distribution or overfits to its specific training experiences. Third, there is the effect on training due to the impossibility of predicting and accurately modelling the environmental conditions and variability an agent might encounter in the real world (i.e., a predictive model should be robust to, for example, camera type, initial agent state, etc.). They show that standard algorithms and architectures generalize poorly in the face of both noise and environmental shift. They propose a new set of generalization metrics as a way of assisting with these problems. It should be pointed out that there is an implicit assumption in their paper, common in most similar works, that the domain of interest can be effectively characterized and parameterized in the first place. That is not possible in general (see also studies by Barbu et al. [16] and Afifi & Brown [17] for additional evidence). Others, such as Senhaji et al. [18], consider multi-task learning and multi-domain learning as a way of providing behaviour that goes beyond a single task or domain while also minimizing the number of needed parameters. However, like all others, they make the implicit assumptions that a given domain can be known with respect to its parameters and that training sets adequately cover that set of parameters. Neither assumption is always justified.
Bengio & Gingras [19] consider the situation where only some of the input variables are given for each particular training case, with the missing variables differing from case to case, and suggest that the simplest way to deal with this missing data problem is to replace the missing values by their unconditional mean. More recently, Van Buuren [20] details many methods for imputing missing data and their pitfalls. Lin & Tsai [21] advocate for the development of novel hybrid approaches combining statistical and machine learning based techniques, the consideration of three evaluation metrics together, and missing data simulation for both training and testing datasets. Śmieja et al. [22] propose that missing data can be handled by replacing a neuron’s response in the first hidden layer by its expected value, thus proposing a neural/unit-level embodiment of Bengio & Gingras’s [19] proposal of using the unconditional mean. Chati et al. [23] fill gaps in a sparsely sampled dataset by fitting curves to the data and interpolating. Zhou et al. [24] propose MixStyle, a method for mixing the statistics of samples from different domains in early layers of CNNs to increase the robustness of neural networks to new out-of-distribution samples.
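As a concrete illustration of the simplest of these strategies, unconditional-mean imputation, here is a minimal sketch (our own toy example with hypothetical data; it is not code from any of the cited works):

```python
import numpy as np

# Toy design matrix with missing entries encoded as NaN (hypothetical data).
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0]])

# Unconditional mean of each feature, computed over the observed entries only.
col_means = np.nanmean(X, axis=0)

# Replace each missing value with the mean of its feature (column).
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
print(X)
```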
Garcia et al. [25] review methods for pattern classification when there is missing data. They group previous research into four categories: (1) methods that delete incomplete cases and develop classifiers using only the complete data portion; (2) methods that estimate missing data and learn classifiers using the edited set, i.e., the complete data portion and incomplete patterns with inferred values; (3) methods that use model-based procedures, where the data distribution is procedurally modelled, e.g., by an expectation–maximization (EM) algorithm; and (4) methods where missing values are incorporated into the classifier. All of these seem to presuppose that one knows in advance that a feature or datum is missing, which seems an unreasonable assumption in general. Polikar et al. [26] claim to address the missing feature problem but admit that they draw an equivalence between missing data and missing features. They assume that the complete set of variables in a feature vector is known but that data samples might have missing values for some.
Performance and generalization capacity of deep learning methods have also been studied under out-of-distribution (OOD) samples. A number of papers search for theoretical explanations of neural networks’ behaviour in these cases. Fridovich-Keil et al. [27] test the robustness of a network by measuring the smoothness of the model’s predictions while perturbing a test image’s Fourier amplitude. Simsek et al. [28] introduce a confusion score as a measure of image difficulty; to calculate it, they quantify the discrepancies among an ensemble of networks. Other studies demonstrate performance drops and accuracy gaps by introducing new datasets for this purpose. Hendrycks et al. [29] curate four different OOD datasets as a benchmark of deep networks’ generalization to OOD data. These datasets target specific domain shifts in data, e.g., blurry images or different renditions such as paintings and embroideries of natural images (from the Real Blurry Images dataset and ImageNet-R, respectively). Hendrycks et al. [30] introduce new datasets, called ImageNet-A and ImageNet-O. The former is developed by gathering challenging images related to ImageNet [31] classes that fool ImageNet classifiers, and the latter is a collection of images that fool ImageNet OOD detectors. For filtering out the OOD samples, images from ImageNet-22K [32] are classified by a ResNet-50 [33] trained on the ImageNet-1K dataset; images that are classified as an ImageNet-1K class with high confidence are chosen for the dataset. Zhang et al. [12] categorize distribution shifts as (1) background shift, (2) corruption shift, meaning naturally occurring noise and corruptions in the image, (3) texture shift, and (4) style shift. They use different sets of datasets specifically representing each of these domain shifts (e.g., ImageNet-C [29]) and investigate their effects on vision transformers [34]. Finally, Alcorn et al. [35] test neural networks’ resilience to unfamiliar poses: they generate synthetic images posing easily classifiable objects at unconventional angles, and the networks misclassify these images significantly more often than conventional images.
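For illustration, here is a minimal sketch of a Fourier-amplitude perturbation of the kind attributed to Fridovich-Keil et al. [27] above; this is our own simplification, not their exact protocol, and the image, noise scale, and random seed are arbitrary assumptions:

```python
import numpy as np

def perturb_fourier_amplitude(img, noise_scale=0.05, rng=None):
    """Perturb the Fourier amplitude of a grayscale image while keeping phase.

    A simplified stand-in for the probe described in the text: a robust model
    should change its prediction smoothly as noise_scale grows.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    F = np.fft.fft2(img)
    amp, phase = np.abs(F), np.angle(F)
    amp_noisy = amp * (1.0 + noise_scale * rng.standard_normal(amp.shape))
    return np.real(np.fft.ifft2(amp_noisy * np.exp(1j * phase)))

img = np.random.default_rng(1).random((64, 64))   # stand-in for a test image
perturbed = perturb_fourier_amplitude(img, noise_scale=0.05)
# In an actual experiment one would compare model(img) with model(perturbed)
# across a range of noise scales.
```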
There have also been studies which unveil underlying biases in datasets and demonstrate their effect on the performance of models used in real-life applications such as medical imaging, autonomous driving, and face recognition, to name a few. Medical imaging may have the most studies of dataset bias, as shown by Zong et al. [36], Larrazabal et al. [37], Drenkow et al. [38], Chen et al. [39], and Seyyed-Kalantari et al. [40]. The studies are numerous, and an in-depth representation of research conducted in this field requires a survey of its own. A few examples follow. Zong et al. [36] and Drenkow et al. [38] document the underlying bias across different medical datasets and demonstrate its effects on the performance of deep models. Seyyed-Kalantari et al. [40] show the biases in chest x-ray datasets with respect to patients’ sex, age, and race. Chen et al. [39] and Larrazabal et al. [37] analyse the fairness metrics of representative machine learning methods on prevalent medical datasets and conclude that biases in medical datasets are a primary cause of model underperformance for groups under-represented in those datasets.
Other applications of machine learning also suffer from bias in training data. Li et al. [41] find that biases in pedestrian datasets have grave consequences: for pedestrian detection models, the proportion of undetected children is around 20% higher than that of adults. Similarly, Llorca et al. [42] find that children, wheelchair users, and personal mobility vehicle users are extremely under-represented. Buolamwini et al. [43] demonstrate that the skin-tone distribution in face recognition datasets is largely unbalanced. Kortylewski et al. [44,45] demonstrate the significant effects of these biases on model performance.
Although dataset bias has been extensively studied, to the best of our knowledge, no other studies have investigated this issue from a geometric and physics-based perspective. Moreover, the study methods we have encountered are application-specific or test model performance under OOD test sets; none specifically devise controlled biases to systematically measure their effect in different scenarios for the same objects.

1.2. The Issue of Bias in Data

Issues regarding bias in data are common in any discipline that relies on statistical information based on observational data (e.g., genetics [46]; epidemiology [47]). In epidemiology, bias has been defined as any systematic error in the design, conduct, or analysis of a study that leads to an erroneous estimate of an exposure’s effect on disease. Such bias reflects systematic errors in research methodology. The types of bias include selection, detection, observation, misclassification, and recall. The authors of that literature highlight that bias is a complex and well-studied topic (see the classic paper by Sackett [48] or the more recent one by Grimes & Schulz [49], among many others). These works mostly examine statistical issues in medicine, and care is needed when relating them to machine learning, computer vision, or AI. Nevertheless, much is clearly instructive, since that community has struggled with statistical bias for decades in the context of life-and-death decision-making. In other words, there is a very real cost when error is due to bias, a cost that much of current computer vision and AI research, especially in the published literature, may not fully consider.
Sackett’s highly cited 1979 paper provides a catalogue of biases. Bias may be introduced at any stage of a research enterprise. He gives the following high level types (in his words):
i. In reading-up on the field (5);
ii. In specifying and selecting the study sample (22);
iii. In executing the experimental manoeuvre (or exposure) (5);
iv. In measuring exposures and outcomes (13);
v. In analysing the data (5);
vi. In interpreting the analysis (5);
vii. In publishing the results.
For six of these seven categories, he gives sub-classes, with their number appearing in parentheses after each item in the above list. In sum, he points to a total of 56 types of bias that might impact a research effort. The use of the term ‘exposure’ in the above is appropriate for medicine (as in exposure to a virus, or to a drug). In computer vision or AI, the ‘experimental manoeuvre’ corresponds to the attempt by the computer system to solve or perform a task.
Throughout the history of AI research, including all of its sub-fields, has this level of analysis with respect to bias ever been considered? Not really. Should it be now? Yes, especially as applications are beginning to touch safety-sensitive domains. In the current presentation, we focus primarily on the second item in Sackett’s list: bias in specifying and setting up the study sample. Even within this category he lists 22 sub-classes. It is not germane to our argument to list them all here, and the interested reader is encouraged to seek out the original paper. It is very useful, however, to give a few for which it is more straightforward to cast their characteristics from medicine into the computer vision or AI domain. Here are 5 of the 22, each given in his words and followed by a possible interpretation relevant for AI:
  • Popularity bias: The admission of patients to some practices, institutions or procedures (surgery, autopsy) is influenced by the interest stirred up by the presenting condition and its possible causes.
    In other words, if many researchers, companies, or governments are interested in a particular type of image (e.g., challenges sponsored by a conference or by a prize, the financial lure of start-ups and funding competitions, etc.), the more likely samples of that type (or studies of this type as a whole) are to be prioritized and others excluded. A potential example of this could be generative AI models that produce images with colour palettes most similar to those of commercial images.
  • Wrong sample size bias: Samples which are too small can prove nothing; samples which are too large can prove anything.
    How does one determine sample population size for a given task and domain? Most in the field (perhaps all) follow the maxim ‘the more data the better’ (e.g., Anderson [50]) with little regard to its suitability for the desired outcome. One could imagine that, for a specific field or application, some type of bias is present in all available data or data-gathering methods, in which case more data can even degrade performance along a biased dimension.
  • Missing clinical data bias: Missing clinical data may be missing because they are normal, negative, never measured, or measured but never recorded.
    Tsotsos et al. [51] show how the range of possible camera parameters is not well sampled in modern datasets and how their settings make a difference in performance. Similarly, datasets are not typically annotated with the imaging geometry or lighting characteristics of each image, yet it is clear that each affects the image. Finally, there are many cases of particular data being missing, and several research efforts to ameliorate this problem were given in Section 1.1 above.
  • Diagnostic purity bias: When ‘pure’ diagnostic groups exclude co-morbidity, they may become non-representative.
    The computer vision counterpart here reflects what Marr [52] prescribed in his theory: that it was designed for stimuli where target and background have a clear psychophysical boundary. That is, object context, clutter, occlusions, and lighting effects that somehow prevent a clear figure–ground segregation are not included in his theory. Many research works to this day still insist on images of this kind even though it is clear they represent only a small subset of all realistic images.
  • Unacceptable disease bias: When disorders are socially unacceptable (V.D., suicide, insanity), they tend to be under-reported.
    In modern computer vision, ‘nuisance’ or ‘adversarial’ situations are under-reported or avoided by engineering around them, while others suggest they can be ignored.
These are just 5 of the 22 sub-categories. The point is to highlight that bias in how one decides on training sets is a well-known issue and has been studied for decades. The reader may disagree with our specific interpretations (and we invite better ones), but the existence of these potential sources of bias is well accepted and undeniable. Acknowledging and incorporating this thinking into machine learning can only benefit the community and the applicability of the results to real-world, safety-critical problems. The remainder of this paper will consider selection bias in general and propose a theoretical framework for exploring how specific training data ‘holes’ might impact generalization with respect to a required performance specification.

1.3. Setting up the Problem

None of the approaches in Section 1.1 seem to have made a definitive difference on the overall issue. None ask the question: If a network is trained with a biased dataset that misses particular samples corresponding to some defining domain attribute, can it generalize to the full domain from which that training dataset was extracted? It is important to note that, due to the impossibly large input space, it is not feasible to prove empirically, through exhaustive testing, that generalization abilities are complete for any network. Perhaps a different perspective might be helpful.
For our purposes, we need the following. First, we note that our focus is on visible light images taken with conventional camera systems. Although the conclusions might apply to other kinds of images (LiDAR, thermal, etc.), these are not considered. Other domains, such as NLP, are also not considered, although a similar analysis might apply.
Second, the generalization errors the above papers report, let alone the empirical errors, are not in the ranges one would expect for high-performance engineering tasks. If a system has a maximum empirical error that suffices to pass an engineering specification, then its generalization error should also meet that same specification. We will perform our analysis with this as a key consideration.
Third, we note that any image training or test set is a subset of the set of all possible discernible images. This might be trivially obvious, but there is an important point here. Consider images of size 32 × 32 pixels with 3 bits per pixel (one bit per colour). The number of possible images of that size is 2^3072, or about 10^925. The number of images of size 256 × 256 with three bytes per pixel is about 10^500,000. The trouble with such estimates is that most of these images do not represent scenes that could possibly have been captured by a camera. Pavlidis [53], who published the just-mentioned example, concludes that it is impractical to construct training or testing sets of images that cover a substantial subset of all possible images. He estimates that 5^36 is a lower bound on the number of discernible images (5^36 is about 1.5 × 10^25). Given this, even ImageNet, whose 30 April 2010 size is documented at 14,197,122 images, is 18 orders of magnitude too small. For a particular domain (e.g., face classification), the number of all possible discernible images that contain faces is much, much smaller, but still must be quite large. Making assumptions about face location (e.g., is it centred in the image?), face pose (e.g., is it a full frontal head shot? does the head tilt?), image background (e.g., is it on a plain background?), etc., reduces the domain significantly, but is that sufficient? Biases due to other factors are well documented as causing problems (skin tone, lighting, hair, glasses, masks, etc.). The conclusion here is that the domain of images, even if application-specific, can still contain far more instances than is practical to represent or use. More importantly, the set of attributes that fully defines the domain may be unknown (or unknowable).
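The arithmetic behind these orders of magnitude is easy to check; the short sketch below uses only the figures quoted above:

```python
import math

# 32 x 32 pixels, 3 bits per pixel -> 2**3072 possible images.
print(3072 * math.log10(2))              # ~924.8, i.e., about 10**925

# Pavlidis's lower bound on discernible images: 5**36.
print(math.log10(5**36))                 # ~25.2, i.e., about 1.5 * 10**25

# Gap between that lower bound and ImageNet's documented size.
imagenet_size = 14_197_122
print(math.log10(5**36 / imagenet_size)) # ~18 orders of magnitude
```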
Even though there are many effective approaches to creating a training set, bias seems not well handled. It seems quite easy to not include one or more of a domain’s attributes or to not have sufficient representation for a particular one within a training set. This is particularly true for ‘in the wild’ efforts. Missing data would mean that for some domain attribute, its value in a particular training sample is missing (the colour of the traffic light is missing, there is no cloudy scene, etc.). No learning method can possibly know in advance how many attributes define a given domain, nor can it know if enough training data is provided to sufficiently enable a high-quality feature extraction process. For example, suppose it is possible to collect a large number of images of several classes of objects in natural scenes but with the light source always in the same position (i.e., identical imaging geometry, such as if all images are taken on clear, sunny days at around 9 am). Would a network trained on this dataset generalize to images of those same objects in the same scenes but with differing imaging geometries (e.g., different light source position with respect to the scene and camera)? What about different camera parameter settings? There is already at least one set of studies showing that camera settings make a very large difference to the performance of both classical and deep learning algorithms [51,54,55]. The initial training set could be considered biased with respect to the dimension of scene lighting since it did not provide the network with samples of many different variations. Similar situations can be considered for a large variety of other image variations (object size, object colour, object 3D pose, contrast with background, etc.). It is very difficult to think that training sets can be easily defined that provide sufficient training samples along each possible variation. Any practical training process for a non-trivial and not artificially constrained domain is likely biased.
As noted earlier, the types of bias include selection, detection, observation, misclassification, and recall bias. For our purposes, the type of bias we focus on is selection bias (other forms of bias remain for future research). Selection bias can result when the selection of subjects leads to a result that is different from what would have been achieved had the entire target population been considered. In vision, a training set is a very tiny subset of the full population of all possible images. The problem is worse when one considers the feature level because of the combinatorially large number of feature combinations (e.g., Tsotsos [56]). Ahmad et al. [57] also point out this combinatorial nature and conclude that for each of these combinations, enough patterns must be included to accurately estimate the posterior density. The satisfaction of this may not be possible in practice unless the domain is small and very well defined. This is where the generalization properties observed in learned networks are hoped to help. However, the limitations of this hope are not well understood.

1.4. The Approach

Our main focus will be on the effect of domain attributes that are unrepresented in training sets due to the potential intractability of complete representations as mentioned above (or perhaps for other reasons as well). This includes any training dataset where, for some defining attribute of the domain of interest, one or more of its possible values has no instance (e.g., if colour is an attribute, then ‘red’ is a possible value but there is no instance of it in the training set). Addressing this problem empirically seems intractable. We thus approach it in the tradition of the Thought Experiment, whose most famous instance is perhaps Schrödinger’s Cat. Our motivating question was given in the abstract: What are the limits of generalization given such bias, and up to what point might it be sufficient for a real problem task? To be more specific: If a network is trained with a dataset that misses particular values of some defining domain attribute, can it generalize to the full domain from which that training dataset was extracted while maintaining its performance accuracy? In vision, no current training set fully captures all properties of the representation of visual information (many datasets have quite broad coverage, but we are unaware of any that cover the full range of possible direct and ambient illuminations or viewpoints, for example, for any particular domain). We run this thought experiment on a real domain of visual objects, LEGO® bricks, that we can fully characterize and look at specific gaps in training data and their impact on qualitative performance as well as their impact on the engineering specifications required for acceptable performance in the domain. We reach the following conclusions: first, that generalization behaviour is dependent on how sufficiently the particular dimensions of the domain are represented during training; second, that the utility of any generalization is completely dependent on the acceptable system error; and third, that specific visual features of objects, such as pose orientations out of the imaging plane or colours, may not be recoverable if not represented sufficiently during training. These conclusions then guide a more tractable approach to the empirical investigation.
We proceed as follows. We will define a hypothetical task in a real problem domain, the domain of LEGO® bricks (LEGO® played no role in this research, and nothing in this manuscript can be considered as representing any aspect of the company or its products. The choice of these building blocks, as opposed to others, is due to their familiarity around the world as well as to the extensive public documentation provided by many authors on the WWW). We will then assume that the task can be sufficiently solved by a correctly trained neural network. The training dataset will then be manipulated, removing particular attribute values, and the assumption will be made that the network now trained with the altered dataset will generalize so that it performs as well as the original. The resulting network will be examined in order to determine how the generalization might come about. Specifically, the approach is Reductio ad Absurdum via Backcasting, and this will be further explained below. Our analysis will tie the feasibility of generalization for particular domain attributes to the target performance specifications, because for any applied domain, this seems a critical consideration. Part II of this paper will provide empirical support for our theoretical analysis. First, we specify our hypothetical problem domain and task.

2. The Domain of LEGO Bricks

Suppose the well-known toy manufacturer, LEGO®, wishes to take advantage of computer vision advances to open a new premium service that allows customers to order any combination of bricks in their inventory. They wish to visually recognize LEGO® bricks as they pass in front of a camera on a conveyor belt in order to sort them into bins depending on a specified definition (e.g., by colour or by shape, or by number of studs). The full variety of pieces is considered including many special purpose ones as the figures below will show. The bin definitions may vary depending on customer requests. A customer might order 10,000 LEGO® bricks all of a particular colour but randomly selected otherwise. Another might wish all bricks that are of the ‘plate’ category with fewer than 12 studs, but of all colours. In order to visually recognize the bricks, we will use a deep learning approach; let us assume that the learning system we employ works perfectly and that training error equals zero. We also assume that the standard procedure to reach the best generalization is followed, as described in many of the papers cited earlier. It must be stressed that we do not seek a solution to this task (which would likely be straightforward). Rather, we use this scenario as the backdrop of a theoretical probe into generalization.

2.1. The LEGO Brick World

What is the dimensionality of the LEGO® bricks domain? LEGO® bricks can be fully characterized by the following attributes: colour, size, type, sub-type, dimension (m × n) in studs, and part attributes. Fortunately, others have put much effort into documenting this domain. An online source of container labels that gives a very broad coverage of bricks that have actually been produced over the lifetime of LEGO® is given by BrickGun Storage Labels [58]. (It should be clear that there are actually several such web sites where LEGO® parts and colours are enumerated and catalogued, and they do not all agree in their details. Here, this catalogue is chosen, but we could have performed the analysis, changed in inconsequential ways, with any of the others). The full catalogue of bricks breaks down as follows:
Category: Number of Sub-Types in Each Category
Plates 1 × N: 48
Plates 2 × N: 49
Plates 3 × N–4 × N: 25
Bricks 1 × N: 48
Bricks 2 × N: 33
Wedge-Plates: 29
Tiles: 49
Slopes: 106
Plates 5 × N–6 × N plus Plates 5 × N+: 22
Hinges: 66
Wedge-Bricks: 45
There are also label pages for the Technic® series of bricks (Technic® Bricks, Technic® Beams, Technic® Axles, Technic® Gears, Technic® Pins & Connectors), but these are not included here, and the count reflects only the classic LEGO® collection. It should be noted, as will be seen in the figures below, that a large assortment of unusual formations is included in these collections; all blocks shown in the following figures are from one of the above categories in Boen’s catalogue. N in the category names above refers to the number of studs per brick (e.g., a 2 × 6 plate is a flat piece with studs placed 2 across and 6 deep). Some of Boen’s labels are for bins that store pairs of similar pieces, and these are not included in the above count. The counts given are of the unique brick types in this label dataset; the total number of sub-types is 520. Many pieces do not neatly fit into simple categories, even though they are so included, and examples will appear below.
There is a separate label page of all possible brick colours in the BrickGun Storage Labels [58] catalogue, Colour Bars, and these total 141. Since any brick may have any colour, the number of possible unique LEGO® pieces is 520 × 141 = 73,320. This is the space of all possibilities, but from a practical standpoint, not all are in current production. Diaz (2008) [59] provides the following estimates:
  • LEGO® went from 12,000 different pieces to 6800 in the last few years, a number that includes the colour variations.
  • Staple colours are red, yellow, blue, green, black, and white. We thus assume equal numbers of each of 6 colours—this means 1133 unique brick types.
  • Approximately 19 billion LEGO® elements are produced per year. Moreover, 2.16 million are moulded every hour, 36,000 every minute.
  • Additionally, 18 elements in every million produced fail to meet the company’s high standards.
For the purposes of our argument, this is sufficient to provide a good characterization of the size of the domain and of its dimensions (even though they are only coarsely specified or subject to variation). For example, in the Bricks 1 × N type, there are 9 simple types, i.e., 1 × 1, 1 × 2, 1 × 3, 1 × 4, 1 × 6, 1 × 8, 1 × 10, 1 × 12, and 1 × 16, but also 40 more complex shapes with varying counts of studs. The number of studs could be used as a dimension on its own, but it would only apply to a subset of all the types. Our argument will not depend on the exact ranges of each dimension.
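The counting done in this subsection can be summarized in a short sketch (our own check of the figures quoted above):

```python
# Sub-type counts per category from the BrickGun label catalogue [58].
subtype_counts = [48, 49, 25, 48, 33, 29, 49, 106, 22, 66, 45]
total_subtypes = sum(subtype_counts)            # 520

num_colours = 141
print(total_subtypes * num_colours)             # 73,320 possible unique pieces

# Practical estimate from Diaz [59]: ~6800 pieces in production,
# colour variations included, assumed split evenly over 6 staple colours.
print(6800 // 6)                                # ~1133 unique brick types
```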

2.2. Detection and Binning Scenario

The following are elements that define the scenario for our problem of sorting bricks into bins depending on a specified bin definition that satisfies customer requests (see the construction by Daniel West, https://www.youtube.com/watch?v=04JkdHEX3Yk (accessed on 30 April 2022). As impressive as this is, it does not provide a solution for the task we pose. Nevertheless, he solved a variety of difficult mechanical problems whose solutions will be taken as an existence proof for our scenario and not further discussed):
  • There is a conveyor belt moving at a speed compatible with the speed of production of the bricks, i.e., using the data described earlier, so that 600 elements can be processed per second. It is acceptable that many conveyor belts might be used operating in parallel.
  • Lighting is fixed; camera position and optical parameters are fixed and known; the camera’s optical axis is perpendicular to the conveyor belt and centred at the belt’s centreline (the camera is directly overhead and pointing down to the centre of the conveyor belt); shadows are minimized or non-existent; the appearance of the conveyor belt permits easy figure–ground segregation.
  • An appropriate system exists for understanding a customer request (e.g., “I need 500 bricks, only yellow and with fewer than 6 studs”) and deploying the appropriate systems for its realization.
  • The pose of each brick on the conveyor belt may be upright, upside-down, or on one of its other stable sides. Each brick has its own set of stable sides on which it might lie. Each brick would have at least 2 stable positions, up to perhaps a maximum of 6 (the sides of a cube). We conservatively assume 3 as the average, but this is likely low.
  • When resting on a stable side, bricks may be at any orientation relative to the plane of the conveyor belt; we assume the recognition system understands orientation quantized to 5°, i.e., 72 orientations (360 orientations seems unnecessarily fine-grained, whereas 4 seems too coarse. The assumption is made to enable the ‘back-of-the-envelope’ kind of counting argument made here and could easily be some other sensible value. It also seems sensible due to the restrictions on camera viewpoint described earlier).
  • We assume that bricks do not overlap and thus do not occlude each other for the camera.
  • The number of possible images using the complete library of bricks can be given by 3 (poses) × 520 (brick types) × 141 (colours) × 72 (orientations) = 15,837,120 (a single instance of each part at each orientation and pose). In our argument, we will use the actual production numbers described earlier, namely, 3 (poses) × 6800 (bricks of the standard 6 colours) × 72 (orientations) = 1,468,800; these figures are verified in the sketch following this list.
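The scenario arithmetic can be checked with a few lines (our own sketch; the throughput figure simply converts the production rate quoted in Section 2.1):

```python
poses, brick_types, colours, orientations = 3, 520, 141, 72
print(poses * brick_types * colours * orientations)   # 15,837,120 possible images

# Using the practical production estimate of 6800 pieces (colours included):
print(poses * 6800 * orientations)                    # 1,468,800 images

# Throughput: 36,000 elements moulded per minute -> 600 per second.
print(36_000 // 60)                                   # 600 elements per second
```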
Examples of user requests could be ‘all slopes’, ‘all 1 × 4 plates’, ‘all bricks of colour C’, ‘all bricks with more than 4 studs’, or other similar requests. There may be more than one task executed concurrently, but this is not relevant to our argument. We further assume that the error for this binning task needs to be very small. The LEGO® brick manufacturing error value is 1.8 × 10⁻⁵ (according to a report by Diaz [59]), and it is assumed that any additional error associated with satisfying these user requests should not add to this error appreciably.
The number of possible images is easily manageable by contemporary infrastructure and also seems to be within the reach of the learning/memorization ability of current neural networks. In other words, a product manager at LEGO® may just demand that such a dataset be generated and that a big enough neural network be used. However, our goal is to use this network as an experimental vehicle to explore generalization, and we will do this by investigating how generalization error is affected by deficiencies (selection biases) in training data.

3. The Thought Experiment: Setup

Thought experiments generally:
  • Challenge (or even refute) a prevailing theory, often involving proof via contradiction (Reductio ad Absurdum);
  • Confirm a prevailing theory;
  • Establish a new theory;
  • Simultaneously refute a prevailing theory and establish a new theory through a process of mutual exclusion.
The tactic we will employ here is Reductio ad Absurdum via the methodology of Backcasting (introduced by Robinson [60]) in order to challenge the prevailing belief that deep learning networks generalize in a useful manner and to probe into generalization behaviour. Backcasting involves establishing the description of a very definite and very specific future situation. It then involves an imaginary moving backwards in time, step-by-step, in as many stages as are considered necessary, from the future to the present to reveal the mechanism through which that particular specified future could be attained from the present. See Figure 1 for a graphical depiction. Backcasting is not concerned with predicting the future; the major distinguishing characteristic of Backcasting analyses is the concern, not with likely futures, but with how desirable futures can be attained.
Assume as the starting point for our Backcasting tactic that the recognition task for the LEGO® brick binning problem can be learned, and that when trained and validated on the full set of samples as defined by the above description, network A is produced with empirical error less than 1.8 × 10⁻⁵ (the overall manufacturing error set by LEGO®). Of course, this might mean that the full set is memorized; and since it is the full set, no generalization is even needed. But this is not relevant here. It does not matter how the network achieves this accuracy, only that it does. We then probe the speculation that if training of the same architecture is biased, its generalization properties will overcome the bias with the same acceptable performance level. We will test how this speculation might be verified by examining the logical functions that would be needed in the architecture and its performance under that speculation. Failing the performance specification would be the proof by contradiction (Reductio ad Absurdum) that our speculation was flawed.
An important assumption is that regardless of the network configuration, the network includes all the classes that would be required to satisfy any user query (classes for brick shape, colour, or size defined by number of studs). Supervised learning would include all the required known classes. In other words, the range of possible user queries defines the output classes of the network. We also assume that any user query is parsed correctly. Then, to create the specific consequent required by Backcasting, the exact architecture of this network (specifically, all output classes required for any user query are kept intact) is re-initialized and re-trained with a biased training set. This hypothetical may seem worrisome, but this is part of the imagination of the experiment. If, for example, a complete re-training of a new network were undertaken with the biased dataset, it might be that one or more output classes corresponding to specifics of user queries would not be present. This would of course be unacceptable for the application, but it would also not allow the thought experiment to proceed. Such a scenario is reminiscent of a study by Xue et al. [61], which concludes that if labels for a particular category are absent from a dataset, the network will treat the unlabelled object in the image as background or ‘other’. In our case, background would mean the conveyor belt, and we assume the necessary labels also exist.
If a network effectively generalizes, this means that its test error is less than some bound determined by the problem domain. For the LEGO® problem domain, the empirical error bound can be assumed to be less than 0.000018, based on the manufacturing error described earlier. The goal is to minimize the generalization error G, the difference between the expected and empirical errors. As mentioned earlier, the binning process should not exceed this level of error, and, therefore, we will assume G < 0.000018 is a reasonable target for overall error.
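Written out in standard notation (ours, not the paper’s), with D the distribution over LEGO® brick images, f the trained classifier, R the expected risk, and R̂ the empirical risk over the n training/validation samples, the target above reads:

```latex
R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbf{1}\{f(x)\neq y\}\right],\qquad
\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{f(x_i)\neq y_i\},\qquad
G = \left|\,R(f) - \hat{R}(f)\,\right| < 1.8\times 10^{-5}.
```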
Assume the newly trained system A (the system responsible only for visual recognition of the bricks) satisfies this target: G_A < 0.000018. The subscript on G will denote the particular network being considered. The main concern is generalization of the visual recognition performance under different training biases. No assumptions are made about any particular capabilities of the network. For example, CNNs are by design somewhat location insensitive; however, all LEGO® blocks are imaged centrally, so this is not relevant. No particular knowledge about the task is present in the network before the learning phase.
Why is this a reasonable test? It is a common complaint that datasets for autonomous driving, for example, often are predominantly created in good weather, with no snow or fog or heavy rain or hail, with little traffic, or have other idiosyncrasies that make their datasets represent incomplete training populations. Although the full domain for driving necessarily contains all weather elements, all traffic elements, etc., it is impossible to fully represent these in any finite dataset. Thus, a problem is evident: how can the system architect ensure appropriate generalization behaviour when the training set may not include any samples from a significant portion of the domain population? Is this even feasible? Our thought experiment clearly is as relevant as this. We will explore the issues as our thought experiment unfolds.

4. The Thought Experiment: Cases Considered

Assume training in our LEGO® domain is conducted while leaving out one value of some attribute dimension from the training set, creating a new network A_i, where i denotes the dimension for which one possible value is not represented at all in the training set. This would not be unlike training an autonomous car using a dataset where only sunny, daylight scenes are included. That ‘left out’ dimension i, with respect to the visual appearance of the bricks, is one of the following: colour (e.g., leave out one colour), size (e.g., leave out all 2 × 2 bricks), shape (e.g., leave out all slopes), orientation (e.g., leave out images where the long axis of a brick is parallel to the conveyor belt), or pose (e.g., leave out all upside-down poses). We could also consider more than one such ‘left out’ value or dimension. For each case, we will consider the resulting generalization error.
Going back to Figure 1, the flowchart of the Backcasting process, we use the ‘current status quo’, namely the state-of-the-art in computer vision, to ‘create specific consequent’, namely the network A_i, which will be assumed to satisfy all production specifications LEGO® would require. We will consider a number of different sources of possible training biases and their impact on error for this network. For each, the overall error will be estimated and given as the sum of the error of the full network A (G_A < 0.000018) and any new errors introduced as a result of the changes in the training set. As in Figure 1, one or more steps will be taken towards the ‘logical antecedent’, in other words, towards what must be true in order for network A_i to perform as assumed. Whether or not the logical antecedent is true or even realistic will determine our conclusions regarding generalization.

4.1. Training Set Missing One Brick Colour

Suppose one colour of the staple LEGO® brick set possibilities (red, blue, green, white, black, or yellow) is left out of the training dataset and that the resulting system A_c satisfies the empirical error stipulated above. This means 5 colours are sufficiently represented and thus are learned correctly. This places the state of our argument at the ‘create specific consequent’ stage of the Backcasting flowchart in Figure 1. We now proceed through the boxes in the figure labelled 1 through to the ‘logical antecedent’. (These steps will be the same for each of the cases described below and are not repeated in those descriptions). Although the colour label set referred to above includes 141 colours, these 6 are the standard ones and, in any case, sufficient to demonstrate the point here (A different author says the 2016 LEGO® Colour Palette has 39 solid, 3 metallic, 1 glowing, and 14 translucent colours. Among these, there is 1 red, 2 different yellows, and several blues and greens. https://www.bricklink.com/v3/studio/design.page?idModel=169765 (accessed on 30 April 2022). We could redo our experiment with red or with both yellows missing and the result would be the same). If those 5 colours are learned and thus ‘in the weights’ of the network, then those weights might be invoked when the ‘left out’ colour is seen.
Colour is complicated (see Koenderink’s paper [62]). The most common representation for images, such as in those datasets in common use, is that each pixel in an image is represented by R (red), G (green), and B (blue) values, each coded using 8 bits. In other words, for each of the colours, 256 gradations are possible, yielding a potential 16,777,216 colours. Luminance or intensity is not independently coded but is rather derived from these as an average of the three values. The values for each of R, G, and B come from image sensor outputs, and the most common practice is for the sensor to employ a Bayer filter (a pattern of colour filters with specific spatial distribution) and then to be further processed, including to de-mosaic the image. These colour filters are designed to specific characteristics: B is a low-pass filter, G is a band-pass filter, and R is a high-pass filter, attempting to fully cover the range 400–700 nm. Spatially, the Bayer filter is arranged to have 2 green filters for each red or blue, that is, each image pixel is the result of these 4 samples, 2 green, 1 red, and 1 blue. The filter pass ranges overlap only by small amounts. The goal is to match the wavelength absorption characteristics of human retinal cones, which very roughly are as follows: S (blue) 350–475 nm, peaking at 419 nm; M (green) 430–625 nm, peaking at 531 nm; L (red) 460–650 nm, peaking at 559 nm. Rods, for completeness, respond from about 400 to 590 nm, peaking at 496 nm. An interesting characteristic is that beyond 550 nm, only the M and L systems respond. Koenderink points out that the distributions in the human eye are not ideal in how they cover the spectrum; an ideal system would more equally cover the full spectrum. This is what the Bayer filter attempts to accomplish. Of course, there are many colour models and encoding schemes, but this is the most common.
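To make the sampling arrangement concrete, here is a minimal schematic of an RGGB Bayer mosaic together with the naive 2 × 2 averaging alluded to above (a deliberate simplification on our part; real camera pipelines use far more sophisticated demosaicking):

```python
import numpy as np

def bayer_mosaic(rgb):
    """Sample a full-colour image through an RGGB Bayer pattern.

    rgb: H x W x 3 array with H, W even.  Returns a single-channel mosaic in
    which each 2x2 tile holds [R G / G B] samples, as described in the text.
    """
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B
    return mosaic

def naive_demosaic(mosaic):
    """Collapse each 2x2 tile into one RGB pixel: 1 R, the mean of 2 G, 1 B."""
    r = mosaic[0::2, 0::2]
    g = 0.5 * (mosaic[0::2, 1::2] + mosaic[1::2, 0::2])
    b = mosaic[1::2, 1::2]
    return np.stack([r, g, b], axis=-1)

rgb = np.random.default_rng(0).random((4, 4, 3))      # stand-in image
print(naive_demosaic(bayer_mosaic(rgb)).shape)        # (2, 2, 3)
```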
Suppose the ‘left out’ brick colour is yellow. As Koenderink [62] points out, the spectra of common objects span most of the visible light spectrum (see his Figure 5.18 [62] that shows 6 different spectra of real ‘yellow’ objects; for each, most of the visible spectrum has some representation except blue). So, if one considers the RGB representation across all the bricks, most of the other coloured bricks will have some yellow spectral content but this will never be independently represented. The portion of the spectrum we commonly see as yellow is quite narrow, centred around 580 nm. In both human vision and using the Bayer filters, these wavelengths are not sampled by the blue component at all. In human vision, the absorption spectra of the L and M cones have good sensitivity to the yellow wavelengths (their peaks are at 531 and 559, respectively). However, using Bayer filters, most applications try to minimize overlap; so, at the yellow wavelengths, sensitivity is typically low, less than 50% of the peak.
The question we ask is, if no yellow brick is part of the development of A_c, what is the resulting generalization error? Since colour is ‘in the weights’ of this network, and all non-yellow objects likely exhibit some yellow in their reflectance spectra, we now ask if it is enough to enable classification of a yellow brick via mixing.
The colour humans perceive is a complex function of the spectrum of the illuminant (and ambient reflected spectra), the angle of incidence of the illuminant on the surface being observed, the albedo of the surface, the angle of the viewer with respect to the surface, and the surrounding colours of the surface point being observed. Most of this is captured by the well-known Bidirectional Reflectance Distribution Function (BRDF) (see Koenderink’s paper [62] for more detail). In our brick binning task, much of this can be ignored. The illuminant, its angle to the conveyor belt, the camera and its angle to the conveyor belt, and the surrounding surface are all constant. The variables remaining are the surface albedo and the angle between the camera’s optical axis and the brick surfaces being imaged.
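Under the fixed geometry of the binning station, a simple Lambertian shading model is a reasonable stand-in for the full BRDF in a quick sketch (our own simplification; the albedo, normal, and light direction values are hypothetical):

```python
import numpy as np

def lambertian_intensity(albedo_rgb, normal, light_dir, light_rgb):
    """Observed RGB under a Lambertian approximation: albedo * max(0, n.l) * light.

    With the camera, illuminant, and belt fixed (as in the binning scenario),
    the only free quantities are the surface albedo and the surface normal of
    the visible brick face.
    """
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return np.asarray(albedo_rgb) * max(0.0, float(n @ l)) * np.asarray(light_rgb)

# Hypothetical yellow brick face seen from directly overhead, light slightly off-axis.
print(lambertian_intensity(albedo_rgb=[0.9, 0.85, 0.1],
                           normal=[0.0, 0.0, 1.0],
                           light_dir=[0.2, 0.0, 1.0],
                           light_rgb=[1.0, 1.0, 1.0]))
```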
It has long been accepted that colours can be formed by additive mixtures of other primary colours (Koenderink provides a history with some emphasis on Grassmann’s laws from 1853. Bouma in 1947 writes an interpretation of these as follows: “If we select three suitable spectral distributions (three kinds of light) we can reproduce each colour completely by additive mixing of these three basic colours (also called primary colours). The desired result can only be attained by one particular proportion of the quantities of the primary colours”). For our purposes, this means that even though some spectrum of colour is ‘in the weights’ of the network, their combination cannot necessarily result in any other colour if a primary colour is missing. Colour theory for mixtures tells us the following:
  • Green is created by combining equal parts of blue and yellow;
  • Black can be made with equal parts red, yellow, and blue, or blue and orange, red and green, or yellow and purple (The author of https://brickarchitect.com/color/ (accessed on 30 April 2022) points out that the LEGO Black is not a true black, but rather a very dark gray, and LEGO White is actually a light orange-ish gray);
  • White can be created by mixing red, green, and blue; alternatively, a yellow (say, 580 nm) and a blue (420 nm) will also give white.
However, red, blue, and yellow are primary colours and cannot be composed as a mixture of other colours. If a primary colour is unseen during training, there can be no set of weights that would represent it as a combination of other colours. This would mean that if a primary colour is unseen, it could not be classified. In a brick classification task such as ours, colour is an important dimension because it divides the entire population into 6 groups; then within each group are the different block types.
The above are all additive mixing rules. Subtractive mixing should be considered as well because networks employ both positive and negative weights, and it might be that this is an alternate avenue for dealing with colour. The additive rules arise from the RGB colour model while the subtractive rules belong to the CMYK colour model, which is not often seen in neural network formulations. The three primary colours typically used in subtractive colour mixing systems are cyan, magenta, and yellow. Cyan is composed of equal amounts of green and light blue. Magenta is composed of equal amounts of red and blue. Yellow is primary and cannot be composed of other colours in either colour model. In subtractive mixing, the absence of colour is white, and the presence of all three primary colours makes a neutral dark gray or black. Each of the colours may be constructed as follows:
  • Red is created by mixing only magenta and yellow;
  • Green is created by mixing only cyan and yellow;
  • Blue is created by mixing only cyan and magenta;
  • Black can be approximated by mixing cyan, magenta, and yellow, but pure black is nearly impossible to achieve.
As Koenderink [62] says, when describing his Figure 5.18 [62], the best yellow paints scatter all the wavelengths except the shorter ones. In fact, the spectra he shows for ‘lemon skin’, ‘buttercup flower’, ‘yellow marigold’, and ‘yellow delicious apple’ have significant strength throughout the red and green regions. In practice, a strong yellow, such as in a brick, would appear as a triple ( r , g , b ) in an RGB representation with 0 ≤ r, g, b ≤ 1 and r, g ≫ b, with b close to 0. It would therefore be reasonable to think that, since no yellow brick is seen in training, there would be no corresponding ability to classify one.
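To make this concrete, the short sketch below converts a hypothetical ‘strong yellow’ pixel to HSV; the particular values are illustrative assumptions of ours, not measurements from any brick image.

```python
import colorsys

# Hypothetical 'strong yellow' pixel: r and g high, b nearly absent.
r, g, b = 0.90, 0.90, 0.05

h, s, v = colorsys.rgb_to_hsv(r, g, b)
print(f"hue = {h * 360:.1f} deg, saturation = {s:.2f}, value = {v:.2f}")
# Prints a hue of 60 degrees, i.e., squarely in the yellow region,
# with the blue channel contributing almost nothing.
```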
With some generous assumptions about how the colours are represented in the weights and how they combine through the network, there may be a route to combinations that lead to most colours. After all, this would reflect the distributed representations underlying such networks [63]: each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities. But colour theory tells us that no combinations can yield yellow; it cannot be overcome using a learning strategy that never sees samples of these colours nor by distributed representation strategies. The only conclusion possible is that network A c where c ∈ {red, blue, yellow} will not properly generalize, but that c ∈ {green, white, black} might. This would of course be unacceptable to LEGO®. On the other hand, a dataset that is biased towards this latter colour subset might actually exhibit performance metrics for test error that could appear promising; this would, however, be a false indication. An analysis of failure instances would reveal this problem.
There are many other colour spaces. Rasouli & Tsotsos (2017) [64] review 20 different spaces and show how each leads to different characteristics for the detectability of objects. Above, we show only two of these. It might be that the correct choice of colour space for particular training data sets leads to different generalization properties for learned systems. This requires further research but the methods described here might be helpful.
Our hypothetical network A c where c ∈ {red, blue, yellow} would thus exhibit the following error. Leaving out one of these colours, if we assume all LEGO® bricks are made in all colours equally, means that 1/6th of all bricks cannot have their colour correctly classified and are erroneously binned. Since the total number of bricks possible is 1,468,800, as enumerated in Section 2.2, a 1/6th error (244,800/1,468,800 = 0.166667) overwhelms the manufacturing error of 0.000018. The overall error G, as defined in Section 4, can be estimated as G A c < 0.166685 . For A c where c ∈ {green, black, white}, using generous training assumptions, we can assume no additional error, so G A c < 0.000018 .
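The arithmetic behind these bounds can be checked directly; the following minimal sketch simply reproduces the quoted numbers from the counts given above and introduces no new data.

```python
TOTAL_IMAGES = 1_468_800          # total brick appearances enumerated in Section 2.2
MANUFACTURING_ERROR = 0.000018    # acceptable production error used throughout

missing = TOTAL_IMAGES // 6       # bricks of one primary colour never seen in training
colour_bias_error = missing / TOTAL_IMAGES
print(missing, f"{colour_bias_error:.6f}")                        # 244800 0.166667

# Bound for A_c with c in {red, blue, yellow}: the missing colour cannot be mixed.
print(f"G_Ac < {colour_bias_error + MANUFACTURING_ERROR:.6f}")    # 0.166685

# For c in {green, black, white}, mixing may succeed; only the production error remains.
print(f"G_Ac < {MANUFACTURING_ERROR:.6f}")                        # 0.000018
```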

4.2. Training Set Missing One Brick Size

Let us name the learned system where one of the sizes is left out of training A z , z ∈ { s 1 , s 2 , … , s k }, the set of all block sizes, and assume that it satisfies the empirical error stipulated above. There are many sizes in the label set we are using, and not all sizes apply to all brick types. The majority of the speciality bricks are of a unique size. Common pieces have several sizes. For example, in the Bricks 1 × N type, there are 9 simple sizes, i.e., 1 × 1 , 1 × 2 , 1 × 3 , 1 × 4 , 1 × 6 , 1 × 8 , 1 × 10 , 1 × 12 , and 1 × 16 , but also 40 more complex shapes with varying counts of studs, thus of differing physical size when compared to other bricks but without variation of size with respect to that particular type of brick. In other words, there seem to be at least two different dimensions along which size might be represented: stud count and brick volume. There may be additional ways as well; let us consider stud count only (in Figure 2, the bricks in panels Figure 2a, Figure 2b, Figure 2c, and Figure 2d have 4, 1, 2, and 3 studs, respectively). The minimum stud count is 1, and the maximum is 54 in the label set referenced above. The accurate size distribution is tedious to enumerate; it will likely be as informative to assume that there are equal numbers of each stud count, that is, approximately 73,320/54 = 1358 using the full brick set, or 1133/54 = 21 using the production values for unique brick types cited above. Suppose one stud count is left out of the training set but that the resulting system A z satisfies the empirical error stipulated above. How could the system generalize to that missing stud count? One might imagine that if a simple linear piece with 4 studs is left out of the training data, with generous assumptions, the learned portions of the network for the similar shapes might jointly fire and fill in the gap. This would mean that there is some combination of network elements that form a 4-stud straight piece, as shown in Figure 2a. Straightforward possibilities include Figure 2b–d. But then could the 3-stud piece in Figure 2e participate in this composition? How could the 4-stud bricks of Figure 2e–h be made? It is assumed that all the pieces in this figure, and in all other figures, are part of the training set for the ideal network A since they are all part of the brick sets enumerated in Section 2.1.
There are many similar questions given the variety of ways LEGO® has found to make bricks with 4 studs. It seems that the composition with learned pieces might provide a partial answer but not a complete solution. Certainly, any error measure would be increased even if we assume a partial solution.
Our hypothetical network A z would exhibit the following worst-case error. If all the bricks of a single stud count are binned incorrectly, that is, 21 out of 1130 bricks, the error is 21/1130 = 0.018584, which, when added to the overall production error, gives a cumulative error of G A z < 0.018602 .

4.3. Training Set Missing One Brick Orientation

Suppose one orientation in the imaging plane (i.e., parallel with the conveyor belt—recall the imaging geometry described in Section 2.2) is left out of the training set and that the resulting system A o satisfies the empirical error stipulated above. Simple data augmentation methods, such as spatial shifts or within-image-plane rotations, are likely to permit generalization to one, or perhaps more, orientations being omitted from a training set (more on data augmentation methods below). The lack of good representation of all orientations is probably not an insurmountable problem; this would, however, also depend on whether or not there is a need for precision grasping if a robotic manipulator is used, and this is not further considered here. Data augmentation is a relevant method for reducing orientation-in-the-imaging-plane sampling biases in training data given the restricted imaging geometry. The hypothetical network A o exhibits an overall error that is not increased due to this bias, and thus we can assume that G A o < 0.000018 .

4.4. Training Set Missing One Brick 3D Pose

Suppose one 3D pose is left out and that the resulting system A p satisfies the empirical error stipulated above. The LEGO® brick domain is well-suited to an aspect graph representation, and this is yet another advantage of choosing these bricks as our domain of interest. We could not easily enumerate all the characteristics of most other domains, certainly not of natural images. First, a brief overview of aspect graphs (from work by Cutler [65]) is presented.
An aspect is the topological appearance of an object when seen from a specific viewpoint. Imagine a sphere where each point on the sphere represents the viewing direction formed by that point and the sphere’s centre, as is shown in Figure 3. A Viewpoint Space Partition (VSP) is a partition of viewpoint space into maximal regions of constant aspect. An event is a change in topological appearance as viewpoint changes. A constant aspect then is a contiguous region on the sphere that is bounded by changes in topological appearance, i.e., events. The full space partition would show a different region for each face. Consider the simple example of a cube. The separate regions of the space partition would be composed of regions where only one side is visible, only two sides are visible, and only three sides are visible. There is no partition where 4 sides can be visible. There are 26 of these regions. An aspect graph is a graph with a node for every aspect and edges connecting adjacent aspects. The dual of a viewpoint space partition is the aspect graph. Aspect graphs can be made for convex polyhedra, non-convex polyhedra, general polyhedral scenes (same as the non-convex polyhedra case), and non-polyhedral objects (e.g., torus). In general, for convex objects, the size of the VSP and the aspect graph are of order O(n^3) while for non-convex objects, the VSP and aspect graph are of order O(n^9) under perspective projection (O(n^2) and O(n^6) under orthographic projection, respectively, where n is the number of faces of the polyhedron) [66]. The cube has fewer than this general number because it has 3 pairs of parallel planes that do not intersect and thus do not form a viewing possibility. Recognition algorithms that employ aspect graphs typically match a set of aspects to a possible reconstruction of a hypothesized object (Rosenfeld [67] proposes the first of these; Dickinson et al. [68] present a nice development using object primitives).
We will assume orthographic projection because we can engineer the imaging geometry to satisfy its properties. If one 3D pose is left out of the training set representation, for example, there are no images of any LEGO® brick with its top surface facing the conveyor belt, then the following results. If one side is never seen, it affects all aspects in which it participates. For a cube, this would mean the following: the aspect of that face by itself, the aspects with one neighbouring face (4), and the aspects with two neighbouring faces (4), for a total of nine aspects. In other words, 9 of the 26 aspects of the brick need to be generalized using the remaining 17 learned aspects; that is, 35% (9/26) of the possible constellation of aspects required for recognition would not be available. Recognition would fail if the observed viewpoint led to a set of visible aspects that overlapped these 9 aspects. For bricks more complex than a cube, this would differ.
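This count is easy to verify mechanically. The sketch below enumerates the aspects of a cube as the sets of faces that can be visible simultaneously (at most one face per axis) and counts how many involve a face that is never imaged; the face names and the visibility rule are our own encoding, not taken from any aspect-graph library.

```python
from itertools import combinations

# Faces of a cube, one +/- pair per axis.
AXES = "xyz"
faces = [s + a for a in AXES for s in "+-"]   # ['+x', '-x', '+y', '-y', '+z', '-z']

def is_aspect(subset):
    # A set of faces can be visible together only if it never contains
    # two opposite faces, i.e., at most one face per axis.
    axes_used = [f[1] for f in subset]
    return len(axes_used) == len(set(axes_used))

aspects = [frozenset(c)
           for k in (1, 2, 3)
           for c in combinations(faces, k)
           if is_aspect(c)]
print(len(aspects))                                   # 26 aspects for a cube

missing_face = "+z"                                   # face never imaged during training
affected = [a for a in aspects if missing_face in a]
print(len(affected), f"{len(affected) / len(aspects):.0%}")   # 9 aspects, 35%
```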
However, each possible missing pose does not necessarily represent a stable configuration for a brick lying on a conveyor belt. Of the brick types listed above (wedge-plates, tiles, bricks 1 × N , slopes, plates 3 × N / 4 × N , plates 2 × N , plates 1 × N ), many would have only 2 stable configurations: right-side up and upside-down, with no possibility of a stable sideways configuration.
Further, there are a number of shapes with unique characteristics such as those in Figure 4a–c. The first could appear on its side or on its back end with the protruding elements acting as stabilizers, so it might be tilted towards the camera. The second could easily be stud-side down, on each vertical side, or on the slant, but is not likely stable on its bottom or backside. The third might have 5 stable sides. In other words, the number of possible cells of an aspect partition differs for each brick type. It is quite possible to enumerate all of these given we have the library of part labels; however, that enumeration would be tedious, and an approximation will suffice. We assume, as mentioned earlier, that each unique brick has an average of 3 stable poses, so for 1130 unique brick types there would be a total of 3390 stable brick appearances due to changing pose. Call these expected stable poses p 1 , p 2 , and p 3 (perhaps right side up, upside-down, and on the left side, to pick one possible set of examples). For our cube example, above, a missing face would impact 9 out of 26 aspects, or about a third of all aspects. In other words, the ’left out’ pose in A p must be one of the brick’s stable poses, p 1 , p 2 , or p 3 , and not an arbitrary pose.
What is the impact of one stable pose, say p 1 , of all brick types being left out of a training set? The aspect graph itself would be incorrect. Not only would one face (at least) be missing (or presumed flat), but its interactions with its neighbouring faces would be incorrect. Recognition of that particular brick, if based on the learned visible aspects, would be impaired unless the particular viewpoint the camera sees is one from which only properly learned aspects are visible. If a third of the aspects are affected, then assuming all viewpoints are equiprobable, one third of all views of this brick would lead to erroneous recognition.
It is worth pointing out that the blocks in Figure 4c–f are examples of objects for which degenerate views are possible. The cross-section of each is the same, only the length differs, so even though a block may be in a stable pose, the imaging geometry would yield an ambiguous situation [68].
Is it possible that the other aspects for this brick or the aspects of other bricks could together combine to permit correct recognition? We could consider this on a case-by-case basis, but this would be quite tedious. Nevertheless, it might be, with generous assumptions about the capabilities of the learned network, that the learned elements corresponding to the bricks in Figure 4d–f could combine to provide what is needed to recognize the slope of intermediate size, shown in the previous set. But there is no corresponding brick combination for the first one in the previous set, nor for many other bricks; they are all quite unique. Thus, recognition of those bricks is assumed to fail if the proper constellation of aspects is not seen.
It seems safe to think that the error would be larger than that of the ideal network, which has already been pegged at G A < 0.000018 . If no brick is seen with p 2 , for example, it means that one third, or 1130, of the possible brick appearances would have to be recognized as a result of generalization. We can make generous assumptions about the generalization capabilities of the network, specifically that it can successfully handle similar bricks such as the roof-like ones just depicted even if no training sample for a particular pose is present. However, a rough count in the brick library yields 132 unique pieces that have no similar ones from which a generalization can easily be obtained, or in other words, 12% of the possible 1130 unique bricks.
A different possibility would be some kind of data augmentation in training set preparation. No data augmentation could remedy a missing pose because no such method inserts patches from other images (although the stud surface does, of course, appear in other poses). Even a data augmentation that could consider geometry would not be able to fill in the missing samples unless prior knowledge of what all the invisible aspects could look like, or assumptions of surface continuity, is somehow applied in the data augmentation process (see also work of Engstrom et al. [69] and Goodfellow et al. [70]). In any case, such an action has its own problems (as demonstrated by Rosenfeld et al. [71]) and would not be a sensible strategy.
Our hypothetical network A p would have the following error, assuming the least error implied by the above arguments. Suppose that the generalization process is effective for the bricks that might be derived additively from others (even if not straightforwardly), but that the 132 unique bricks are all classified incorrectly, a 12% error as described. Thus, an estimate of overall test error is G A p < 0.120018 . This estimate is likely too small.

4.5. Training Set Missing One Brick Shape

Suppose one shape is left out and that the resulting system A s satisfies the empirical error stipulated above. Shape is a more abstract notion here than the other dimensions considered. It is included because one might imagine a customer requesting an order of all sloped bricks, of all rectangular bricks, of all plates, etc. For example, the bricks in Figure 5a–d are all considered plates, and their regularity is the thickness of the base. On the other hand, the bricks in Figure 5e–h are classified as slopes because they all contain a sloped surface.
There are 106 slope brick sub-types. The question here is if all plates (or slopes, or bricks, or wedges, etc.) were left out of the training set, could the resulting network generalize so that they could be recognized sufficiently well?
Let us consider some of the details of our assumed network A s . These images contain no texture information useful to the task; thus, we assume that none is learned. The imaging geometry, especially the lighting, is such that shadows or other cues for shape from shading are not useful. We can also leave out colour for the current purpose. It is well known that many learned networks develop receptive field tunings in their early layers that are very similar to oriented Gabor filters, and this seems a good first level of processing for the brick images. One can easily imagine what a line drawing of each of the LEGO® bricks might be; the bricks are textureless. It is then an acceptable assumption that processing continues on such a representation.
How can shape be inferred from a line drawing? There is a wealth of literature on how computers might interpret line drawings. LEGO® brick shapes are mostly polyhedra, and one of their shape characteristics (but not a complete characterization to be sure) is the set of labels of their lines and vertices. A labelling of an image is an assignment to each edge of the image of one of the symbols +, −, →, or ← (marking convex, concave, or occluding edges), and similarly an assignment to each vertex of one of the many legal vertex (junction) types (Clowes [72], Waltz [73]; several sources enumerate the set, but the exact cardinality is not important). Such a labelling is a reasonable assumption as a representation of shape sufficient to enable discrimination of one shape from another, although it is not difficult to show counterexamples. For the purposes of our argument, this does not really cause any difficulties. Such an ‘in principle’ solution is instructive even if not precisely what a trained neural network realizes.
Kirousis & Papadimitriou [74] examined the processing of images of straight lines on the plane that arise from polyhedral scenes. They asked, given an image, is there a scene of which the image is the projection? Such images are called realizable. One classical approach to the realizability problem is through a combinatorial necessary condition. A labelling is legal if at each node of the image there is only one of the legal patterns. A legal labelling is consistent with a realization of the image if the way the edges are seen from the projection plane is the way indicated by the labelling. They provided a proof that it is NP-complete, given an image, to tell whether it has a legal labelling. This is true for opaque polyhedra and is even true in the simpler case of trihedral scenes (no four planes share a point) without shadows and cracks. Although there are methods for labelling such a line drawing, the problem is that it is exponential to determine if the labelling corresponds to a real object. In other words, since any algorithm for extracting straight lines from images may introduce errors, an error may signal an illegal labelling where there is none, or a legal labelling where there is not one, so a verification step is needed. Once it is known that there is a legal labelling, there exist algorithms for matching a labelled image to known objects that have known labellings. These results, importantly, are independent of algorithm or implementation; they apply to the problem as a whole.
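To illustrate why naive labelling is combinatorial, the toy sketch below enumerates every assignment of edge labels and keeps those that satisfy a per-junction catalogue of allowed patterns; the label set and catalogue interface are placeholders of ours, far simpler than the real Clowes/Waltz catalogue, and the point is only that the search space grows as 4^|edges|.

```python
from itertools import product

# Illustrative edge labels: convex (+), concave (-), and the two occluding directions.
EDGE_LABELS = ("+", "-", "->", "<-")

def legal_labellings(edges, junctions, allowed):
    """Brute-force search over all label assignments.

    edges     : list of edge identifiers
    junctions : dict mapping a junction id to the ordered tuple of its incident edges
    allowed   : dict mapping a junction id to the set of label tuples permitted there
    Returns every assignment for which each junction matches an allowed pattern.
    The loop visits 4**len(edges) assignments, which is the exponential blow-up
    discussed above; real systems prune this with constraint propagation instead.
    """
    legal = []
    for assignment in product(EDGE_LABELS, repeat=len(edges)):
        labels = dict(zip(edges, assignment))
        if all(tuple(labels[e] for e in incident) in allowed[j]
               for j, incident in junctions.items()):
            legal.append(labels)
    return legal
```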
Kirousis & Papadimitriou [74] also present an effective algorithm for the important special case of orthohedral scenes (all planes are normal to one of the three axes). It is tempting to think that this latter case applies to LEGO® bricks; most are indeed approximately orthohedral (the studs pose the exception), but then again there are many pieces within the library that are strictly polyhedral or involve curved segments. For example, the class Plates 1 × N includes the bricks of Figure 6a, the class Plates 2 × N includes the bricks of Figure 6b. Slopes include the bricks of Figure 6c. The class Brick 1 × N includes bricks such as that of Figure 4d. Even in decomposition these will contain elements that are non-polyhedral.
The task of realizing a general, non-orthohedral scene, given its labelled image can be solved by employing linear programming, i.e., a polynomial time algorithm exists. So, the problem remains to find the legal labelling. It is known that the labelling of trihedral scenes is NP-complete as is the complexity of labelling origami scenes, that is, scenes constructed by assembling planar panels of negligible thickness (see work by Parodi [75], Sugihara [76], Malik [77]).
It is difficult to accept that a deep learning procedure can effectively learn the solution to an NP-complete problem; it might, however, learn approximate solutions that are within some error bound ϵ s for subsets of the full problem. We will not explore this route but suggest that this makes the generalization issue even more difficult to address.
Note that we still have not come to the generalization issue. Suppose no slopes are included in the training set but a customer desires a box of all slopes in the LEGO® catalogue. This means that no instance of the configuration of lines and vertices seen in bricks such as that in Figure 6e has participated in any learning. It is highly unlikely that a network can construct a particular configuration out of learned elements that would suffice; after all, as shown, the problem in general is combinatorial, and even generous assumptions about learned networks cannot defeat this fact.
In order to quantify the expected error of the hypothetical network A s , we can be optimistic and say that the error would not be greater than the error incurred by misclassifying the smallest group of bricks in the catalogue. In other words, if all slopes are erroneously classified because slopes were not in the training set, it would mean that the remaining 520 − 106 = 414 types are correctly classified. The smallest such group in the catalogue is that of plates with 22 instances. Then, if those 22 classes are not part of the training set, 22 × 6 (poses) × 6 (colours) × 72 (orientations) = 9504 images are misclassified out of the possible 1,468,800 images, or 0.647%. Thus, an estimate of overall test error is G A s < 0.006488 . It should be clear that this is a very optimistic estimate.

4.6. Part I Summary

Having looked at five different cases of selection bias in training sets, we summarize the above analysis in Table 1. The hypothetical networks A c , A z , A p , and A s all have errors that are orders of magnitude too great given the production standards. As should be clear, if the training set is biased with respect to any characteristic except orientation in the imaging plane or non-primary colours, the resulting network is unlikely to generalize so that the required performance standard can be met. The particular values for generalization error may certainly invite argument. However, there are two indisputable facts that emerge. First, the error for these four networks is far beyond the acceptable manufacturing values; it is not small nor negligible. Second, there are at least a few training biases that cannot be generalized away. LEGO® bricks are a simple domain, one that can be completely characterized. Imagine how such a thought experiment might be impacted by a more complex domain.
PART II

5. Empirical Counterpart

To provide practical evidence for the arguments presented in the previous sections, in this section we empirically verify the conclusions from the thought experiment. Since conducting the experiments with real-world data would require a very large amount of data captured under exactly controlled conditions (e.g., same lighting, exact poses and orientations, etc.), we utilize rendering software to create the required images with photorealistic quality. We then train a number of state-of-the-art image classification networks on the resulting images. For each experiment, we exclude some specific attribute value (e.g., a specific colour or an orientation) from the training set. Here a clarification should be made about the different sub-sets of data used in the experiments. It is common practice to divide up the data into a test set, a training set, and a validation set. In these experiments, however, the data will be divided as follows (a minimal sketch of this split follows the list):
  • Training set: This is a biased dataset used to train the models. In each experiment, the training set is biased by excluding certain values of a specific attribute, for example, an experiment might exclude a particular colour, shape, or orientation. As a result, the training data does not represent the full range of variation for that attribute.
  • Control set: A subset of the training set is set aside as a control set to evaluate generalization under the same bias. These samples are excluded from the training process but share the same distribution as the training set. This allows us to isolate the effect of the training-test gap by ensuring any performance drop is not simply due to overfitting or memorization.
  • Test set: The test set contains samples with attribute values that are entirely missing from the training set. For instance, these samples might include shapes, colours, or poses that the model never encountered during training. These samples are considered out-of-distribution (OOD) relative to the biased training set.
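The sketch below shows one way such a three-way split can be produced; the 10% control fraction matches the experiments reported later, while the attribute predicate and record format are illustrative assumptions (the released experiment code linked in Section 5.2 is authoritative).

```python
import random

def split_by_attribute(samples, is_excluded, control_fraction=0.10, seed=0):
    """Split rendered images into a biased training set, a control set drawn
    from the same (biased) distribution, and an out-of-distribution test set.

    `samples` is a list of image records; `is_excluded(sample)` returns True when
    the sample carries the attribute value withheld from training (e.g., a held-out
    colour, pose, or shape)."""
    test = [s for s in samples if is_excluded(s)]            # OOD: never seen in training
    in_distribution = [s for s in samples if not is_excluded(s)]

    rng = random.Random(seed)
    rng.shuffle(in_distribution)
    n_control = int(len(in_distribution) * control_fraction)
    control = in_distribution[:n_control]                    # same distribution, held out of training
    train = in_distribution[n_control:]
    return train, control, test
```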

5.1. Dataset—Official LEGO Models

The experiments require a dataset that spans a full range of possible values for a number of attributes (colour, pose, orientation, etc.), which would be intractable to gather in a real-world setting. Instead, we replicate these conditions by rendering the needed dataset with photorealistic quality.
We use official LEGO models from the ldraw [78] library and render them using the Blender software (version 2.79) [79] and the ImportLDraw [80] library. There are a total of 1076 models in the ldraw parts library; however, many models in the repository are very similar, with only minor differences. For the purposes of our experiments, it is best if the models are generally well distinguishable so that the neural network models can discriminate them from one another. To tackle this issue, we use the Structural Similarity Index (SSIM) of Wang et al. [81] as a similarity measure between models and filter out those which are too similar. Specifically, we cluster similar brick models together and choose one model from each cluster at random. This procedure results in 837 final LEGO brick models. Figure 7a shows a number of these rendered images.
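A minimal sketch of this filtering step is shown below; the greedy grouping and the similarity threshold are simplifying assumptions of ours, standing in for the clustering procedure described above.

```python
from skimage.metrics import structural_similarity as ssim

def filter_similar_models(images, names, threshold=0.9):
    """Greedy SSIM-based filtering: keep one representative per group of
    near-duplicate renderings. `images` are grayscale arrays of equal size;
    `threshold` is an assumed similarity cut-off, not the value used in the paper."""
    kept, kept_names = [], []
    for img, name in zip(images, names):
        # Keep a model only if it is sufficiently dissimilar to everything kept so far.
        if all(ssim(img, ref, data_range=img.max() - img.min()) < threshold
               for ref in kept):
            kept.append(img)
            kept_names.append(name)
    return kept_names
```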
We use the Cycles engine in Blender with 200 sample iterations to generate each image; we find that 200 sample iterations provide images of acceptable quality. We also use a plastic bump and texture map for a more natural plastic look.
Each model is rendered in 21 colours which, except for black and white, are selected from the HSV colour space such that there is equal distance between their hues, so as to cover the full range of the hue space equally. The list of colours can be found in Table 2. Figure 7b shows one of the brick models in 6 of the colours.
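A sketch of how such a palette can be generated is given below; the counts follow the description above (19 equally spaced hues plus black and white), while the fixed saturation and value are placeholder assumptions, with the actual colour list given in Table 2.

```python
import colorsys

# Assumed reconstruction of the palette: 19 hues spaced equally around the hue
# circle, plus black and white.
N_HUES = 19
palette = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]            # black, white
palette += [colorsys.hsv_to_rgb(i / N_HUES, 1.0, 1.0)   # equally spaced hues
            for i in range(N_HUES)]
assert len(palette) == 21
```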
The experiment also calls for images of each model at different viewing angles. Since many LEGO models are symmetric, we limit the span of camera movement to an octant of the sphere (one eighth of the sphere), starting with the camera facing the brick’s side and moving 90 degrees to one side and upwards, in steps of 10 degrees, resulting in 81 different angles in total (the 90-degree position itself is not included). Another degree of freedom is the pose of each block, i.e., whether the block is lying face up, face down, or on one of its sides (for the distinction between pose and orientation, please refer to Figure 8a and Figure 9b). With 3 poses, 81 angles, and 21 colours for each class of LEGO bricks, we obtain 5103 images per class. Overall, this process produces 4,266,108 images. A brick viewed from different angles can be seen in Figure 9a.
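The enumeration of camera positions is sketched below; the spherical-coordinate convention and the unit radius are our own placeholders, since the actual camera setup lives in the released rendering code.

```python
import math

STEP_DEG = 10       # 10-degree steps over one octant of the viewing sphere
RADIUS = 1.0        # camera distance from the brick (placeholder value)

camera_positions = []
for azimuth in range(0, 90, STEP_DEG):           # 0..80 degrees, 9 values
    for elevation in range(0, 90, STEP_DEG):     # 0..80 degrees, 9 values
        a, e = math.radians(azimuth), math.radians(elevation)
        x = RADIUS * math.cos(e) * math.cos(a)
        y = RADIUS * math.cos(e) * math.sin(a)
        z = RADIUS * math.sin(e)
        camera_positions.append((x, y, z))

assert len(camera_positions) == 81               # the 81 viewing angles per pose
```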

The Size Specific Dataset

The size experiment is somewhat different from the other experiments in that it also requires control over the number of studs for bricks of the same shape. To achieve this, we also construct and include a set of rectangular bricks of different sizes. The models are first built in the ldraw format and then rendered using Blender in the same way as the official LEGO bricks in the previous section. Figure 8b shows some of these models (both datasets can be downloaded at https://data.nvision2.eecs.yorku.ca/LegoDataset/ (accessed on 8 September 2025); the code for rendering the datasets can be found at https://gitlab.nvision2.eecs.yorku.ca/brouhani/lbdg/-/tree/master (accessed on 8 September 2025)).

5.2. Experiments

We conduct experiments on three representative models: YOLOv8 [82], Vgg16 [83], and a vision transformer (ViT) [34]. The experimental setup depends on which particular attribute is being tested. We choose these networks because of their popularity. Vgg16 is a very well-known network which has been widely used and studied [84,85]. YOLOv8 was the most recent version of the YOLO framework at the time of these experiments. And vision transformers have been the building block and subject of many recent research works [86,87,88]. For the vision transformer, we use the implementation from https://github.com/lucidrains/vit-pytorch (accessed on 17 September 2025). We use the official Pytorch and Ultralytics implementations for Vgg and YOLO, respectively.
The models are trained with Stochastic Gradient Descent (SGD) with a learning rate of 0.0005 , a weight decay of 0.05 , and a momentum of 0.9 . The training phase is carried out until there is no increase in accuracy or drop in training loss for 1000 iterations (the code for our experiments can be found at https://gitlab.nvision2.eecs.yorku.ca/brouhani/selection_bias (accessed on 8 September 2025)). We allocate a maximum of 20 epochs for each training phase; however, in all experiments the models meet the early stopping criteria before reaching this limit. We use a batch size of 32 for all models. For training YOLOv8 and Vgg16, we use one Nvidia GeForce GTX 1080 Ti; for ViT, we use two GPUs.
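A minimal PyTorch sketch of this training configuration is given below; it is illustrative of the Vgg16/ViT setting (YOLOv8 is trained through the Ultralytics trainer with the same hyperparameters), and the early-stopping rule is simplified here to monitor only the training loss.

```python
import torch

def make_optimizer(model):
    # Hyperparameters as reported above.
    return torch.optim.SGD(model.parameters(),
                           lr=0.0005, weight_decay=0.05, momentum=0.9)

def train(model, loader, loss_fn, max_epochs=20, patience=1000, device="cuda"):
    """Minimal training loop with the early-stopping rule described above:
    stop when the training loss has not improved for `patience` iterations."""
    model.to(device)
    optimizer = make_optimizer(model)
    best_loss, stale = float("inf"), 0
    for _ in range(max_epochs):
        for images, labels in loader:            # batch size 32 in our runs
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:
                best_loss, stale = loss.item(), 0
            else:
                stale += 1
                if stale >= patience:
                    return model
    return model
```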

5.2.1. Colour Experiment

For the colour experiment, the training set includes all the bricks whose hue is in the range 1.4 to 2.5 (on a scale of 0 to 2 π in the HSV colour space). The test set includes the bricks with HSV hues ranging from 0 to 1.4 and from 2.5 to 2 π . Moreover, 10% of the images in the training set are randomly chosen for the control set. We end up with 2,149,383 images in the training set, 1,592,135 images in the test set, and 238,819 images in the control set. The colours selected for the test set are mostly colours close to red. We choose this range for the test set since red is both a primary colour and one of the RGB channels.

5.2.2. Orientation Experiment

There are 81 different possible orientations for each brick, consisting of the combinations of 9 values of the polar angle ϕ and 9 values of the azimuthal angle ψ for the spherical coordinates of the camera with respect to the brick (with a fixed radius). For the test set, we choose all the orientations with both ψ and ϕ values between 30 and 50 degrees (inclusive), which results in 9 out of the 81 different orientations. We choose these values so the network will need to interpolate instead of extrapolate when dealing with the test set (later experiments show that these networks perform better when the excluded orientations are not boundary cases). Every data point in which the orientation has a ψ or ϕ less than 30 or greater than 50 is included in the training set or control set. The control set is created by randomly selecting 10% of the training set.
The dataset also includes different poses for each brick (i.e., 0, 90, and 180 degrees; see Figure 9a,b for the distinction between pose and orientation). This creates a problem for isolating specific orientations from the training data: looking at a brick from orientation α might be equivalent to looking at it from a different orientation β if the pose has changed. To avoid these complications, we leave out all images with poses of 90 and 180 degrees from all datasets. Finally, we are left with 147,419 images in the test set, 1,061,423 images in the training set, and 117,935 images in the control set.
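The held-out orientations can be enumerated directly, as in the short sketch below; the angle names follow the description above, and the 10-degree grid is the one used for rendering.

```python
STEP = 10
angles = range(0, 90, STEP)                        # 0, 10, ..., 80 degrees on each axis

# Orientations withheld from training: both angles between 30 and 50 inclusive.
held_out = [(phi, psi) for phi in angles for psi in angles
            if 30 <= phi <= 50 and 30 <= psi <= 50]

print(len(held_out), "of", len(angles) ** 2)       # 9 of 81 orientations
```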

5.2.3. Pose Experiments

A brick’s pose refers to how it is positioned on the conveyor belt, e.g., upside-down, on the side, etc. Figure 9b shows these different poses.
It should be noted that since both pose and orientation pertain to the relative position of the camera and the brick, and also due to the symmetry of the bricks, different combinations of poses and orientations can produce the same images (e.g., there is no difference between rotating the brick 90 degrees about the z axis and moving the camera 90 degrees around it). However, we limit the range of orientations and poses such that there are no duplicate combinations of orientations and poses. Our dataset contains images of bricks in 3 different poses: upright, on one side, and upside-down (also denoted by degrees of rotation, 0, 90, and 180, respectively). The training set for the pose experiment includes all the images with the poses 0 and 180 (upright and upside-down). The test set includes all the images with the 90-degree pose (bricks lying on one side). Moreover, 10% of the training images are randomly chosen for the control set. There are 2,388,203 images in the training set, 1,326,779 images in the test set, and 265,355 images in the control set.

5.2.4. Shape Experiment

In this experiment, we train the network on a training set which is missing a subset of classes and then fine-tune it on the full dataset. The aim is to measure how well the network can classify new classes without having been trained on features specific to those classes. The training set contains 90% of the classes; the test set contains the remaining 10% of the classes. We also set 10% of the training set aside as the control set. The training set in this case contains 2,869,343 images; the test set and the control set contain 792,179 and 318,815 images, respectively.

5.2.5. Size Experiment

Similar to the shape experiment, in this experiment we train the network on a training set which does not include a subset of classes and then fine-tune it to measure how well the network will learn to classify the new classes using the biased features. In this setup, the classes excluded from the training set are the LEGO shapes of a specific size. Specifically, the brick sizes start from 3 × 3 , 3 × 4 , … and go up to 7 × 8 and 8 × 8 , so there are 21 different sizes in total. We exclude the brick sizes 5 × 5 , 5 × 6 , 5 × 7 , and 5 × 8 . The training set, test set, and control set contain 57,543, 12,222, and 6392 images, respectively. It should be noted that this experiment is conducted on the size-specific dataset described in Section 5.1.
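The size grid and the withheld subset can be enumerated directly, as below; this assumes the sizes are exactly the m × n stud grids with 3 ≤ m ≤ n ≤ 8 described above.

```python
from itertools import combinations_with_replacement

# Enumerate the rectangular brick sizes in the size-specific dataset
# (m x n studs with 3 <= m <= n <= 8) and the subset withheld from training.
sizes = list(combinations_with_replacement(range(3, 9), 2))   # (3,3), (3,4), ..., (8,8)
assert len(sizes) == 21

excluded = [(5, n) for n in (5, 6, 7, 8)]                     # sizes missing from training
training_sizes = [s for s in sizes if s not in excluded]
assert len(training_sizes) == 17
```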

5.3. Experiment Results

Table 3 shows the results of our experiments. It can be observed that, with the exception of the orientation scenario, in all other cases there is usually a considerable gap between the accuracy on the training set and on the test set. Moreover, it can be seen that the scenarios rank similarly across all three neural networks in terms of drop in accuracy, i.e., in general all networks find generalizing to new samples in some specific tasks more difficult than in others. Namely, the colour and pose scenarios, as expected, demonstrate the largest gap between the training set and the test set. This is expected since, in the case of the colour experiment, the channels which hold the information for classifying the bricks are biased and Convolutional Neural Networks are not channel invariant.
It is also interesting to observe that even though in the size experiment the bias is almost entirely due to the structural differences (i.e., the size of the brick and number of rows or columns of the studs) and not the more local features such as the specific curvatures of bricks, the models are unable to generalize to the new samples.
In the case of the orientation experiment, there is no gap between the test and control sets. One explanation might be that there are many different brick models in the dataset, each containing different angles and curves in its structure, which makes isolating a specific angle from the training set more difficult (compared to isolating a colour or a size, for example). To test this hypothesis, we carry out the same orientation experiment on the size dataset. The difference is that the size dataset only contains right angles between the edges of bricks, and therefore excluding an angle from the training set results in an actual bias in the training process. We test this on YOLOv8, since it had the best performance on the test set, and observe a considerable gap in accuracy between the biased and unbiased cases (66% vs. 99% for the test and control sets, respectively), confirming our hypothesis.

5.4. Part II Summary

In this section, we empirically tested the conclusions of the thought experiments of Part I. To this end, we synthesized a dataset so that we could have full control over the colour, pose, orientation, shape, and size of the bricks. In each experiment, we train the model on a dataset that is missing a part of the data distribution and test the model on a dataset that includes the previously excluded images. Our datasets contain a training set (the biased dataset; the model is trained on this data), a control set (also a biased dataset but reserved for testing), and a test set (the unbiased dataset used to measure how well the network generalizes). The thought experiment predicted that the accuracy of the models would drop when tested on data to which they had not been exposed during the training process. In our experiments, we observe that, overall, the models behave as we had concluded from the thought experiment. Although different types of bias cause gaps in performance to different extents, in almost all cases they do cause a gap. Figure 10 shows the performance of the models on the control set vs. the test set. A model’s performance lies on the identity line when it has the same accuracy on both datasets; if a model is above the identity line, it performs better on the control set than on the test set. As is evident from Figure 10, orientation is the only exception, with the models performing only slightly better on the control set; however, it should be noted that there can be different configurations of the orientation scenario, and in cases where a larger portion of the orientation range is excluded from the training set, a drop in performance might result.

6. General Discussion

The previous sections attempt to show that generalization in learned networks may not be what it seems. We will first address potential issues regarding our assumptions, approach, and analysis methods.

6.1. Methodology and Assumption

We have sought to answer the question: If a network is trained with a dataset that misses particular values of some defining domain attribute, can it generalize to the full domain from which that training dataset was extracted while maintaining its performance accuracy? This is different from what many others have considered regarding the generalization behaviour of learned networks. Rather than conduct a deep mathematical analysis or extensive empirical analysis, neither of which has provided an answer, we tried a different approach, a Thought Experiment. The strategy we used is Proof by Contradiction via Backcasting, both well-known and widely used methods. Backcasting involves establishing the description of a very definite and very specific future situation. This specific situation involved the assumption that our question was in fact answered in the positive, that learned networks effectively generalize even when training biases are present and that they effectively bridge such biases. We then defined a real, yet simple, domain on which we would test this future situation, that of LEGO® bricks. This seems to be a novel twist on the overall problem because it ties real engineering performance bounds to the network’s performance, something not often seen in the previous literature.
Regarding the details, we made a number of assumptions involving the LEGO® domain, and since all were made to represent the lower magnitudes of any counts, the overall set of such assumptions can be considered generous towards a negative result to our thought experiment. We assumed that the learned systems A and A i work perfectly and that training error equals zero. We also assumed that regardless of the training regimen or network configuration, each network includes all the classes that would be required to compose any user query (specifically, classes for each brick sub-type, each pose type, each orientation, and each colour).
It is certainly true that in vision, no current training set fully captures all dimensions of the representation of visual information. So, this has direct relevance to computer vision systems, and likely beyond to other domains. Our thought experiment unequivocally points to the conclusion that any generalization behaviour is completely dependent on the acceptable system error and on the nature of the domain data not represented in the training set.
Finally, it would be easy to argue that for the LEGO® domain, the space of possible images is small enough so that a well-designed neural network would perform very well, or even that there is now enough computer power cheaply available to enable a brute-force search solution. Both these points are true and in practice perhaps this is exactly how LEGO® might develop a practical system for this purpose. It is just as valid to think that there could be many engineering workarounds or manipulations for problems with how object pose or orientation might appear on a conveyor belt. Examples abound on modern manufacturing lines (tracking guides, slots, use of gravity, friction or puffs of air for positioning, etc.). But this was not the point of our exercise. We had no desire to present a realizable design for a production system. The point was to select a real and complex task, whose scale is manageable for complete representation, in order to test the limits of generalization in the face of training biases and performance requirements.

6.2. Cases Outside the Brick Binning Task

The argument of Section 5 assumes that only one value of some dimension is missing. What if the training bias encompasses some combination of attributes? For example, for some bricks a particular colour may be missing while for others some orientation is missing. Or perhaps there are no red bricks in the training set and some 3D poses are also missing. Worse still, suppose there are no yellow bricks and no slopes in the training set. It is straightforward to devise many more combinations. For some the impact might be small while for others large. It seems apparent that there are many combinations where no apparent satisfactory behaviour would be observed in terms of LEGO®’s production standards.
The LEGO® brick application was discussed with a few assumptions to simplify the hypothetical system. In a different real world setting when arbitrary kinds of objects might be involved in normal scenes, these assumptions will not be valid. First of all, there would not be a single well-isolated object in a scene; there would be many, likely with interactions among them. Changes in viewpoint of the observer or in general the overall geometry of the image formation process [89], shadows, or highlights and how they change with illumination direction and surface properties [90], accidental alignments of object features [68], and more will need to be considered. A detailed analysis of each of these complexities is beyond our scope; however, just for illustrative purposes, we consider the viewpoint change case for a set of assembled bricks, an object more complex than a single brick but still not quite as complex as an arbitrary real-world object.
Suppose we are presented with a construction of ‘standard’ LEGO® bricks (all edges are orthogonal to the others), say on a LEGO® plate of 10 × 24 studs, and the plate is presented with some random 3D pose to the observer. Let us ignore for the moment the impact of the studs on the brick surfaces. A simple enough task for an observer might be to simply count the number of bricks used in the construction. For our hypothetical scenario above, suppose a customer wanted 1000 random constructions of this form, each with exactly K bricks. Assuming a random brick construction engine, these items might pass under the same camera on a conveyor belt, and a system to count the bricks would then decide which random construction to give to this particular customer. As mentioned earlier, Kirousis & Papadimitriou [74] provide an effective algorithm for labelling all the visible lines and vertices of orthohedral scenes, so the recognition of each visible brick for counting purposes (only) is assumed solved. However, it is not difficult to see that if a brick is occluded by another, some or all of its lines and vertices would not be labelled and thus it could not be counted unless the viewpoint changes. The incidence of occlusions depends on the stable poses of the brick construction. One stable pose, perhaps the most stable, is for the supporting plate to lie flat on the conveyor belt. Occlusions would still occur if the construction included bricks that are only partially supported and overhang other bricks. Two other stable positions can be formed where one of the long ends of the plate (there are two long ends) rests on the belt but the overall construction is tipped over onto some other contact point, with the edge forming a stable pose. Here, occlusions would be frequent. As in the case of the previously described network A p , where one 3D pose is left out of the training set, no amount of training could fill in structure that cannot be observed. Active observation has been shown to be critical for many real-world visual tasks (for example, by Anderson [50] and Bajcsy et al. [91]). It is not difficult to imagine a wide variety of tasks outside the ones mentioned where similar problems would arise. It is important to note that for many, solutions could be easily engineered to avoid the problems; but that was not the point of our analysis. Our point was to ask questions about generalization, and in that respect, we conclude that great care needs to be applied when making assumptions about how a trained network can or cannot generalize.

6.3. Limits on Generalization

It should be clear that no neural network can make a provably undecidable problem decidable, nor a provably intractable problem tractable (such as the NP-complete problems described in Section 4.5). For such problems, the only recourse is approximation. The main question then is how far such an approximation would be from the production requirements of some particular application, such as our hypothetical brick binning task. For example, in the missing pose case of Section 4.4, it would not be unexpected that a full 3D brick might be classified even if the network was not trained on one possible pose; an interpolation that fills in any gaps would be likely. But how good is that interpolation in practice? The same is true for the missing colour case of Section 4.1; it is certain that some colour mixture might be output by the network, but how close is it to the required one? Modern applications employing learned networks for some domains approach high-90% accuracy levels, quite an accomplishment to be sure. However, our brick binning task needs several orders of magnitude better performance. Many applications involving human life and safety require even lower error bounds. In other applications, such as face recognition for photo-sharing, or casual language translation, errors have virtually no associated costs (from a life and safety perspective), so our analysis may seem not so relevant. Further, the proofs in Kratsios & Bilokopytov’s work [92] and others make it appear as if the approximation error can be arbitrarily small (but non-zero). However, it seems a great challenge to approach this theoretical limit given how the reported empirical errors of the latest neural networks are many orders of magnitude higher than those required for our brick binning task. These are empirical questions that cannot be answered here but are certainly important for future study.

6.4. Problems with Data Augmentation

As has been mentioned a few times so far, the set of techniques collectively known as data augmentation, a data-space solution to the problem of limited data, is commonly used to supplement the data assembled for training sets in learned systems (for reviews see surveys by Van Dyk & Meng [93] and Shorten & Khoshgoftaar [1], among others). Such techniques enhance the size and quality of training datasets, leading to better systems. This is realized under the assumption that more information can be extracted from the original dataset through augmentations. These augmentations artificially inflate the training dataset size by either data warping or oversampling. Many other strategies for increasing generalization performance focus on the model’s architecture itself, leading to progressively more complex architectures [1]. Other functional solutions to this problem include dropout regularization, batch normalization, transfer learning, and pretraining (e.g., Kukacka et al. [94]; Hernandez-Garcia & Konig [95]). This paper focuses on visual data; the image augmentation algorithms include geometric transformations, colour space augmentations, kernel filters, mixing images, random erasing, feature space augmentation, adversarial training, generative adversarial networks, neural style transfer, and meta-learning [1]. The augmented data is intended to represent a more complete sampling of possible data points, thus decreasing the distance between the training and validation sets, as well as test sets.
All of these have proved themselves empirically to some degree but they all feature three issues. First, these are not principled approaches with respect to the actual data sampling deficiencies. Second, the data augmentation methods have secondary effects which seem ignored but which are potentially detrimental. Finally, they are not targeted at a domain’s performance specification, they simply seek to improve generalization in some neutral manner. Consider the first of these issues.
Any data set that purports to adequately represent some domain in order for a learning system to extract the relevant statistical regularities important for that domain must actually include those features in a statistically discoverable manner. All of the methods increase the size of the training data via manipulations of the existing data in the belief that the enhanced data set better samples the domain. It is puzzling that these are not based on an analysis of the data set in order to discover exactly what might be missing, that is, what features are not statistically well represented.
Consider next the issue of secondary effects. The augmented image set may help generalization, but the added samples have the potential secondary effect of changing the balance of representation for other features. Take a simple example. In order to increase rotation invariance as a performance criterion, an existing image set might be augmented by rotated versions of those images (see study by Engstrom et al. [69], where this is also discussed). The artificially rotated images, however, change the distribution of lighting source directions across the full set of images. This may not be relevant for many applications, and of course humans have little difficulty with this in general (but are not immune to it; e.g., ‘hollow face illusion’ [96]). But for those applications where the learned system is part of an agent that needs to make inferences about lighting direction (to interpret shadows or depth cues from surface shading, etc.), this has the potential of improving performance in one variable while damaging that of another as a secondary effect. The same would be true for augmentation via symmetry manipulations; again the imaging geometry changes. Another manipulation involves simple spatial shifts in order to enhance position invariance or image padding to ameliorate boundary effects. This changes the context within which target objects are found, and as seen in work by Rosenfeld et al. [71], context is very important and can have quite detrimental impact on accuracy of classification. Learning methods learn not only the required target objects but also the context within which they are found. Much more discussion on such effects appears in the survey by Shorten & Khoshgoftaar [1].
These secondary effects will have a ripple effect across the full set of features a domain requires. Each manipulation, although targeted at some specific feature, necessarily changes the sampling of other feature dimensions. Without targeting direct knowledge of the training set’s sampling deficiencies, the third issue mentioned above, data augmentation may improve global performance in an indirect manner and not necessarily performance along a particular feature dimension of importance. We should also point out that these image manipulation methods rarely go outside the 2D image plane; objects in the real world are three-dimensional, and none of these methods can possibly augment in the third dimension, let alone account for how 3D imaging geometry (light sources, camera viewpoint and settings, object position relative to illumination and camera, ambient or reflected illumination, or surface properties) affects appearance (although see the paper by Goodfellow et al. [70], where the assumptions of sufficient view data and object surface continuity seem to permit new views, but not with accompanying illumination changes).

6.5. Bias in Datasets

In Section 1.2, the experimental biases described by Sackett [48] were very briefly discussed, ending with the suggestion that these, with suitable domain translations, might be as valid for computer vision systems (as well as AI in general) as for medical data. Certainly the potential for bias is very real. Consider the number of possible training sets in the visual domain. Recall the estimate of the number of discernible images presented earlier, 1.5 × 10^25. For the sake of argument, suppose that the number of samples used for training is 10^7 (the scale ImageNet represents, as mentioned earlier). How many subsets of 10^7 images are possible? This is given by the binomial coefficient C(1.5 × 10^25, 10^7) = (1.5 × 10^25)! / (10^7! (1.5 × 10^25 − 10^7)!), not easy to evaluate, but certainly far too large for anyone to be certain which of the possible training sets are free of bias. Although it might be easy to search in a brute-force manner for a training set that yields incrementally better performance numbers than others, this is not the point; bias in that training set remains unaddressed. Debate over Pavlidis’ estimate will not really lead to a better result, because even if that estimate is reduced to 10^10, the number of possible subsets remains insanely large. (For amusement, try any of the online permutation/combination calculators (e.g., https://www.calculator.net/permutation-and-combination-calculator.html (accessed on 30 April 2022)) and ask how many combinations without replacement there are of 1000 elements out of 10,000; the count has 1410 digits!)
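These counts are easy to verify without any special calculator. The following standalone sketch (not part of our experiments) uses the log-gamma function, and, for the astronomically large case, a Stirling approximation, since directly differencing log-gamma values at the scale of 10^25 loses all precision.

```python
# Quick numerical check of the combinatorial claims above.
from math import lgamma, log

LN10 = log(10)

def log10_binomial(n: int, k: int) -> float:
    """log10 of C(n, k) via log-gamma; exact to double precision for
    moderate n, so the huge factorials never need to be formed."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / LN10

def log10_binomial_small_k(n: float, k: float) -> float:
    """Stirling approximation of log10 C(n, k), valid when k << n; avoids
    the catastrophic cancellation lgamma suffers when n is ~10^25."""
    return (k * log(n / k) + k) / LN10

# "1000 out of 10,000": the count should have about 1410 digits.
print(f"C(10000, 1000) has {int(log10_binomial(10_000, 1_000)) + 1} digits")

# Number of possible 10^7-image training sets drawn from ~1.5 x 10^25
# discernible images (the estimate used in the text).
print(f"log10 of C(1.5e25, 1e7) is roughly {log10_binomial_small_k(1.5e25, 1e7):.3g}")
```

Run as-is, the first line should reproduce the 1410-digit figure quoted in the footnote; the second shows that the count of possible training sets at the scale discussed here has on the order of a hundred million digits.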
Even so, many shrug off such combinatorial arguments as not relevant. After all, humans are able to understand any image they see on the internet, and many feel that machines should also have this ability. Such a casual conclusion arises from a lack of knowledge; it ignores the fact that humans do this at different speeds and with different performance levels (see Carroll’s [97] comprehensive presentation of human visuospatial abilities and Itti et al.’s [98] encyclopedic treatment of the breadth of characteristics seen in human and primate attentional behaviour, among others). Reaching such a casual conclusion is entirely unjustified, and if it forms a pillar of the empirical method for the field, there is an important need for re-evaluation. Certainly, the phrase ‘sees like humans’, which is very frequently attached to descriptions of modern computer vision research, needs to be reconsidered. This is especially true if the computer system outperforms humans. To see like a human, from an external observer’s point of view, should mean that the accuracy is similar, that the time course to a decision is similar, that the number and kinds of errors made are similar, and that any additional external behaviour (eye movements, for example) is similar. In general, these are not reported, and it is unlikely that they hold for modern systems. It is a great goal to build a machine that can ‘see like humans’, but one must be clear about what this actually entails. By eye alone, humans may not match engineering performance specifications such as those described here.
It seems infeasible to attempt an empirical examination of all potential sources and impacts of bias when developing a training and test set of data. However, as we have shown, a procedure like our thought experiment might be a way to begin addressing the issue. Tied to a domain-specific performance target, the reasoning described here can be extended in a straightforward manner to determine the impact of many types of bias. This might be time-consuming and tedious, but it is tractable. Rasouli & Tsotsos [99] have shown that it is possible to develop a substantial (not necessarily complete, but very useful nonetheless) set of factors relevant to how drivers deal with pedestrians. A similar analysis could be conducted for other domains.
It may be useful to require that the creation of any training or test set be accompanied by a Bias Breakdown, that is, an analysis of the spectrum of possible biases that is explicit about which biases are relevant and which are not, which can be dealt with by good sample choices and which cannot, and what impact the remaining biases might have on final system performance.
This Bias Breakdown can be organized easily. One suggestion for a template is the following: for each source of bias (the 56 biases that Sackett describes might be too many, but they are a good starting point), one might complete Table 4 below.
In our thought experiment, as already described, we focussed on selection bias and, within Sackett’s sub-classes, missing data bias fits best. Completing this single dimension of the template yields Table 5 below.
This should be considered a starting point, and experience will determine how such a Bias Breakdown may evolve. Note that it is important to know, for each application, what the acceptable performance requirements are and to tie the goal of generalization to them. For safety-critical applications, the need for clear and as-complete-as-feasible documentation of biases in training data and their impacts (or of biases in any other aspect of system development, in the manner Sackett describes bias across the full research enterprise) cannot be over-emphasized.
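To indicate how lightweight such documentation could be in practice, the following is a minimal sketch of one Bias Breakdown entry in machine-readable form; the class and field names simply mirror the Table 4 template and are illustrative, not an established schema.

```python
# Sketch of a Bias Breakdown entry (one completed instance of Table 4) that
# could be serialized and shipped alongside a dataset.
from dataclasses import dataclass, asdict
import json

@dataclass
class BiasBreakdownEntry:
    bias_class: str
    dimensions_evaluated: list[str]
    interpretation_in_domain: str
    not_applicable_by_design: list[str]
    evaluated_impact: str
    dimensions_not_evaluated: list[str]
    estimated_impact_of_unevaluated: str

# The entry corresponding to Table 5 (selection bias, missing-data sub-class).
missing_data_entry = BiasBreakdownEntry(
    bias_class="missing data",
    dimensions_evaluated=["size", "shape", "orientation", "colour", "3D pose"],
    interpretation_in_domain=("no training samples represent a particular "
                              "value of one or more of these dimensions"),
    not_applicable_by_design=["imaging geometry", "lighting", "occlusion"],
    evaluated_impact="see Table 1",
    dimensions_not_evaluated=["conjunctions of the evaluated dimensions",
                              "low sample size along some dimension"],
    estimated_impact_of_unevaluated="less predictable; see Table 5",
)

print(json.dumps(asdict(missing_data_entry), indent=2))
```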

7. Conclusions

Our thought experiment was motivated by the desire to understand the nature of generalization in learned computer vision systems. At some basic level, this understanding should not depend on the exact characteristics of the system architecture or learning method; we sought to understand generalization at a more fundamental level. Specifically, we considered the impact of selection bias on generalization, because in vision a training set is a very tiny subset of the full population of all possible images. We employed a well-understood methodology to challenge the limits of generalization, Reductio ad Absurdum (proof by contradiction) via Backcasting. Our analysis indeed found contradictions. These were mostly due to physics, to geometry, or to computational intractability, for which no amount of cleverness in an algorithm can compensate. Empirical testing of our hypotheses confirmed the theoretical results.
Sampling bias may be hidden and silent. It is hidden because very few research efforts verify that a domain is appropriately sampled to create the training set, and silent because the errors it causes are seen only if appropriate test cases are used (even though many have tried to raise the point, e.g., Akhtar & Mian [100]). Any learning system is agnostic as to what is missing when trained, whether it be values of attributes or whole attribute dimensions. One needs to be methodical and principled in dataset construction and statistically valid regarding sampling with respect to all the variations of the data, data augmentation, and generalization expectations. Our analysis also found that generalization errors would exceed, by orders of magnitude, the engineering requirements of our toy (yet not so ‘toy’) domain.
Our thought experiment points to three conclusions. First, generalization behaviour depends on how sufficiently the particular dimensions of the domain are represented during training. Second, the utility of any generalization is completely dependent on the acceptable system error. Third, specific visual features of objects, such as pose orientations out of the imaging plane or colours, may not be recoverable if not explicitly and sufficiently represented in a training set. It may be that more principled design of not only training sets but of the whole empirical protocol for learned systems is necessary. Current learned systems and their underlying theory provide no guarantee that a given system will be able to cover all the relevant dimensions, categories, or compositions unless they are included and tested accordingly. It may be that a new unification of classical methods in computer vision and AI with current machine learning methods, with each paying more attention to the variability caused by object pose, viewpoint, and more generally the full imaging and lighting geometry, will yield the best of both. While empirical confirmation may be the best method for confirming generalization effectiveness, it is a time- and resource-expensive activity. Our Thought Experiment Probe approach, coupled with the resulting Bias Breakdown, is far less costly and can be very informative as a first step towards understanding the impact of biases. Further, it can be an important supplementary documentation component for training data intended for systems with critical performance requirements.

Author Contributions

Conceptualization, J.K.T.; methodology, J.K.T. and B.R.; software, B.R.; validation, J.K.T. and B.R.; formal analysis, J.K.T.; investigation, B.R. and J.K.T.; resources, J.K.T. and B.R.; data curation, B.R.; writing—original draft preparation, J.K.T.; writing—review and editing, J.K.T. and B.R.; visualization, J.K.T. and B.R.; supervision, J.K.T.; project administration, J.K.T.; funding acquisition, J.K.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Air Force Office of Scientific Research under award numbers FA9550-18-1-0054 and FA9550-22-1-0538 (Computational Cognition and Machine Intelligence, and Cognitive and Computational Neuroscience portfolios); the Canada Research Chairs Program (Grant Number 950-231659); Natural Sciences and Engineering Research Council of Canada (Grant Numbers RGPIN-2016-05352 and RGPIN-2022-04606).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The LEGO dataset is available at https://data.nvision2.eecs.yorku.ca/LegoDataset/ (accessed on 8 September 2025). All the code for the experiments in this paper is available at https://gitlab.nvision2.eecs.yorku.ca/brouhani/selection_bias (accessed on 8 September 2025).

Acknowledgments

The authors wish to thank Ershad Banijamali, Iuliia Kotseruba, Amir Rasouli, Amir Rosenfeld, Brian Cantwell Smith, Sven Dickinson, Konstantine Tsotsos, and especially Jun Luo for their helpful comments and suggestions on earlier drafts of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shorten, C.; Khoshgoftaar, T. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  2. Xu, H.; Ma, Y.; Liu, H.; Deb, D.; Liu, H.; Tang, J.; Jain, A. Adversarial attacks and defenses in images, graphs and text: A review. Int. J. Autom. Comput. 2020, 17, 151–178. [Google Scholar] [CrossRef]
  3. Bartlett, P.L.; Maass, W. Vapnik-Chervonenkis dimension of neural nets. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 2003; pp. 1188–1192. [Google Scholar]
  4. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv 2016, arXiv:1611.03530. [Google Scholar]
  5. Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836. [Google Scholar]
  6. Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; Srebro, N. Exploring generalization in deep learning. Adv. Neural Inf. Process. Syst. 2017, 30, 5949–5958. [Google Scholar]
  7. Kawaguchi, K.; Kaelbling, L.P.; Bengio, Y. Generalization in deep learning. arXiv 2017, arXiv:1710.05468. [Google Scholar]
  8. Wu, L.; Zhu, Z. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv 2017, arXiv:1706.10239. [Google Scholar] [CrossRef]
  9. Novak, R.; Bahri, Y.; Abolafia, D.A.; Pennington, J.; Sohl-Dickstein, J. Sensitivity and generalization in neural networks: An empirical study. arXiv 2018, arXiv:1802.08760. [Google Scholar] [CrossRef]
  10. Fetaya, E.; Jacobsen, J.H.; Grathwohl, W.; Zemel, R. Understanding the Limitations of Conditional Generative Models. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  11. Wang, H.; Wu, X.; Huang, Z.; Xing, E. High-frequency component helps explain the generalization of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8684–8694. [Google Scholar]
  12. Zhang, C.; Zhang, M.; Zhang, S.; Jin, D.; Zhou, Q.; Cai, Z.; Liu, Z. Delving deep into the generalization of vision transformers under distribution shifts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7277–7286. [Google Scholar]
  13. Wang, M.; Ma, C. Generalization error bounds for deep neural networks trained by sgd. arXiv 2022, arXiv:2206.03299. [Google Scholar]
  14. Galanti, T.; Xu, M.; Galanti, L.; Poggio, T. Norm-based generalization bounds for sparse neural networks. Adv. Neural Inf. Process. Syst. 2023, 36, 42482–42501. [Google Scholar]
  15. Zhao, C.; Sigaud, O.; Stulp, F.; Hospedales, T. Investigating generalisation in continuous deep reinforcement learning. arXiv 2019, arXiv:1902.07015. [Google Scholar] [CrossRef]
  16. Barbu, A.; Mayo, D.; Alverio, J.; Luo, W.; Wang, C.; Gutfreund, D.; Tenenbaum, J.; Katz, B. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Adv. Neural Inf. Process. Syst. 2019, 32, 9453–9463. [Google Scholar]
  17. Afifi, M.; Brown, M.S. What else can fool deep learning? Addressing color constancy errors on deep neural network performance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 243–252. [Google Scholar]
  18. Senhaji, A.; Raitoharju, J.; Gabbouj, M.; Iosifidis, A. Not all domains are equally complex: Adaptive Multi-Domain Learning. arXiv 2020, arXiv:2003.11504. [Google Scholar] [CrossRef]
  19. Bengio, Y.; Gingras, F. Recurrent neural networks for missing or asynchronous data. Adv. Neural Inf. Process. Syst. 1995, 8, 395–401. [Google Scholar]
  20. Van Buuren, S. Flexible Imputation of Missing Data; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  21. Lin, W.C.; Tsai, C.F. Missing value imputation: A review and analysis of the literature (2006–2017). Artif. Intell. Rev. 2020, 53, 1487–1509. [Google Scholar] [CrossRef]
  22. Śmieja, M.; Struski, L.; Tabor, J.; Zieliński, B.; Spurek, P. Processing of missing data by neural networks. Adv. Neural Inf. Process. Syst. 2018, 31, 2719–2729. [Google Scholar]
  23. Chai, X.; Gu, H.; Li, F.; Duan, H.; Hu, X.; Lin, K. Deep learning for irregularly and regularly missing data reconstruction. Sci. Rep. 2020, 10, 3302. [Google Scholar] [CrossRef]
  24. Zhou, K.; Yang, Y.; Qiao, Y.; Xiang, T. MixStyle Neural Networks for Domain Generalization and Adaptation. Int. J. Comput. Vis. 2024, 132, 822–836. [Google Scholar] [CrossRef]
  25. García-Laencina, P.J.; Sancho-Gómez, J.L.; Figueiras-Vidal, A.R. Pattern classification with missing data: A review. Neural Comput. Appl. 2010, 19, 263–282. [Google Scholar] [CrossRef]
  26. Polikar, R.; DePasquale, J.; Mohammed, H.S.; Brown, G.; Kuncheva, L.I. Learn++. MF: A random subspace approach for the missing feature problem. Pattern Recognit. 2010, 43, 3817–3832. [Google Scholar] [CrossRef]
  27. Fridovich-Keil, S.; Bartoldson, B.; Diffenderfer, J.; Kailkhura, B.; Bremer, T. Models out of line: A fourier lens on distribution shift robustness. Adv. Neural Inf. Process. Syst. 2022, 35, 11175–11188. [Google Scholar]
  28. Simsek, B.; Hall, M.; Sagun, L. Understanding out-of-distribution accuracies through quantifying difficulty of test samples. arXiv 2022, arXiv:2203.15100. [Google Scholar] [CrossRef]
  29. Hendrycks, D.; Basart, S.; Mu, N.; Kadavath, S.; Wang, F.; Dorundo, E.; Desai, R.; Zhu, T.; Parajuli, S.; Guo, M.; et al. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; IEEE Computer Society: Washington, DC, USA, 2021; pp. 8320–8329. [Google Scholar]
  30. Hendrycks, D.; Zhao, K.; Basart, S.; Steinhardt, J.; Song, D. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15262–15271. [Google Scholar]
  31. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  32. Ridnik, T.; Ben-Baruch, E.; Noy, A.; Zelnik-Manor, L. Imagenet-21k pretraining for the masses. arXiv 2021, arXiv:2104.10972. [Google Scholar]
  33. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  35. Alcorn, M.A.; Li, Q.; Gong, Z.; Wang, C.; Mai, L.; Ku, W.S.; Nguyen, A. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4845–4854. [Google Scholar]
  36. Zong, Y.; Yang, Y.; Hospedales, T. MEDFAIR: Benchmarking fairness for medical imaging. arXiv 2022, arXiv:2210.01725. [Google Scholar]
  37. Larrazabal, A.J.; Nieto, N.; Peterson, V.; Milone, D.H.; Ferrante, E. Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc. Natl. Acad. Sci. USA 2020, 117, 12592–12594. [Google Scholar] [CrossRef]
  38. Drenkow, N.; Pavlak, M.; Harrigian, K.; Zirikly, A.; Subbaswamy, A.; Farhangi, M.M.; Petrick, N.; Unberath, M. Detecting dataset bias in medical AI: A generalized and modality-agnostic auditing framework. arXiv 2025, arXiv:2503.09969. [Google Scholar] [CrossRef]
  39. Chen, I.; Johansson, F.D.; Sontag, D. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
  40. Seyyed-Kalantari, L.; Liu, G.; McDermott, M.; Chen, I.Y.; Ghassemi, M. CheXclusion: Fairness gaps in deep chest X-ray classifiers. In Biocomputing 2021: Proceedings of the Pacific Symposium; World Scientific: Singapore, 2020; pp. 232–243. [Google Scholar]
  41. Li, X.; Chen, Z.; Zhang, J.M.; Sarro, F.; Zhang, Y.; Liu, X. Bias behind the wheel: Fairness testing of autonomous driving systems. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–24. [Google Scholar] [CrossRef]
  42. Fernández Llorca, D.; Frau, P.; Parra, I.; Izquierdo, R.; Gómez, E. Attribute annotation and bias evaluation in visual datasets for autonomous driving. J. Big Data 2024, 11, 137. [Google Scholar] [CrossRef]
  43. Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the Conference on Fairness, Accountability and Transparency, New York, NY, USA, 23–24 February 2018; pp. 77–91. [Google Scholar]
  44. Kortylewski, A.; Egger, B.; Schneider, A.; Gerig, T.; Morel-Forster, A.; Vetter, T. Empirically analyzing the effect of dataset biases on deep face recognition systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2093–2102. [Google Scholar]
  45. Kortylewski, A.; Egger, B.; Schneider, A.; Gerig, T.; Morel-Forster, A.; Vetter, T. Analyzing and reducing the damage of dataset bias to face recognition with synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 2261–2268. [Google Scholar]
  46. Balding, D.J. A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 2006, 7, 781–791. [Google Scholar] [CrossRef] [PubMed]
  47. Gordis, L. Epidemiology; Elsevier Health Sciences: Amsterdam, The Netherlands, 2014. [Google Scholar]
  48. Sackett, D. Bias in analytic research. J. Chronic Dis. 1979, 32, 51–63. [Google Scholar] [CrossRef] [PubMed]
  49. Grimes, D.A.; Schulz, K.F. Bias and causal associations in observational research. Lancet 2002, 359, 248–252. [Google Scholar] [CrossRef]
  50. Anderson, C. The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 23 June 2008. [Google Scholar]
  51. Tsotsos, J.; Kotseruba, I.; Andreopoulos, A.; Wu, Y. A Possible Reason for why Data-Driven Beats Theory-Driven Computer Vision. arXiv 2019, arXiv:1908.10933. [Google Scholar]
  52. Marr, D. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information; MIT Press: Cambridge, MA, USA, 1982. [Google Scholar]
  53. Pavlidis, T. The number of all possible meaningful or discernible pictures. Pattern Recognit. Lett. 2009, 30, 1413–1415. [Google Scholar] [CrossRef]
  54. Andreopoulos, A.; Tsotsos, J.K. On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 110–126. [Google Scholar] [CrossRef] [PubMed]
  55. Wu, Y.; Tsotsos, J. Active control of camera parameters for object detection algorithms. arXiv 2017, arXiv:1705.05685. [Google Scholar] [CrossRef]
  56. Tsotsos, J. The Complexity of Perceptual Search Tasks. In Proceedings of the International Joint Conference on Artificial Intelligence, Detroit, MI, USA, 20–25 August 1989; pp. 1571–1577. [Google Scholar]
  57. Ahmad, S.; Tresp, V. Some solutions to the missing feature problem in vision. Adv. Neural Inf. Process. Syst. 1992, 5, 393–400. [Google Scholar]
  58. Boen, J. BrickGun Storage Labels. Available online: https://brickgun.com/Labels/BrickGun_Storage_Labels.html (accessed on 30 April 2022).
  59. Diaz, J. Everything You Always Wanted to Know About LEGO®. 2009. Available online: https://gizmodo.com/everything-you-always-wanted-to-know-about-LEGO-5019797 (accessed on 23 April 2025).
  60. Robinson, J.B. Futures under glass: A recipe for people who hate to predict. Futures 1990, 22, 820–842. [Google Scholar] [CrossRef]
  61. Xu, M.; Bai, Y.; Ghanem, B.; Liu, B.; Gao, Y.; Guo, N.; Olmschenk, G. Missing Labels in Object Detection. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  62. Koenderink, J.J. Color for the Sciences; MIT Press: Cambridge, MA, USA, 2010. [Google Scholar]
  63. Rumelhart, D.E.; McClelland, J.L.; Hinton, G.E.; PDP Research Group. Distributed representations. In Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations; MIT Press: Cambridge, MA, USA, 1986; Chapter 3. [Google Scholar]
  64. Rasouli, A.; Tsotsos, J.K. The effect of color space selection on detectability and discriminability of colored objects. arXiv 2017, arXiv:1702.05421. [Google Scholar] [CrossRef]
  65. Cutler, B. Aspect Graphs Under Different Viewing Models. 2003. Available online: http://people.csail.mit.edu/bmcutler/6.838/project/aspect_graph.html#22 (accessed on 30 April 2022).
  66. Plantinga, H.; Dyer, C.R. Visibility, occlusion, and the aspect graph. Int. J. Comput. Vis. 1990, 5, 137–160. [Google Scholar] [CrossRef]
  67. Rosenfeld, A. Recognizing unexpected objects: A proposed approach. Int. J. Pattern Recognit. Artif. Intell. 1987, 1, 71–84. [Google Scholar] [CrossRef]
  68. Dickinson, S.J.; Pentland, A.; Rosenfeld, A. 3-D shape recovery using distributed aspect matching. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 174–198. [Google Scholar] [CrossRef]
  69. Engstrom, L.; Tran, B.; Tsipras, D.; Schmidt, L.; Madry, A. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv 2017, arXiv:1712.02779. [Google Scholar]
  70. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  71. Rosenfeld, A.; Zemel, R.; Tsotsos, J.K. The elephant in the room. arXiv 2018, arXiv:1808.03305. [Google Scholar] [CrossRef]
  72. Clowes, M.B. On seeing things. Artif. Intell. 1971, 2, 79–116. [Google Scholar] [CrossRef]
  73. Waltz, D.L. Understanding line drawings of scenes with shadows. In The Psychology of Computer Vision; Winston, P.H., Horn, B.K.P., Eds.; McGraw–Hill: New York, NY, USA, 1975; pp. 19–91. [Google Scholar]
  74. Kirousis, L.M.; Papadimitriou, C.H. The complexity of recognizing polyhedral scenes. J. Comput. Syst. Sci. 1988, 37, 14–38. [Google Scholar] [CrossRef]
  75. Parodi, P. The Complexity of Understanding Line Drawings of Origami Scenes. Int. J. Comput. Vis. 1996, 18, 139–170. [Google Scholar] [CrossRef]
  76. Sugihara, K. Mathematical structures of line drawings of polyhedrons. IEEE Trans. Pattern Anal. Mach. Intell. 1982, 4, 458–469. [Google Scholar] [CrossRef]
  77. Malik, J. Interpreting line drawings of curved objects. Int. J. Comput. Vis. 1987, 1, 73–103. [Google Scholar] [CrossRef]
  78. LDraw.org. LDraw (Version LDraw.org Parts Update 2023-05). 2023. Available online: https://www.ldraw.org (accessed on 20 December 2023).
  79. Blender Studio. Blender. Version 2.79. Available online: https://www.blender.org/ (accessed on 20 December 2023).
  80. Nelson, T. ImportLDraw: A plugin for the LDraw library in Blender, Version 1.1.10. 2019. Available online: https://github.com/TobyLobster/ImportLDraw (accessed on 17 September 2025).
  81. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  82. Jocher, G.; Chaurasia, A.; Qiu, J.; Ultralytics YOLO Version 8. Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 October 2023).
  83. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  84. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  85. Gavrilescu, R.; Zet, C.; Foșalău, C.; Skoczylas, M.; Cotovanu, D. Faster R-CNN: An approach to real-time object detection. In Proceedings of the 2018 International Conference and Exposition on Electrical and Power Engineering (EPE), Iasi, Romania, 18–19 October 2018; pp. 0165–0168. [Google Scholar]
  86. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  87. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  88. Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model. In Proceedings of the the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  89. Faugeras, O. Three-Dimensional Computer Vision: A Geometric Viewpoint; MIT Press: Cambridge, MA, USA, 1993. [Google Scholar]
  90. Horn, B. Robot Vision; MIT Press: Cambridge, MA, USA, 1986. [Google Scholar]
  91. Bajcsy, R.; Aloimonos, Y.; Tsotsos, J.K. Revisiting active perception. Auton. Robot. 2018, 42, 177–196. [Google Scholar] [CrossRef]
  92. Kratsios, A.; Bilokopytov, I. Non-euclidean universal approximation. Adv. Neural Inf. Process. Syst. 2020, 33, 10635–10646. [Google Scholar]
  93. van Dyk, D.; Meng, X.L. The Art of Data Augmentation. J. Comput. Graph. Stat. 2001, 10, 1–50. [Google Scholar] [CrossRef]
  94. Kukačka, J.; Golkov, V.; Cremers, D. Regularization for deep learning: A taxonomy. arXiv 2017, arXiv:1710.10686. [Google Scholar] [CrossRef]
  95. Hernández-García, A.; König, P. Data augmentation instead of explicit regularization. arXiv 2018, arXiv:1806.03852. [Google Scholar]
  96. Hill, H.; Bruce, V. Independent effects of lighting, orientation, and stereopsis on the hollow-face illusion. Perception 1993, 22, 887–897. [Google Scholar] [CrossRef] [PubMed]
  97. Carroll, J.B. Human Cognitive Abilities: A Survey of Factor-Analytic Studies; Number 1; Cambridge University Press: Cambridge, UK, 1993. [Google Scholar]
  98. Itti, L.; Rees, G.; Tsotsos, J.K. Neurobiology of Attention; Elsevier: Amsterdam, The Netherlands, 2005. [Google Scholar]
  99. Rasouli, A.; Tsotsos, J.K. Autonomous vehicles that interact with pedestrians: A survey of theory and practice. IEEE Trans. Intell. Transp. Syst. 2019, 21, 900–918. [Google Scholar] [CrossRef]
  100. Akhtar, N.; Mian, A. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access 2018, 6, 14410–14430. [Google Scholar] [CrossRef]
Figure 1. Representation of Backcasting (from https://en.wikipedia.org/wiki/Backcasting (accessed on 30 April 2022)).
Figure 2. Example of bricks for Section 4.2. Panels (a–d) have 4, 1, 2 and 3 studs, respectively. Panels (e–h) showcase different possible shapes and sizes.
Figure 3. Example of a typical LEGO® element—an elongated triangle block—within an aspect representation. The eye is in an approximate position from where the view of the block would be only a rectangle. The aspect seen is shown with a blue outline; the view from the eye is shown with red dotted lines.
Figure 4. Examples of bricks for Section 4.4. Panels (a,b) demonstrate bricks with unique characteristics. Panels (c–f) show examples of objects for which degenerate views are possible.
Figure 5. First set of examples of bricks for Section 4.5. Panels (a–d) are considered plates while panels (e–h) are considered slopes.
Figure 6. Second set of examples of bricks for Section 4.5. Panels (a–e) depict bricks of different shapes, including slopes (e) and plates (a,b).
Figure 7. (a) Some of the rendered LEGO® brick models; (b) one of the brick models in 6 different colours.
Figure 8. (a) Different orientations of the brick by rotation about the z axis (top) and x axis (bottom). (b) Some of the different sizes in the size-specific dataset.
Figure 9. (a) When an orientation is chosen to be excluded from the training set, the four neighbouring orientations are also included in the excluded set. (b) The different possible poses, shown for two different bricks.
Figure 10. Performance of networks trained with biased data. The Y-axis shows the performance on the control set, where the bias is similar to that of the training set, and the X-axis shows the performance on the test set, where the bias is not present. If the generalization of a network trained with biased data could cover those biases, each of the coloured shapes here would lie on the identity line (dashed line).
Table 1. Summary of training bias analysis.
Training Set Bias   | Learned Network                 | Generalization Error
Unbiased            | A                               | G_A < 0.000018
Colour Biased       | A_c, c ∈ {red, blue, yellow}    | G_{A_c} < 0.166685
                    | A_c, c ∈ {green, black, white}  | G_{A_c} < 0.000018
Size Biased         | A_z                             | G_{A_z} < 0.018602
Orientation Biased  | A_o                             | G_{A_o} < 0.000018
3D Pose Biased      | A_p                             | G_{A_p} < 0.120018
Shape Biased        | A_s                             | G_{A_s} < 0.006488
Table 2. The hexadecimal colour codes used to render the models; the colour white, with the hexadecimal code ffffff, is also included in the dataset but absent from the table.
ff36ff  3636ff  36ffff  36ff36  ffff36
ff8082  ff80ff  8080ff  80ffff  80ff80
ffff80  800002  800080  000080  008080
008000  808000  ff3333  808080  000000
Table 3. Performance of networks trained with biased data. For each network and each attribute, we compare performance on the test, train and control data. If generalization of a network trained with biased data could cover those biases, the test and control performances should be close in value.
             |      VGG 16       |      YOLOv8       |        ViT
             | Test  Train  Ctrl | Test  Train  Ctrl | Test  Train  Ctrl
Size         |  97    97     97  |  78    90     92  |  55    85     92
Shape        |  68    80     80  |  63    77     77  |  77    87     87
Orientation  |  79    73     73  |  80    69     70  |  97    96     92
Pose         |   8    90     90  |   4    64     64  |   3    89     89
Colour       |  33    84     82  |  20    63     62  |   2    59     58
Table 4. Suggested Bias Breakdown template for proposed datasets.
Bias Class
Bias dimensions evaluated
Interpretation within current domain
Bias not applicable by design
Evaluated impact on performance requirements
Bias dimensions not evaluated
Estimated impact of non-evaluated biases
Table 5. Table 4 completed for our experiment, with the missing data bias.
Bias Class | missing data
Bias dimensions evaluated | size, shape, orientation, colour, 3D pose
Interpretation within current domain | there are no training samples representing a particular value of one or more of these dimensions
Bias not applicable by design | the following are assumed constant across all training/test samples: imaging geometry, lighting, occlusion
Evaluated impact on performance requirements | see Table 1 above
Bias dimensions not evaluated | conjunctions of size, shape, orientation, colour, 3D pose; low sample size of some dimension
Estimated impact of non-evaluated biases | less predictable:
  • No samples of multiple dimensions likely will increase error;
  • Low sample sizes of a single dimension may lead to unpredictably less error;
  • Low sample sizes of multiple dimensions are likely to increase error above that for the lowest error on a single dimension with zero samples.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
