Deep Regression Neural Networks for Proportion Judgment

Abstract: Deep regression models are widely employed to solve computer vision tasks, such as human age or pose estimation, crowd counting, and object detection. Another possible area of application, which to our knowledge has not been systematically explored so far, is proportion judgment. As a prerequisite for successful decision making, individuals often have to use proportion judgment strategies, with which they estimate the magnitude of one stimulus relative to another (larger) stimulus. This makes this estimation problem interesting for the application of machine learning techniques. In this regard, we propose various deep regression architectures, which we test on three original datasets of very different origin and composition. This is a novel approach, as the assumption is that the model can learn the concept of proportion without explicitly counting individual objects. With comprehensive experiments, we demonstrate the effectiveness of the proposed models, which can predict proportions on real-life datasets more reliably than human experts, considering the coefficient of determination (>0.95) and the magnitude of errors (MAE < 2, RMSE < 3). If there are no significant errors in determining the ground truth, an additional reduction of MAE to 0.14 can be achieved with an appropriately sized training dataset. The datasets used will be made publicly available to serve as reference data sources in similar projects.


Introduction
People have the ability to distinguish between non-symbolic numerical magnitudes without counting, which is derived from the approximate number system (ANS) [1]. At the same time, various tasks require people to estimate ratios and proportions, comparing the magnitudes of two quantities [2]. Determining the proportion of open flowers to all the flower buds and flowers on the plant, estimating the ratio between a marked area and the total area of an image, and judging the share of a certain object in relation to the total number of objects on an image all serve as everyday examples. In this paper, we focus on the proportion judgment, where, by definition [2], an observer estimates the magnitude of one stimulus relative to another, larger stimulus, and based on that responds with a value between 0 and 1 (or between 0% and 100%). Proportion judgment can be seen as a special case of ratio judgment, where the observer estimates the ratio of two stimulus magnitudes. It should also be mentioned that, in everyday life, the terms proportion, ratio, fraction, or percentage are used interchangeably.
Although proportion estimation can be an important prerequisite for decision-making (for example, plant protection with chemical or biological products, within optimal timelines), numerous studies have shown that bias is systematically present in the assessment.
Several authors have shown that small proportions are usually overestimated, and large proportions underestimated [3,4]. Notably, there are fewer studies that discuss the reverse pattern, i.e., underestimation of small proportions and overestimation of large proportions. Therefore, the problem of proportion estimation represents a suitable area for the application of artificial intelligence (AI) techniques.
In computer vision, regression techniques can be applied in many fields, such as crowd counting, pose estimation, facial landmark detection, age and demographic analysis estimation, or image registration [5]. Nowadays, the convolutional neural network (CNN) is considered to be one of the best learning algorithms for understanding image content and has shown exemplary performance in a number of applications [6]. The common mode of implementation is deep regression, i.e., CNN with a (linear) regression top layer.
For example, the authors in [7] proposed deep regression forests for age estimation. In [8], the authors used a ResNet-based deep regression model to learn the optimal repulsive pose for safe collaboration between humans and robots. Deep learning methods modified for regression problems [9] were also applied to estimate gross tonnage, a nonlinear measure of a ship's overall internal volume. Deng et al. [10] used a deep regression framework based on manifold learning for manufacturing quality prediction. In [11], the authors proposed a part-to-target tracker based on a deep regression model. Zhong et al. [12] applied an attention-guided deep regression architecture for cephalometric landmark detection. Single poultry tracking is demonstrated in [13], using a deep regression architecture based on the AlexNet network.
Wang et al. [14] proposed a deep regression framework for automatic pneumonia screening, which jointly learns the multi-channel images and multi-modal information to simulate the clinical pneumonia screening process. In [15], hierarchical deep regression with a network designed for hierarchical semantic feature extraction is used for traffic congestion detection, as an important aspect of vehicular management. The authors proposed in [16] the use of a regression convolutional neural network to find the 3-D position of arbitrarily oriented subjects or anatomy in a canonical space based on slices or volumes of medical images. With these examples, it becomes clear that deep regression algorithms are used in a wide variety of applications in very different domains.
There are many different approaches to this topic, as shown by the examples of applying machine learning or other similar techniques in the field of land cover classification. Such examples include plant communities, crops, and fractional vegetation cover estimation [17][18][19][20][21][22][23]. Other areas of application are very broad; for example, these methods could be applied in geology and rock fraction estimation [24], biology [25], or medicine [26]. One of the few examples of the application of deep learning algorithms is [27], where the authors propose a model for proportion estimation for urban mixed scenes. In the proposed framework, the feature extraction capabilities of deep learning are used to obtain the fully connected layer features, after which a scene-unmixing framework based on nonnegative matrix factorization (NMF) is applied to estimate the mixing ratio.
Even with all these examples, to our knowledge, there is still no systematic analysis of the real possibilities of deep learning algorithms in the general area of proportion judgment. This is precisely the focus of this paper.
The following sections of the manuscript are organized as follows: Section 2 discusses the experimental protocols, as well as the origin, nature, and properties of the datasets used in this paper. Section 3 is devoted to presenting experimental results for each of the base architectures and datasets, before presenting a summary and overall discussion in Section 4. Conclusions are given in Section 5.

Datasets
We performed the experiments using three very different datasets, in order to test the proposed hypothesis in varying environments. Two datasets constitute our original contribution, while the third dataset is our adaptation of a publicly available dataset.
The first is a toy dataset, which consists of artificially generated images showing a random number of triangles and quadrilaterals. The second dataset consists of images showing parts of an olive tree canopy during multiple flowering phenophases. Finally, the third dataset was derived from publicly available aerial images from a number of geographic areas, accompanied by some segmentation results.
The datasets consist of a significantly different number of examples (10,000, 1314, and 18,000, respectively), and they also include images of different sizes (1024 × 1024 px, 256 × 256 px, and 250 × 250 px, respectively). In this way, the performance of individual algorithms in the context of different input data can be compared. For all datasets, we used an 80%-10%-10% train-validation-test split.

Toy Dataset (TOYds)
The artificially generated dataset consists of 10,000 RGB images. Each image is 1024 × 1024 pixels in size and contains a random number of triangles and quadrilaterals of different colors and sizes. All the quadrilaterals are convex, with interior angles that each measure less than 150 degrees. For each image, the share of triangles in the total number of objects was calculated, which represents the ground truth during the experiment. Figure 1 shows a sample image, as well as the distribution of the share (percentage) of triangles across all images in the dataset. Images contain between 8 and 62 objects (triangles or quadrilaterals), of which between 0 and 60 are triangles. In other words, the share of triangles ranges between 0% and 100%. To check the impact that the size of the dataset has on the performance of the selected models, an additional dataset with 25,000 examples was generated on the same principles (TOY*ds).
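As a minimal sketch of how such a label is obtained during image generation (the actual drawing code is not shown here; the function names are illustrative), the ground truth is simply the share of triangles among all objects:

```python
import random

def triangle_share(n_triangles, n_quadrilaterals):
    """Ground-truth label for a toy image: share of triangles (in %)."""
    return 100.0 * n_triangles / (n_triangles + n_quadrilaterals)

def sample_label(rng):
    """Draw object counts within the dataset's ranges and return the label."""
    n_total = rng.randint(8, 62)               # total objects per image
    n_tri = rng.randint(0, min(n_total, 60))   # triangles per image
    return triangle_share(n_tri, n_total - n_tri)

rng = random.Random(42)
labels = [sample_label(rng) for _ in range(1000)]
```

By construction, every label falls in the valid proportion range of 0% to 100%.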

Olive Flowering Phenophases Dataset (OFPds)
Images in this dataset show olive canopies during different stages of flowering. We derived a total of 1314 images, 256 × 256 pixels in size, from a dataset we collected in an olive grove in southern Croatia [28]. It should be emphasized that open flowers may vary visually to a great degree, and that they are sometimes not easy to spot, depending on the angle and distance of the camera, the lighting, objects obstructing the view of the flowers, and other conditions. As had been expected, the original dataset with 1000 images proved to be insufficient to successfully carry out the learning phase. This is why we used data augmentation to generate new artificial learning examples: we randomly applied various geometric image distortions, such as translation, rotation, and zooming, as well as color modifications. This resulted in an increase in the size of the training dataset to 8509 images.

Aerial Image Labeling Dataset (AILds)
This dataset is derived from the Inria aerial image labeling dataset, used in [29]. Original images are 5000 × 5000 pixels in size. The authors divided the semantic classes of this dataset into "building" and "not building". To achieve their goal, the authors had to extract building footprints from the cadaster, which resulted in a semantic segmentation mask for each image.

During the preprocessing phase, the original image and its corresponding segmentation mask were divided into smaller subimages (250 × 250 pixels). In the next step, we used the black-and-white mask to calculate the percentage of the image area occupied by the buildings. In the end, a total of 18,000 images were available for the experiment.
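The per-tile label computation can be sketched as follows (a simplified version assuming the mask is already loaded as a binary NumPy array; the function and variable names are ours):

```python
import numpy as np

def building_percentages(mask, tile=250):
    """Split a binary segmentation mask (1 = building) into tile x tile
    subimages and return the percentage of building pixels in each tile."""
    h, w = mask.shape
    h, w = h - h % tile, w - w % tile          # drop any ragged border
    tiles = mask[:h, :w].reshape(h // tile, tile, w // tile, tile)
    return tiles.mean(axis=(1, 3)) * 100.0     # one percentage per subimage
```

For a 5000 × 5000 mask this yields a 20 × 20 grid of labels, i.e., 400 labeled subimages per original image.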
An example of an image from the dataset is shown in Figure 4, as well as the corresponding segmentation mask. The distribution of the percentages of the image area occupied by the buildings is shown in Figure 5.

Methodology and Architectures
To prove the hypothesis that CNNs can successfully learn and interpret the concept of proportions for a very wide range of datasets, we tested a number of diverse architectures:
• General-purpose networks (e.g., VGG-19, Xception, InceptionResnetV2, etc.) modified for regression tasks;
• General-purpose networks in transfer learning mode, modified for regression tasks;
• Hybrid architectures (the CNN works as a trainable feature extractor, while a machine learning algorithm, e.g., SVR, performs as a regressor);
• Deep ensemble models for regression.

Although the majority of the models could be further adapted to the corresponding dataset with minor changes in the architecture, no additional adjustments were made for better comparison possibilities.
CNN training was implemented with the Keras [30] and TensorFlow [31] deep learning frameworks. We used a workstation equipped with an AMD Ryzen Threadripper 3960X CPU and NVIDIA GeForce RTX 3090 with 24 GB memory, with Linux Ubuntu 20.04 as the used OS.
Early stopping and model checkpointing were used as callback functions. Early stopping interrupts the training process if there is no improvement in the validation loss after a defined number of epochs (with the early stopping patience set to 15 epochs by default). The model checkpoint saves the best model whenever the validation loss decreases.
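The callback behavior can be illustrated with a small stand-alone sketch of the early stopping logic (the actual implementation uses Keras's EarlyStopping and ModelCheckpoint callbacks; this only mimics their bookkeeping):

```python
def early_stopping(val_losses, patience=15):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch
    validation losses: training stops once the loss has not improved
    for `patience` consecutive epochs; the best epoch is the one a
    model checkpoint would have saved."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:                  # improvement -> checkpoint
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:              # patience exhausted -> stop
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1
```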

Vanilla Deep Regression
In this set of experiments, we compared the vanilla deep regression model (a custom CNN with a regression top layer) with other models that are partly or entirely based on established algorithms. The architecture is based on the VGG-16 and VGG-19 architectures [32], where the size of the applied network is the result of numerous experiments, including a grid search used to find the optimal hyperparameters of the model. The values of the notable hyperparameters are as follows: the number of epochs is 100; the mini-batch size is 32; the learning rate is 0.001; the optimizer is Adam [33]; and the early stopping patience is 15 epochs. For comparison purposes, we later used the same hyperparameter settings for all proposed models. It should be noted that other optimizers were also tested, with the best performance achieved by Adam, alternating with RMSprop [34] in first place for individual datasets and models.
The layer configuration of the vanilla deep regression model is built from the following components: Conv2D(n), a 2D convolution layer with n filters; ACT(), the activation function; BN, a batch normalization layer; MP, a max pooling layer; GAP, a 2D global average pooling layer; FL, a flatten layer; and DN(n), a dense layer with n units. Since a proportion is the comparison of a part to the whole, it can have a value ranging from 0 to 1 (i.e., between 0% and 100%). Therefore, the original ReLU (rectified linear unit) activation function is modified (ReLU_100) as follows: ReLU_100(x) = min(max(0, x), 100).
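The modified activation is a one-liner; this NumPy version (ours, for illustration) clips the standard ReLU at the maximum proportion of 100%:

```python
import numpy as np

def relu_100(x):
    """ReLU clipped at 100, so predictions stay in the valid
    proportion range [0, 100] (%)."""
    return np.clip(x, 0.0, 100.0)
```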

General-Purpose Networks
We tested several general-purpose CNNs, where the fully connected top layer structures were adjusted for regression. Finally, we chose three algorithms, since they can be considered representatives of this group: VGG-19 [32], Xception [35], and InceptionResnetV2 [36]. Several different top layer configurations were tested, and the choice was finally narrowed down to a few candidates; by comparing the models' performance on all three datasets, the first configuration was used in most cases.

General-Purpose Networks in Transfer Learning Mode
We also compared learning from scratch to the application of transfer learning [37], an indispensable tool in situations with insufficient training data. The goal is to try to transfer the knowledge from the source domain to the target domain, reusing the part of the network that was pre-trained in the source domain, e.g., as a weight initialization scheme. ImageNet [38] trained features are the most popular starting point for transfer task fine-tuning. In [39], the authors concluded that there is still no definitive answer to the question "What makes ImageNet good for transfer learning?", but it is obvious that traditional CNN architectures can extract high-quality generic low/middle level features from an ImageNet dataset.
An additional useful feature is that we can freeze a certain part of the network. This is used primarily for preserving the low-level features that are built in the first layers of the network. During the training phase, the transferred weights can remain frozen at their initial values or trained together with the random weights (fine-tuning). As we used transfer learning from a completely different domain (ImageNet), we decided to fine-tune the layers instead of freezing them.
The previously mentioned models have an increased generalization ability across domains. Figure 6 shows that additional hyperparameter tuning can resolve overfitting and underfitting to a significant degree.

General-Purpose Networks in Transfer Learning Mode
We also compared learning from scratch to the application of transfer as an indispensable tool in situations with insufficient training data. The go transfer the knowledge from the source domain to the target domain, reusi the network that was pre-trained in the source domain. e.g., as a weight scheme. ImageNet [38] trained features are the most popular starting poin task fine-tuning. In [39], the authors concluded that there is still no definit the question "What makes ImageNet good for transfer learning?", but it is traditional CNN architectures can extract high-quality generic low/middle from an ImageNet dataset.
An additional useful feature is that we can freeze a certain part of the n is used primarily for preserving the low-level features that are built in the the network. During the training phase, the transferred weights can remain f initial values or trained together with the random weights (fine-tuning). As w fer learning from a completely different domain (ImageNet), we decided to layers instead of freezing them.
The previously mentioned models have an increased generalization abil mains. Figure 6 shows that additional hyperparameter tuning can resolve ov underfitting to a significant degree.

Hybrid Architectures
In the proposed approach, we used the CNN's convolutional layers, with the pretrained ImageNet weights, to extract features which are used to train the machine learning regression algorithm.
In our experiment, we used bottleneck features to train the representatives of regression models, namely support vector regression (SVR) [40] and random forest regressor (RFR) [41]. The concept of the experiment (Xception + SVR variant) is shown in Figure 7.
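A sketch of the hybrid pipeline, using synthetic stand-ins for the bottleneck features (in the real experiment these come from the pretrained CNN's pooled convolutional output; the feature dimension and SVR hyperparameters here are illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))          # CNN bottleneck features
proportions = rng.uniform(0, 100, size=200)    # ground-truth proportions (%)

# The CNN acts as the feature extractor; SVR acts as the regressor.
svr = SVR(kernel="rbf", C=10.0)
svr.fit(features, proportions)
predictions = svr.predict(features)
```

The random forest variant is obtained by swapping SVR for sklearn's RandomForestRegressor with the same fit/predict interface.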

Deep Ensemble Models for Regression
Ensemble methods can improve the predictive performance of a single model by training multiple models and combining their predictions [42]. Deep ensemble learning [43] combines deep learning models and ensemble learning so that the final model has a better generalization performance.
Still, extracting objects and details from images can be challenging due to their highly variable shape, size, color, and texture. To address this, we proposed an ensemble model involving a multichannel CNN. Each channel has its own input layer, which defines a different input image size, focusing on a particular scale. All channels share the standard CNN architecture in transfer mode with the same set of filter parameters. The outputs of the three channels are concatenated and processed by dropout and dense layers (Figure 8). This architecture is expected to extract more robust features, i.e., to have greater resilience against large variations in object size.
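The multichannel idea can be sketched in Keras as follows (a deliberately small stand-in: the channels here are tiny custom CNNs rather than the full transfer-mode backbones, and all input sizes and filter counts are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

def make_channel(size):
    """One channel: its own input size (scale), shared architecture."""
    inp = keras.Input(shape=(size, size, 3))
    x = inp
    for filters in (32, 64):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.GlobalAveragePooling2D()(x)
    return inp, x

# Three scales of the same image feed three identically structured channels.
inputs, features = zip(*(make_channel(s) for s in (128, 192, 256)))
merged = layers.Concatenate()(list(features))
merged = layers.Dropout(0.3)(merged)
merged = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1)(merged)               # regression head (proportion)
model = keras.Model(inputs=list(inputs), outputs=output)
```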


Results
During the experiment, we tested hundreds of markedly different approaches and architectures, and the results we obtained with the selected typical use cases are presented further in this manuscript.
The performance of the predictions was evaluated using the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE) metrics:
R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², RMSE = √((1/N) Σᵢ(yᵢ − ŷᵢ)²), MAE = (1/N) Σᵢ|yᵢ − ŷᵢ|,
where yᵢ is the ground-truth value, ŷᵢ is the predicted value, ȳ is the mean of the ground truth over all samples, and N is the number of testing samples. We can draw the following conclusions based on the experiments conducted with the CNNs listed in Table 1: (1) for successful proportion estimation (MAE < 5), a dataset of at least 10,000 examples is a reasonable minimum; (2) a large training dataset can be created and prepared via data augmentation methods; (3) a vanilla CNN optimized for regression, despite its relatively simple architecture and fewer parameters than the VGG-19 model, can achieve acceptable results; (4) general-purpose networks (VGG-19, Xception, InceptionResnetV2, etc.), modified for regression tasks, perform approximately twice as well if pre-trained ImageNet weights are used. We have shown that it is possible to improve the generalization of all proposed models with the controlled application of batch normalization [44] and dropout [45] techniques, but with caution and experimentation [46]. Table 2 shows the results of hybrid and ensemble models based on the Xception and InceptionResNetV2 architectures in transfer learning mode. Based on these results, we can further extend the abovementioned conclusions as follows: (5) using deep bottleneck features to train a machine learning algorithm (SVR, RFR) does not result in better performance than standard CNNs; (6) hybrid models involving a multichannel CNN (Xception * 3, InceptionResNetV2 * 3) show that ensemble regression methods are effective tools that improve the results and generalization performance of simple deep regression algorithms. However, further assessment is needed to determine whether the improvement in performance, which rarely exceeds 10%, justifies the high number of parameters and the computational cost.
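The three metrics can be computed directly; a NumPy version (ours) matching the definitions above:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, RMSE, and MAE for proportion predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean(residual ** 2))
    mae = np.mean(np.abs(residual))
    return r2, rmse, mae
```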
For example, the basic Xception algorithm (modified for regression, in transfer mode) has 66% fewer parameters than the multichannel Xception * 3 model, and each epoch of its learning phase takes 55% less time.

Summary and Discussion
The application of the proposed models on the olive flowering phenophases dataset shows that CNNs typically perform better than human experts, especially considering that, in the case of estimating thousands of images, this can be a mentally very demanding process.
The results should be also analyzed in the context of the reliability of determining the ground truth [47]. We used manual and automated methods to illustrate the importance of ground truth data design and use.
For the toy dataset, to guarantee the accuracy of the ground truth data, the exact share of triangles (ground truth) was computed during the image generation process. This fact, combined with 25,000 examples in the dataset, results in MAE values between 0.2 and 0.3 for the best models. However, we mentioned that the number of epochs (for all datasets) was limited during the experiment to a value of 100, in order to make the results comparable. With this in mind, after the experiment was finished, we tested some of the best-performing models further. We found that, if the maximum number of epochs is increased to 500 and early stopping patience to 30, an additional reduction of MAE to 0.14 can be achieved.
As already stated, for the olive flowering phenophases dataset, the ground truth is provided by human expert annotators. Generally speaking, flowers are defined as "open" when the reproductive parts are visible between or within unfolded or open flower parts. The application of this definition is not simple in practice, as can be seen in Figure 9.
To give an illustration of how this can be problematic in practice, suppose there are a total of 20 buds and flowers in an image, but the experts cannot agree on the classification of one particular flower. In this case, some of the experts would say that there are 10 open flowers, and others would say there are 11. Therefore, the percentages of open flowers are 50% and 55%, respectively. This means that the difference in the estimates (ground truth) is 5 percentage points, just because of the differing classification of a single flower. Even in this relatively simple example, the 5% difference is noticeably larger than the error percentages of the best-performing models.
Errors for the aerial image labeling dataset were also analyzed in detail ( Figure 10).  To give an illustration of how this can be problematic in practice, we could say that there are a total of 20 buds and flowers in an image, but experts cannot agree on the classification of one particular flower. Thus, in this case, some of the experts would say that there are 10 open flowers, and others would say there are 11. Therefore, the percentages of open flowers are 50% and 55%, respectively. This means that the difference in estimate (ground truth) is 5%, just because of the differing classification of one flower. Even in this relatively simple example, the 5% difference is a noticeably larger percentage than the mistake percentages of the best-performing models.
Errors for the aerial image labeling dataset were also analyzed in detail ( Figure 10). was limited during the experiment to a value of 100, in order to make the results comparable. With this in mind, after the experiment was finished, we tested some of the bestperforming models further. We found that, if the maximum number of epochs is increased to 500 and early stopping patience to 30, an additional reduction of MAE to 0.14 can be achieved.
As already stated, for the olive flowering phenophases dataset, the ground truth is provided by human expert annotators. Generally speaking, flowers are defined as "open" when the reproductive parts are visible between or within unfolded or open flower parts. The application of this definition is not simple in practice, as can be seen in Figure 9. To give an illustration of how this can be problematic in practice, we could say that there are a total of 20 buds and flowers in an image, but experts cannot agree on the classification of one particular flower. Thus, in this case, some of the experts would say that there are 10 open flowers, and others would say there are 11. Therefore, the percentages of open flowers are 50% and 55%, respectively. This means that the difference in estimate (ground truth) is 5%, just because of the differing classification of one flower. Even in this relatively simple example, the 5% difference is a noticeably larger percentage than the mistake percentages of the best-performing models.
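The sensitivity of the ground truth to a single disputed flower can be made concrete with a small illustrative calculation (this is our own sketch, not code from the study):

```python
def open_flower_percentage(open_count, total_count):
    """Percentage of open flowers among all buds and flowers in an image."""
    return 100.0 * open_count / total_count

# Two experts disagree on the classification of one flower out of 20.
p_low = open_flower_percentage(10, 20)   # 50.0
p_high = open_flower_percentage(11, 20)  # 55.0
print(p_high - p_low)  # 5.0 percentage points from a single flower
```

In general, one disputed object out of N shifts the ground truth by 100/N percentage points, so the annotation noise grows as images contain fewer buds and flowers.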
Errors for the aerial image labeling dataset were also analyzed in detail (Figure 10). As previously mentioned, a mask representing building footprints from the cadaster is available for the automated calculation of the percentage of the image area occupied by buildings. It is reasonable to assume that the ground truth defined in this way should be reliable. However, the analysis of the results showed that this approach also has weaknesses. As can be seen in Figure 10, the overlapped images show that there are sporadic discrepancies between cadastral maps and aerial photographs. The first problem is that the building footprints are sometimes offset (Figure 10a), which does not necessarily affect the result unless a building is only partially shown in the image.
A more significant problem arises when there are buildings in the images that are not registered in the cadaster (Figure 10b). This results in instances with ground truth errors in the training, validation, and test datasets.
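The ground truth for this dataset reduces to a pixel share of the rasterized cadastral mask; a pure-Python sketch with a hypothetical binary mask (our own illustration, not the actual preprocessing code) might look like:

```python
def building_area_percentage(mask):
    """Percentage of pixels covered by building footprints.

    `mask` is a 2D list of 0/1 values rasterized from the cadastral
    building-footprint layer. Note that buildings absent from the
    cadaster (as in Figure 10b) are simply missing from the mask,
    which biases the computed ground truth downward.
    """
    total = sum(len(row) for row in mask)
    covered = sum(sum(row) for row in mask)
    return 100.0 * covered / total

# 4x4 toy mask with one 2x2 building footprint -> 25% coverage
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(building_area_percentage(mask))  # 25.0
```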
These are additional reasons why an MAE between 1 and 2 for the aerial image labeling dataset, or an MAE between 3 and 4 for the olive flowering phenophases dataset, is considered an excellent result. Figure 11 shows the scatter plot of predicted values versus the ground truth for the aerial images test dataset (1800 samples) and the Xception * 3 model. Ground truth values (blue) are pre-sorted. Rare major discrepancies are mainly due to erroneous ground truth data.
Figure 11. Scatter plot of predicted values versus ground truth, for the aerial image labeling dataset, Xception * 3 model.
Figure 12 shows an example of an error generated by the model itself. Building footprints from the cadaster (red) were indicated, but the model estimated that moored vessels (yellow) also represented buildings.
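The reported metrics can be reproduced from a set of predictions with the standard definitions; the following is a self-contained sketch with hypothetical values, not the evaluation script used in the study:

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired ground-truth/prediction lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # coefficient of determination
    return mae, rmse, r2

# Hypothetical proportions (in %) and near-perfect predictions.
y_true = [10.0, 25.0, 40.0, 55.0, 70.0]
y_pred = [11.0, 24.0, 41.0, 54.0, 70.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(round(mae, 2), round(rmse, 2), round(r2, 4))
```

Because RMSE squares the residuals before averaging, the rare large errors caused by faulty ground truth inflate RMSE more than MAE, which is consistent with the gap between the two metrics reported above.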

Figure 12. Aerial image with marked building footprints from cadaster (red) and moored vessels (yellow).

Conclusions
Precise proportion judgment is positively correlated with successful decision-making in a variety of decision tasks. It represents a particular type of ratio judgment in which a smaller magnitude is compared to a larger one. The automation of this process could therefore provide significant support in many application areas, a topic that, to our knowledge, has not been systematically explored so far.
The experiments were designed to investigate in detail the possibilities of different deep regression architectures, using three very different datasets that cover significantly different areas of application. Two datasets constitute our original contribution, while the third is our adaptation of a publicly available dataset.
The performed experiments showed that the selected CNN models, adjusted for proportion judgment, predict proportions more reliably than human experts could, even without explicitly counting individual objects. Based on the results for the best models, and considering the coefficient of determination (>0.95) and the amount of errors (MAE < 2, RMSE < 3), we concluded that, with sufficient data and the use of transfer learning, highly acceptable results can be achieved. Still, the main problem remains the reliability of the ground truth data.
The expanded toy dataset, with 25,000 examples and guaranteed reliability of the ground truth data, results in MAE values between 0.2 and 0.3 for the best models. With the maximum number of epochs increased to 500, an additional reduction of MAE to 0.14 can be achieved.
The two original datasets are a significant contribution of this project, and we invested considerable effort in their development. In the future, they could serve as reference data sources for the research and development of new methods for computer-assisted proportion judgment.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations