Learning Numerosity Representations with Transformers

Abstract: One of the most rapidly advancing areas of deep learning research aims at creating models that learn to disentangle the latent factors of variation from a data distribution. However, modeling joint probability mass functions is usually prohibitive, which motivates the use of conditional models that assume some information is given as input. In the domain of numerical cognition, deep learning architectures have successfully demonstrated that approximate numerosity representations can emerge in multi-layer networks that build latent representations of a set of images with a varying number of items. However, existing models have focused on tasks that require conditionally estimating numerosity information from a given image. Here we focus on a set of much more challenging tasks, which require conditionally generating synthetic images containing a given number of items. We show that attention-based architectures operating at the pixel level can learn to produce well-formed images approximately containing a specific number of items, even when the target numerosity was not present in the training distribution.


Introduction
In recent years, there has been a growing interest in the challenging problem of unsupervised representation learning [1]. Compared to the first wave of supervised deep learning success [2], unsupervised learning has great potential to further improve the capability of artificial intelligence systems, since it would allow building high-level, flexible representations without the need for explicit human supervision. Unsupervised deep learning models are also plausible from a cognitive [3] and biological [4] perspective, because they suggest how the brain could extract multiple levels of representations from the sensory signal by learning a hierarchical generative model of the environment [5][6][7][8].

Early approaches based on deep belief networks [9] already established that unsupervised representation learning leads to the discovery of high-level visual features, such as object parts [10] or written shapes [11,12]. However, the full potential of deep generative models was revealed by the introduction of variational autoencoders (VAE) [13] and generative adversarial networks (GAN) [14], which can discover and factorize extremely abstract attributes from the data [15,16]. These architectures can be further extended to promote the emergence of even more disentangled representations, as in beta-VAE [17] and InfoGAN [18], or can exploit attention mechanisms to produce meaningful decompositions of complex visual scenes [19].

An interesting case study for investigating the representational capability of deep learning models is numerosity perception, which consists of rapidly estimating the number of objects in a visual scene without resorting to sequential counting procedures [20]. Compared to other high-level visual features, numerosity information is particularly challenging to extract because it refers to a global property of the visual scene, which co-varies with many other non-numerical visual features, such as cumulative area.

Attention mechanisms [27] were first introduced in the context of machine translation to overcome the limitations of sequence-to-sequence architectures [28], which aimed at compressing the information contained in temporal sequences into fixed-length latent vectors. Shortly after, a novel architecture based solely on attention, called Transformer [29], achieved new heights by completely dropping recurrence and convolutions.

In analogy with the dynamics of associative memories [30], the power of this approach lies in the possibility of using a global attention mechanism to precisely and adaptively weight the contribution of each input element during processing. Transformers are starting to be applied also outside the language domain, with notable success in challenging computer vision tasks [31][32][33].

These promising results motivated our present work. In particular, we demonstrate that attention mechanisms can be successfully exploited to learn disentangled representations of numerosity, which can be used to generate novel synthetic images approximately containing a given number of items. Inspired by recent approaches that evaluated the capability of deep generative models to create novel attributes and their combinations [34], we probed the Transformer in different generative scenarios requiring the production of specific numerosities that were never encountered during training. We also analysed the internal structure of the representational code, in order to investigate whether numerosity information could be mapped into a lower-dimensional space that preserves the semantics of cardinal numbers [35].

The goal of the generative model is to produce images containing a controlled number of objects, which can be specified by manipulating the initial state of the generative process. Crucially, the generative model might even learn to produce out-of-distribution samples belonging to areas of the p(x, n) support that are not represented in the training set D, that is, images containing a number of objects that was never experienced during training.

Practically, we focus on an equivalent representation of the target density, which exploits the chain rule to allow the density estimation algorithm to work autoregressively:

p(x | n) = ∏_{i=1}^{r} p(x_i | x_1, …, x_{i−1}, n),

where x = (x_1, …, x_r) represents a flattened image made of r pixels. Let q(x | n, θ) be the approximate conditional PMF produced by the density estimation algorithm, with θ denoting the optimal model parameters; it originates from the minimization of the following negative log-likelihood:

θ = argmin_{θ'} E_{(x,n)∼D} [ −∑_{i=1}^{r} log q(x_i | x_1, …, x_{i−1}, n, θ') ].

This estimation step is carried out by a Transformer-based architecture, which is well suited for data characterized by spatial relationships (e.g., images); its backbone, indeed, is built from the self-attention layers devised in [29]. Overall, the following mapping is implemented: (x, n) → P ∈ R^{q×p}, where x = (x_1, …, x_q) denotes the categorical input intensities, q ≤ r, and p_i^T (i.e., the i-th row of P) represents the conditional PMF associated with x_i (the density support, of cardinality p, coincides with the set of input intensities).
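To make the objective concrete, the following minimal PyTorch-style sketch computes the autoregressive negative log-likelihood. It assumes a hypothetical `model(x, n)` that returns per-pixel logits of shape (batch, q, p); the interface and names are illustrative choices, not the authors' implementation.

```python
import torch.nn.functional as F

def autoregressive_nll(model, x, n):
    """Negative log-likelihood of ground-truth pixels under the model.

    x: (batch, q) integer pixel intensities in [0, p-1]
    n: (batch,)   integer numerosity labels
    """
    logits = model(x, n)                          # (batch, q, p); row i ~ q(x_i | x_<i, n, θ)
    log_probs = F.log_softmax(logits, dim=-1)     # per-pixel conditional PMFs
    ll = log_probs.gather(-1, x.unsqueeze(-1)).squeeze(-1)  # log-prob of each true pixel
    return -ll.sum(dim=1).mean()                  # sum over pixels, average over the batch
```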

The deployed encoder only accepts sequences of real-valued d-dimensional vectors. As a consequence, the supplied dataset entries undergo careful processing. Firstly, the (x_1, …, x_{q−1}) intensities are mapped into q − 1 embeddings, X ∈ R^{(q−1)×d} (pixel x_q is never consumed by the autoregression: during generation, the gray level obtained at the previous pass is appended to the input sequence and the process is repeated). Then, the encoder input is computed as

Z_0 = [s; X] + E,     (5)

where s ∈ R^d encodes the equivalence class to which the considered image belongs (i.e., the numerosity n), [s; X] ∈ R^{q×d} denotes the row-wise concatenation of s and X, and E ∈ R^{q×d} stores information about the pixel positions.
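A minimal sketch of this preprocessing step, under the assumption of a learned positional table and a dedicated embedding for each numerosity class (module and parameter names are ours, not the paper's):

```python
import torch
import torch.nn as nn

class PixelEmbedder(nn.Module):
    """Builds the encoder input of Eq. (5): prepend the numerosity vector s
    to the pixel embeddings X and add positional information E."""

    def __init__(self, p, d, q, num_classes):
        super().__init__()
        self.intensity_emb = nn.Embedding(p, d)               # pixel intensity -> R^d
        self.numerosity_emb = nn.Embedding(num_classes, d)    # conditioning vector s
        self.position_emb = nn.Parameter(torch.randn(q, d))   # learned positions E

    def forward(self, x, n):
        # x: (batch, q-1) previous pixel intensities; n: (batch,) numerosity class
        tokens = self.intensity_emb(x)               # (batch, q-1, d)
        s = self.numerosity_emb(n).unsqueeze(1)      # (batch, 1, d)
        z = torch.cat([s, tokens], dim=1)            # prepend s
        return z + self.position_emb[: z.shape[1]]   # add E (truncated to prefix length)
```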
The encoder consists of 2L properly stacked multi-head scaled dot-product attention (mha(·)) and point-wise fully connected (fc(·)) sub-layers. Residual connections and layer normalizations (norm(·)) complete the architecture. Resuming from (5), the encoder pipeline can be described as

A_l = norm(Z_{l−1} + mha(Z_{l−1})),
Z_l = norm(A_l + fc(A_l)),

with the l ∈ [1, L] subscript denoting the considered layer. The detailed implementations of mha(·), fc(·) and norm(·) can be found in [29]. Finally, the linear(·) and softmax(·) functions are assembled to produce the target conditional densities:

P = softmax(linear(Z_L)).

[Figure: attention graphs [37] of the trained model.]

The first dataset, which we call Uniform Dots, contained images featuring objects of uniform size (see samples in the top row of Fig. 2). In this dataset the numerosity information is perfectly correlated with the total number of active pixels, which does not allow us to assess to what extent the Transformer can disentangle numerosity from cumulative area. We thus also introduced a second dataset, which we call Non-Uniform Dots, containing images featuring objects of different sizes and constant (on average) cumulative area (see samples in the bottom row of Fig. 2). Let A_dot ∼ N(µ_dot, σ²_dot) be the random variable quantifying the individual area covered by a dot. The total area covered in a frame characterized by n dots can then be expressed as

A_tot = ∑_{i=1}^{n} A_dot,i ∼ N(n·µ_dot, n·σ²_dot).

Setting µ_dot inversely proportional to n therefore keeps the expected cumulative area constant across numerosities. Besides decoupling numerosity from cumulative area, these datasets allowed us to study the internal structure of the Transformer's latent space, in order to investigate whether it could embed the semantics of cardinal numbers.
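As an illustration, the Non-Uniform Dots construction can be sketched as follows; frame size, target cumulative area and overlap handling are our own arbitrary choices, and only the scaling µ_dot ∝ 1/n is taken from the description above.

```python
import numpy as np

def make_nonuniform_dots(n, frame=64, total_area=300.0, rel_sigma=0.3, rng=None):
    """Binary image with n dots whose areas are N(mu_dot, sigma_dot^2),
    with mu_dot = total_area / n so the expected cumulative area is constant."""
    rng = np.random.default_rng(rng)
    img = np.zeros((frame, frame), dtype=np.uint8)
    mu_dot = total_area / n
    sigma_dot = rel_sigma * mu_dot
    yy, xx = np.mgrid[0:frame, 0:frame]
    for _ in range(n):
        area = max(rng.normal(mu_dot, sigma_dot), 1.0)         # avoid degenerate dots
        radius = np.sqrt(area / np.pi)
        cy, cx = rng.uniform(radius, frame - radius, size=2)   # keep the dot inside the frame
        img[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1
    return img
```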

As an initial assessment, the Transformer was evaluated in a straightforward conditional generation task: given the ground-truth (x, n) tuple, the goal is to approximate x through the modeled q(x|n, θ), incrementally building the image x̂ according to

x̂_i ∼ q(x_i | x_1, …, x_{i−1}, n, θ),   i = 1, …, r.     (12)

In other words, each pixel is determined by those preceding it in the fixed scan order, and the current ground-truth pixel values are provided as input at each time step. This task was only used to monitor learning progress, since it is well known that one-step-ahead prediction is much easier than autoregressive self-generation [38].
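A sketch of this teacher-forced evaluation, reusing the hypothetical `model(x, n)` interface introduced above (the decoding rule, greedy versus sampled, is an illustrative choice):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def one_step_ahead(model, x, n, greedy=True):
    """Predict every pixel from the ground-truth pixels that precede it."""
    logits = model(x, n)                     # (batch, q, p); ground truth always fed as input
    if greedy:
        return logits.argmax(dim=-1)         # most likely intensity for each position
    probs = F.softmax(logits, dim=-1)
    return torch.distributions.Categorical(probs=probs).sample()
```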

In the more challenging spontaneous generation tasks, we probed the capability of the Transformer to build an entire novel image x̂ from scratch according to

x̂_i ∼ q(x_i | x̂_1, …, x̂_{i−1}, n, θ),   i = 1, …, r.

Unlike (12), each pixel intensity is now conditioned on the previously sampled ones. We also investigated whether the learned numerosity conditioning vectors s (SoSs) could be mapped into a lower-dimensional space, akin to an ordered "number line" [40].
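A corresponding sketch of the spontaneous generation loop; it assumes the model accepts a growing prefix of previously sampled pixels (the raster-scan order and the `model(x, n)` interface are again our assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, n, num_pixels, device="cpu"):
    """Grow an image pixel by pixel, feeding back the sampled intensities."""
    batch = n.shape[0]
    x = torch.zeros(batch, 0, dtype=torch.long, device=device)  # empty prefix
    for _ in range(num_pixels):
        logits = model(x, n)                        # (batch, len+1, p)
        probs = F.softmax(logits[:, -1], dim=-1)    # PMF of the next pixel
        nxt = torch.multinomial(probs, 1)           # sample its intensity
        x = torch.cat([x, nxt], dim=1)              # append and repeat
    return x                                        # flattened generated image
```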

To explore the possibility that the learned SoSs could be arranged along a one- or two-dimensional manifold, we inspected the geometry of the encoder's embedding space (see Fig. 5).
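One simple way to perform such an inspection is a linear projection of the SoS vectors onto their first two principal components; this is only an illustrative choice, as the exact analysis behind Fig. 5 is not reproduced here.

```python
import numpy as np

def pca_2d(sos):
    """Project the (N, d) matrix of SoS embeddings onto two principal axes."""
    centered = sos - sos.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T      # (N, 2) coordinates, one point per numerosity
```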

After each generation task, we computed the SoS-specific histograms of the generated numerosities. We provide two different histogram visualizations: one depicts the relative frequency of each generated numerosity [34,42], while a 2D histogram is used to reproduce the visualization often used in human behavioral studies [43].
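The numerosity of each generated image can be estimated, for instance, by counting connected components after binarization; the threshold and bin edges below are illustrative choices rather than the exact analysis pipeline.

```python
import numpy as np
from scipy import ndimage

def count_dots(img, threshold=0.5):
    """Number of connected components (dots) in a generated image."""
    _, num_components = ndimage.label(img > threshold)
    return num_components

def numerosity_histogram(images, max_n=32):
    """Relative frequency of each generated numerosity, per SoS condition."""
    counts = [count_dots(img) for img in images]
    hist, _ = np.histogram(counts, bins=np.arange(0.5, max_n + 1.5))
    return hist / hist.sum()
```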

The generation histograms for the spontaneous generation over trained numerosities task are shown in Fig. 3. Especially for the Uniform Dots dataset (top panels), it is evident that the Transformer is able to create synthetic images with a specified numerosity, consistent with the well-known findings that variability tends to increase with numerosity [42,44] and that numerosity estimation can be altered by confounding non-numerical magnitudes [21,25].

Notably, the synthetic images produced by our Transformer are much more precise. Overall, these analyses suggest that the learned embeddings approximately capture a sort of "successor function" only over the local neighborhood of a specific numerosity.
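A possible way to quantify this local structure, not taken from the paper, is to check whether the most similar SoS embedding of each numerosity corresponds to an adjacent numerosity.

```python
import numpy as np

def nearest_numerosity(sos):
    """For each row of the (N, d) SoS matrix (row i = numerosity i+1),
    return the numerosity whose embedding is most cosine-similar."""
    norms = np.linalg.norm(sos, axis=1, keepdims=True)
    sims = (sos @ sos.T) / (norms * norms.T)     # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)              # exclude self-matches
    return sims.argmax(axis=1) + 1               # 1-based numerosity of the nearest neighbor
```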

The lower-dimensional manifold structure of the encoder space is shown in Fig. 5. Interestingly, and in partial alignment with other recent computational work [35], it seems that the topology of the numerosity embeddings preserves the strict ordering of cardinal numbers. However, it should be stressed that the generation process is error-prone, and the high dimensionality of the probability mass function to be estimated makes generalization to out-of-distribution samples particularly challenging [34]. A key open issue is thus to establish whether domain-general deep learning architectures could extrapolate numerical knowledge well beyond the limits of their training distribution, which would require learning more abstract conceptual structures, such as the successor function [46], that form the foundation of our understanding of natural numbers [47]. The computational cost of applying self-attention at the pixel level can be mitigated in several ways, for example by restricting self-attention receptive fields to local neighborhoods [31], reducing image resolution [32], or focusing on image patches [33].

Table: Hyperparameter values for the "S", "M" and "L" model sizes.

Figure A2. Sample images conditioned on the interpolated SoSs (i.e., spontaneous generation over interpolated numerosities task). From top to bottom, the displayed rows correspond to the unseen numerosities 2, 4 and 6, respectively.

Figure A3. Sample images conditioned on the extrapolated SoS (i.e., spontaneous generation over extrapolated numerosities task). The results reported refer to α = 1. Note how the rendering of dots is qualitatively less precise than the ones shown in Figures A1 and A2.