# Structured (De)composable Representations Trained with Neural Networks

## Abstract


## 1. Introduction

- Multi-label image classification, i.e., object recognition with multiple objects per image;
- Image retrieval where we query images that look like existing images, but contain altered class labels;
- Single-label image classification on pre-trained instance representations for a previously unseen label;
- Rank estimation with regard to the compression of the representations;
- Hyperparameter selection and the influence on classification performance and sensitivity;
- The effect of environment composition on classification performance: we show that models trained with more environments, and with suitable diversity in environment composition, produce more informative features in the penultimate layer.

## 2. Background and Related Work

#### 2.1. Representing Entities with Respect to Context

#### 2.2. Distances to Represent Features

#### 2.3. Random Features

#### 2.4. Interpretable Neural Networks

## 3. CoDiR: Method

#### 3.1. Setting Up Environments

- Hyperparameter ${n}_{e}$ is set, giving the number of environments.
- Hyperparameter R is set, giving the maximum number of labels per environment.
- For the j-th environment, we then:
  - (a) Sample the number of labels ${\mathrm{r}}_{j}\sim U[1,R]\in \mathbb{N}$;
  - (b) Sample the labels ${\mathrm{l}}_{m}^{(j)}$, with $m\in \{1,\dots ,{\mathrm{r}}_{j}\}$, uniformly without replacement from the set of all discrete environment labels ${l}_{k}$, $k\in \{1,\dots ,{n}_{l}\}$.

**Example 1.**

- For the first environment, we sample ${\mathrm{r}}_{1}\sim U[1,5]$ and obtain ${\mathrm{r}}_{1}=1$. Thus, we sample one label from {dog, cat, ball, baseball, …}: ${\mathrm{l}}_{1}^{(1)}=$ “ball”.
- Similarly, for the second environment, we sample ${\mathrm{r}}_{2}\sim U[1,5]$ and obtain ${\mathrm{r}}_{2}=2$. Thus, we sample two labels from {dog, cat, ball, baseball, …}: ${\mathrm{l}}_{1}^{(2)}=$ “dog” and ${\mathrm{l}}_{2}^{(2)}=$ “ball”.
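The two sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the label set is taken from Example 1, and `n_e` and the seed are arbitrary.

```python
import random

def sample_environments(labels, n_e, R, seed=0):
    """Sample n_e environments: for each, draw r_j ~ U[1, R] and then
    r_j labels uniformly without replacement from the label set."""
    rng = random.Random(seed)
    return [rng.sample(labels, rng.randint(1, R)) for _ in range(n_e)]

# Illustrative label set (from Example 1); n_e = 3 and the seed are arbitrary.
labels = ["dog", "cat", "ball", "baseball", "bat"]
envs = sample_environments(labels, n_e=3, R=5)
```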

#### 3.2. Contextual Distance

#### 3.3. Template and Instance Representations

#### 3.4. Implementation

Algorithm 1: The training process. For matrices and tensors, × refers to matrix multiplication and ∗ to element-wise multiplication.

#### 3.5. (De)Composing Representations

- Composition: After performing the SVD on the instance representations, one has access to label-level information. The contents of $\mathit{U}$ can then be changed, for example to reflect modified class membership. A new representation can then be rebuilt with the modified information. This will be explained in more detail below.
- Compression: After the SVD, the singular values are available. This means that one can retain the top k singular values and corresponding singular vectors, thus obtaining compressed representations of rank k. From this reduced information, new representations can be rebuilt that still perform well on classification tasks. As the spectral norm for the instance representations is large with a non-flat spectrum, the representations can be compressed significantly, for example by only retaining a few singular vectors of $\mathit{U}$ and $\mathit{V}$ (see Section 4.5). If $k=1$, the number of classes ${n}_{c}=91$, and the number of environments ${n}_{e}=300$, this would equate to a compression of: (combined size of first singular vectors)/(original representation size) $=(91+300)/(91\times 300)\approx 1.4\%$ of the size of the original representations. We call this method C-CoDiR(k).
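The rank-k truncation and the storage ratio quoted above can be sketched with numpy. The random matrix below is only a stand-in for a real instance representation; the dimensions match the example ($k=1$, ${n}_{c}=91$, ${n}_{e}=300$).

```python
import numpy as np

n_c, n_e, k = 91, 300, 1                 # values from the example above
rng = np.random.default_rng(0)
D = rng.standard_normal((n_c, n_e))      # stand-in for an instance representation

# Truncated SVD: keep the top-k singular values and vectors.
U, S, Vt = np.linalg.svd(D, full_matrices=False)
D_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k reconstruction

# Storage cost of C-CoDiR(k): k singular vectors of U and V
# versus the full n_c x n_e matrix.
ratio = k * (n_c + n_e) / (n_c * n_e)    # (91 + 300) / (91 * 300) ≈ 1.4%
```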

- Modify the information relating to ${c}_{+}$ in ${\mathit{D}}^{(s)}$. By increasing the value of ${\mathit{U}}_{{c}_{+},:}$, one can increase the distance estimate with respect to class ${c}_{+}$, thus expressing that ${\mathit{D}}^{(s)}\not\subset {c}_{+}$. Practically, one can set the values of ${\tilde{\mathit{U}}}_{{c}_{+},:}$ to the mean of all rows in $\mathit{U}$ corresponding to the classes $\overline{c}$ for which ${\mathit{D}}^{(s)}\not\subset \overline{c}$.
- Modify the information relating to ${c}_{-}$ in ${\mathit{D}}^{(s)}$. Here, one can decrease the value of ${\mathit{U}}_{{c}_{-},:}$ such that ${\mathit{D}}^{(\tilde{s})}\subset {c}_{-}$. To set the values of ${\tilde{\mathit{U}}}_{{c}_{-},:}$, one can perform an SVD on the matrix composed of all ${n}_{c}$ template representations $\mathit{T}$, thus obtaining ${\mathit{U}}_{\mathit{T}}{\mathit{S}}_{\mathit{T}}{\mathit{V}}_{\mathit{T}}^{T}$. As the templates by definition contain estimated distances with respect to environments for all classes, it is easy to see that by setting ${\tilde{\mathit{U}}}_{{c}_{-},:}={({\mathit{U}}_{\mathit{T}})}_{{c}_{-},:}$, we express that ${\mathit{D}}^{(\tilde{s})}\subset {c}_{-}$, as desired.
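The two modifications above might be sketched as follows. All matrices here are random stand-ins for the instance representation and the stacked templates of Section 3.3, and the class indices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_e = 6, 10
D = rng.standard_normal((n_c, n_e))   # stand-in for instance representation D^(s)
T = rng.standard_normal((n_c, n_e))   # stand-in for stacked class templates

c_plus, c_minus = 0, 1                # label to remove / label to add
negatives = [2, 3, 4, 5]              # classes the instance does not belong to

U, S, Vt = np.linalg.svd(D, full_matrices=False)
U_T, _, _ = np.linalg.svd(T, full_matrices=False)

U_mod = U.copy()
U_mod[c_plus, :] = U[negatives, :].mean(axis=0)  # raise the distance to c_plus
U_mod[c_minus, :] = U_T[c_minus, :]              # borrow the template row for c_minus

D_mod = U_mod @ np.diag(S) @ Vt       # rebuilt, modified representation
```

Note that only the rows of $\mathit{U}$ belonging to ${c}_{+}$ and ${c}_{-}$ are touched, so the rebuilt representation agrees with the original on all other classes.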

#### 3.6. Connection to Random Feature Maps

- Neural networks are used to compute the distances between the distributions defined by instances of class ${c}_{i}$ and those defined by environment ${e}_{j}$. Environments are given by a subset of the dataset with randomly chosen labels. See Section 3.1.
- We learn separate feature maps for each class with respect to all environments. For individual inputs, this leads to a 2D representation where each input can be analyzed with respect to different classes and environments. This structure allows us also to decompose the representation and perform modifications; see Section 3.5.
- Rather than the kernel computation as given by Equation (9), we compute a cosine similarity between the learned feature maps. This is explained in Section 3.3.
- We make a distinction between instance representations and class representations (templates). See Section 3.3.
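The cosine-similarity comparison between instance rows and templates can be sketched with numpy. This is a hedged illustration: the matrices are random stand-ins, and the per-class thresholds ${t}_{{c}_{i}}$ are set to an arbitrary value rather than learned as in the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n_c, n_e = 4, 8
D = rng.standard_normal((n_c, n_e))   # stand-in for an instance representation
T = rng.standard_normal((n_c, n_e))   # stand-in templates, one row per class
thresholds = np.full(n_c, 0.5)        # per-class thresholds t_{c_i} (illustrative)

# Row i of the instance representation is compared against template i;
# class i is predicted when the similarity exceeds its threshold.
scores = np.array([cosine(D[i], T[i]) for i in range(n_c)])
predicted = scores > thresholds
```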

## 4. Experiments

#### 4.1. Setup

#### 4.2. Multi-Label Image Classification

#### 4.3. Retrieval

- (1) NN: the most similar instance to a reference instance is retrieved.
- (2) M-NN: an instance is retrieved with modified class membership, while contextual information in the environments is retained. Specifically: “Given an input ${s}_{r}$ that belongs to class ${c}_{+}$ but not to ${c}_{-}$, retrieve the instance in the dataset that is most similar to ${s}_{r}$ while belonging to ${c}_{-}$ and not to ${c}_{+}$”, where ${c}_{+}$ and ${c}_{-}$ are class labels (see Figure 2). We will show that CoDiR is well suited to such a task, as its structure can be exploited to create modified representations ${\mathit{D}}^{(\overline{{s}_{r}})}$ through decomposition, as explained in Section 3.5.
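Both setups reduce to a nearest-neighbor search over representations (for M-NN, over the modified representation). A minimal sketch follows; cosine similarity over flattened representations is an assumption for illustration, not necessarily the paper's exact retrieval metric.

```python
import numpy as np

def nearest(query, candidates):
    """Return the index of the candidate most similar to the query,
    using cosine similarity over flattened representations."""
    q = query.ravel()
    sims = [float(c.ravel() @ q / (np.linalg.norm(c) * np.linalg.norm(q)))
            for c in candidates]
    return int(np.argmax(sims))

rng = np.random.default_rng(0)
dataset = [rng.standard_normal((3, 4)) for _ in range(5)]  # toy representations
idx = nearest(dataset[2], dataset)   # an identical copy is its own best match
```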

#### 4.4. Rank

#### 4.5. Compressed Representations

#### 4.6. Unseen Labels

#### 4.7. Environment Composition

#### 4.7.1. Hyperparameters

#### 4.7.2. Sensitivity

#### 4.7.3. Sufficient Number of Environments

#### 4.7.4. Number of Labels Per Environment

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Additional Retrieval Results

**Table A1.** For the NN retrieval experiment, we report performance metrics F1 score, Precision (PREC), and Recall (REC) over the class labels. Methods are used in combination with three different base models: ResNet-18/ResNet-101/Inception-v3. All results are the average of three runs.

Method | F1 | PREC | REC |
---|---|---|---|
SEM (single) | 0.64/0.66/0.70 | 0.64/0.67/0.70 | 0.65/0.66/0.70 |
SEM (joint) | 0.71/0.70/0.73 | 0.72/0.70/0.73 | 0.70/0.71/0.73 |
CNN (joint) | 0.71/0.70/0.70 | 0.71/0.70/0.70 | 0.71/0.70/0.70 |
CM | 0.72/0.74/0.74 | 0.73/0.73/0.72 | 0.72/0.74/0.76 |
CoDiR | 0.70/0.72/0.72 | 0.68/0.70/0.71 | 0.71/0.73/0.72 |
C-CoDiR(5) | 0.70/0.72/0.72 | 0.69/0.71/0.71 | 0.71/0.73/0.72 |

**Table A2.** For the M-NN retrieval experiment, we report performance metrics F1 score, Precision (PREC), and Recall (REC) over the class labels. Methods are used in combination with three different base models: ResNet-18/ResNet-101/Inception-v3. All results are the average of three runs.

Method | F1 | PREC | REC |
---|---|---|---|
SEM (single) | 0.69/0.69/0.72 | 0.68/0.68/0.71 | 0.69/0.70/0.73 |
SEM (joint) | 0.64/0.65/0.66 | 0.63/0.64/0.65 | 0.64/0.66/0.66 |
CNN (joint) | 0.67/0.60/0.65 | 0.66/0.60/0.64 | 0.67/0.60/0.66 |
CM | 0.61/0.62/0.65 | 0.62/0.61/0.63 | 0.61/0.62/0.66 |
CoDiR | 0.64/0.65/0.63 | 0.61/0.62/0.61 | 0.67/0.67/0.65 |
C-CoDiR(5) | 0.64/0.64/0.63 | 0.61/0.61/0.62 | 0.67/0.67/0.65 |

## References


**Figure 1.** The last layer of a convolutional neural network is replaced with fully connected layers that map to ${n}_{c}\times {n}_{e}$ outputs ${f}_{i,j}$ that are used to create instance representations that are interpretable along contextual dimensions, which we call “environments”. By computing the cosine similarity, rows are compared to corresponding class representations, which we refer to as “templates”.

**Figure 2.** Example of retrieval results for both NN and M-NN. For NN, based on the representation ${\mathit{D}}^{({s}_{r})}$, the most similar instance is retrieved. For M-NN, ${\mathit{D}}^{({s}_{r})}$ is modified into ${\mathit{D}}^{(\overline{{s}_{r}})}$ before retrieving the most similar instance.

**Figure 3.** Influence of R and ${n}_{e}$ on the F1 score for multi-label image classification using the CoDiR (class) approach with ResNet-18. When modifying R, ${n}_{e}$ is fixed to 300. When modifying ${n}_{e}$, R is fixed to 40. All data points are the average of three runs.

**Figure 4.** F1 score (y-axis) for multi-label image classification using the CoDiR (class) approach with ResNet-18. For models trained with different numbers of environments (${n}_{e}$), we perform multi-label classification using only a portion of the environments. The number of environments used (${n}_{s}$) is shown on the x-axis; the environments used are selected randomly. All models are CoDiR (class) with ${n}_{l}=91$ and $R=40$.

**Figure 5.** F1 score (y-axis) for multi-label image classification using the CoDiR (class) approach with ResNet-18. For models trained with different maximum numbers of labels per environment (R), we perform multi-label classification using only three of the available environments (${n}_{s}=3$). The x-axis indicates the number of labels ${\mathrm{r}}_{j}$ the environments have (e.g., given ${n}_{s}=3$, $x=20$ denotes the classification experiment where three environments were used, each with ${\mathrm{r}}_{j}=20$). All models are CoDiR (class) with ${n}_{l}=91$ and ${n}_{e}=300$.

**Figure 6.** F1 score (y-axis) for multi-label image classification using the CoDiR (class) approach with ResNet-18. For models trained with different maximum numbers of labels per environment (R), we perform inference using only 10 of the available environments (${n}_{s}=10$). The x-axis indicates the number of labels ${\mathrm{r}}_{j}$ the environments have (e.g., given ${n}_{s}=10$, $x=$ “15–20” denotes the classification experiment where 10 environments were used, each with ${\mathrm{r}}_{j}$ between 15 and 20). All models are CoDiR (class) with ${n}_{l}=91$ and ${n}_{e}=300$.

**Table 1.** Overview of the notation used in this paper.

Symbol | Description |
---|---|
${c}_{i}$ | Class label i. |
${n}_{c}$ | The number of distinct class labels. |
${e}_{j}$ | The j-th environment. |
${n}_{e}$ | The number of environments. |
${l}_{k}$ | Environment label k, used to construct the environments. |
${n}_{l}$ | The number of distinct environment labels ${l}_{k}$. |
R | The maximum number of labels per environment. |
${\mathrm{r}}_{j}$ | Random variable denoting the number of labels used to construct environment ${e}_{j}$. |
${\mathrm{l}}_{m}^{(j)}$ | Random variable denoting the m-th sampled label in environment ${e}_{j}$. |
${\mathit{D}}^{(x)}$ | Instance representation $\in {\mathbb{R}}^{{n}_{c}\times {n}_{e}}$ for an instance x. |
${\mathit{T}}_{i,:}$ | Template representation $\in {\mathbb{R}}^{{n}_{e}}$ for class ${c}_{i}$. |
$cos(.,.)$ | Function that computes the cosine similarity between two vectors. |
${t}_{{c}_{i}}$ | The threshold to determine class membership for class ${c}_{i}$. |

**Table 2.** F1 scores, Precision (PREC), and Recall (REC) for different models for the multi-label classification task. $\sigma $ is the standard deviation of the F1 score over three runs. All results are the average of three runs. An asterisk * is added to indicate a model for which outperformance is statistically significant at a 0.05 significance level with respect to its corresponding baseline in a one-tailed t-test with unequal variances. The highest F1 score for each comparison is indicated in bold.

MODEL | METHOD | ${\mathit{n}}_{\mathit{e}}$ | ${\mathit{n}}_{\mathit{l}}$ | R | F1 | PREC | REC | $\mathit{\sigma}$ |
---|---|---|---|---|---|---|---|---|
ResNet-18 | BXENT (single) | - | - | - | 0.566 | 0.579 | 0.614 | $3.6\times {10}^{-3}$ |
ResNet-18 | CoDiR (class) | 300 | 91 | 40 | **0.601** * | 0.650 | 0.613 | $8.0\times {10}^{-3}$ |
ResNet-101 | BXENT (single) | - | - | - | 0.570 | 0.582 | 0.623 | $1.3\times {10}^{-2}$ |
ResNet-101 | CoDiR (class) | 300 | 91 | 40 | **0.627** * | 0.664 | 0.648 | $2.5\times {10}^{-3}$ |
Inception-v3 | BXENT (single) | - | - | - | **0.638** | 0.663 | 0.669 | $5.4\times {10}^{-3}$ |
Inception-v3 | CoDiR (class) | 300 | 91 | 40 | 0.617 | 0.648 | 0.646 | $4.7\times {10}^{-3}$ |
ResNet-18 | BXENT (joint) | - | 300 | - | 0.611 | 0.631 | 0.654 | $1.1\times {10}^{-3}$ |
ResNet-18 | BXENT (joint) | - | 1000 | - | 0.614 | 0.637 | 0.653 | $9.3\times {10}^{-3}$ |
ResNet-18 | CoDiR (capt) | 300 | 300 | 40 | 0.629 * | 0.680 | 0.641 | $2.7\times {10}^{-3}$ |
ResNet-18 | CoDiR (capt) | 1000 | 1000 | 100 | **0.638** * | 0.686 | 0.651 | $1.9\times {10}^{-3}$ |
ResNet-101 | BXENT (joint) | - | 300 | - | 0.598 | 0.619 | 0.640 | $1.1\times {10}^{-2}$ |
ResNet-101 | BXENT (joint) | - | 1000 | - | 0.592 | 0.611 | 0.638 | $7.0\times {10}^{-3}$ |
ResNet-101 | CoDiR (capt) | 300 | 300 | 40 | 0.645 | 0.696 | 0.655 | $2.8\times {10}^{-2}$ |
ResNet-101 | CoDiR (capt) | 1000 | 1000 | 100 | **0.657** * | 0.702 | 0.666 | $1.3\times {10}^{-2}$ |
Inception-v3 | BXENT (joint) | - | 300 | - | 0.644 | 0.671 | 0.675 | $1.5\times {10}^{-2}$ |
Inception-v3 | BXENT (joint) | - | 1000 | - | 0.630 | 0.655 | 0.663 | $3.0\times {10}^{-2}$ |
Inception-v3 | CoDiR (capt) | 300 | 300 | 40 | 0.660 | 0.699 | 0.675 | $1.9\times {10}^{-3}$ |
Inception-v3 | CoDiR (capt) | 1000 | 1000 | 100 | **0.661** | 0.700 | 0.676 | $6.5\times {10}^{-3}$ |

**Table 3.** For the NN and M-NN retrieval, the F1 score of class labels and the Precision (PREC) of the modified labels are shown for the first retrieved instance. The F1% score measures the proportion of the F1 score for the contextual caption words on the M-NN task over the F1 score for the contextual caption words on the NN task. A higher F1% score is better. Methods are used in combination with three different base models: ResNet-18/ResNet-101/Inception-v3. All results are the average of three runs.

Method | NN F1 | M-NN PREC | M-NN F1% |
---|---|---|---|
SEM (single) | 0.64/0.66/0.70 | 0.53/0.55/0.55 | 93/87/89 |
SEM (joint) | 0.71/0.70/0.73 | 0.29/0.28/0.31 | 97/100/96 |
CNN (joint) | 0.71/0.70/0.70 | 0.37/0.26/0.33 | 92/90/92 |
CM | 0.72/0.74/0.74 | 0.19/0.15/0.18 | 100/100/100 |
CoDiR | 0.70/0.72/0.72 | 0.30/0.30/0.27 | 97/97/95 |
C-CoDiR(5) | 0.70/0.72/0.72 | 0.30/0.29/0.26 | 97/94/93 |

**Table 4.** For different values of rank k, we show the effect of compression on the outcomes of the retrieval task. For the NN and M-NN retrieval setups, the F1 score of class labels and the Precision (PREC) of the modified labels are shown respectively for the first retrieved item. The F1% score measures the proportion of the F1 score for the contextual caption words on the M-NN task over the F1 score for the contextual caption words on the NN task. A higher F1% score is better. Values are shown for a ResNet-18 CoDiR (capt, ${n}_{l}=300$) model.

Method | NN F1 | M-NN PREC | M-NN F1% |
---|---|---|---|
C-CoDiR(45) | 0.70 | 0.27 | 100 |
C-CoDiR(23) | 0.70 | 0.28 | 99 |
C-CoDiR(9) | 0.70 | 0.29 | 99 |
C-CoDiR(5) | 0.70 | 0.29 | 98 |
C-CoDiR(3) | 0.70 | 0.29 | 95 |
C-CoDiR(1) | 0.70 | 0.17 | 100 |

**Table 5.** F1 score for a simple logistic regression on pre-trained representations to classify a previously unseen label (“panting dogs”). For the last three models, ${n}_{l}=300$. Methods are used in combination with three different base models: ResNet-18/ResNet-101/Inception-v3. All results are the average of three runs. Note that for cases where both precision and recall are zero, the F1 score is undefined. For simplicity and legibility, we write that in such cases, the F1 score is 0 as well.

Method | F1 |
---|---|
SEM (single) | 0.00/0.00/0.00 |
CoDiR (class) | 0.10/0.06/0.07 |
C-CoDiR(5) (class) | 0.06/0.08/0.09 |
SEM (joint) | 0.00/0.10/0.00 |
CoDiR (capt) | 0.08/0.15/0.20 |
C-CoDiR(5) (capt) | 0.10/0.14/0.19 |

**Table 6.** Standard deviations of the F1 scores of the classification experiment for different values of ${n}_{e}$. Each value is computed on the basis of three runs.

 | ${\mathit{n}}_{\mathit{e}}=2$ | ${\mathit{n}}_{\mathit{e}}=4$ | ${\mathit{n}}_{\mathit{e}}=8$ | ${\mathit{n}}_{\mathit{e}}=16$ |
---|---|---|---|---|
$R=1$ | $6.5\times {10}^{-2}$ | $2.8\times {10}^{-2}$ | $2.1\times {10}^{-2}$ | $3.5\times {10}^{-2}$ |

**Table 7.** Standard deviations of the F1 scores of the classification experiment for different values of R. Each value is computed on the basis of three runs.

 | $\mathit{R}=2$ | $\mathit{R}=4$ | $\mathit{R}=8$ | $\mathit{R}=16$ |
---|---|---|---|---|
${n}_{e}=2$ | $7.2\times {10}^{-2}$ | $2.7\times {10}^{-2}$ | $3.3\times {10}^{-2}$ | $4.4\times {10}^{-2}$ |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Spinks, G.; Moens, M.-F.
Structured (De)composable Representations Trained with Neural Networks. *Computers* **2020**, *9*, 79.
https://doi.org/10.3390/computers9040079
