Disentangled Autoencoder for Cross-Stain Feature Extraction in Pathology Image Analysis

Abstract: A novel deep autoencoder architecture is proposed for the analysis of histopathology images. Its purpose is to produce a disentangled latent representation in which the structure and colour information are confined to different subspaces so that stain-independent models may be learned. For this, we introduce two constraints on the representation which are implemented as a classifier and an adversarial discriminator. We show how they can be used for learning a latent representation across haematoxylin-eosin and a number of immune stains. Finally, we demonstrate the utility of the proposed representation in the context of matching image patches for registration applications and for learning a bag of visual words for whole slide image summarization.


Introduction
Digital pathology has undergone a significant transformation in recent years, becoming a key tool in modern clinical practice and research. This was facilitated by developments both in hardware, with cheaper and faster computers and storage, the availability of more powerful graphics processing units (GPUs) with larger memory, faster slide scanners, etc., and in computational methods, with deep learning approaches being the most prominent example. Faster and more performant pathology slide scanners enable almost mass-production of digital slide archives and put high pressure on the computational infrastructure. For example, a single whole slide image (WSI) of a tissue sample scanned at 20× magnification easily contains 10^10 multi-channel pixels, describing highly complex multi-scale structures. Clearly, there is a need for highly performant image analysis methods able to detect, extract, measure and summarize these structures. For a detailed review of the current challenges and trends in digital pathology, see [1,2].

Immunohistochemistry (IHC) is a method for visualizing, at light microscopy level, the spatial distribution of targets (antigens) of interest by exploiting the specificity of antibody binding with its antigen. Routine IHC is limited in the number of antibodies (colours) that can be used in a single tissue section, meaning that studying the (co-)distribution of several targets requires several consecutive sections. More recently, a number of protocols (e.g., stain multiplexing) allow the use of several antibodies for visualizing multiple targets at the same time. Here, we are interested in multi-stain experiments that employ the routine IHC approach, which relies on analysing consecutive sections stained with different antibodies. The central question is how one could jointly investigate the resulting images in order to identify correlations or co-occurrences (or lack thereof) of various

Materials
A set of images representing colon cancer cases (COAD set) was downloaded from the "Automatic Non-rigid Histological Image Registration" challenge website (https://anhir.grandchallenge.org/Data), organized in the framework of ISBI 2019. The images were scanned with a 3DHistec Panoramic MIDI II scanner at 10× magnification, for a resolution of 0.468 microns/pixel with white balance set to auto, and grouped in series of consecutive sections, each consisting of one H&E section and several IHC sections (representing immune response and hypoxia). From this set, 10 series (H&E and IHC from the same block, totalling 55 whole slide images) were used for building the models and another 10 series for testing them. See Supplementary Materials (Table S1) for the list of images used in training and testing.

Disentangled Autoencoder
In general, an autoencoder consists of two main blocks (networks): an encoder and a decoder, which are designed to perform (i) encoding of the input data into a lower dimensional latent embedding, and (ii) reconstruction of the input from the lower dimensional space. This can be seen as a composition of two functions such that the output x′ = f_d(θ; f_e(φ; x)), where f_e(φ; ·) and f_d(θ; ·) are the encoder and decoder functions, respectively, and φ and θ are their corresponding parameters, learned through an optimization process (Figure 1). The objective function for this optimization could be, for example, the mean squared error or the cross entropy. The main goal when designing an autoencoder is to obtain a latent representation z of the input which captures the main traits of the modelled domain, generalises well to unseen data, and is robust to noise. The encoded sample z can be obtained by drawing a sample from a multivariate normal distribution, thereby yielding better generalization properties [20]. Several variants of the basic autoencoder have been proposed, including the denoising autoencoder [21] and the sparse autoencoder [22]. In the present work, the model is based on the β-variational autoencoder [23] for disentangled representation learning: the latent loss from the original variational autoencoder formulation by Kingma and Welling [20] is multiplied by a constant β > 1. This additional weighting factor promotes independence of the latent dimensions as it enforces stronger similarity between the learned distribution and the standard multivariate normal prior. Independence between individual dimensions results in more disentangled features [24].
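For a diagonal-Gaussian encoder, the β-weighted latent loss described above can be sketched as follows. This is a minimal NumPy illustration (the original model is implemented in TensorFlow; the function name, signature, and default β are our assumptions, not the paper's code):

```python
import numpy as np

def beta_vae_latent_loss(mu, log_var, beta=4.0):
    """KL divergence between the encoder output N(mu, diag(exp(log_var)))
    and the standard normal prior N(0, I), weighted by beta > 1 as in the
    beta-VAE formulation. Inputs have shape (batch, latent_dim)."""
    # Closed-form KL for diagonal Gaussians vs. N(0, I), summed per sample.
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)
    return beta * np.mean(kl)
```

Setting β = 1 recovers the original VAE latent loss; larger β pushes the posterior closer to the isotropic prior and thus toward independent (disentangled) latent dimensions.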

The basic components, namely the encoder, sampler, and decoder, are extended by a classifier and a discriminator, leveraging the class label information. In the context of differently stained histology images, the class label denotes the domain of the data (i.e., H&E or type of immune stain: CD44, FOXP3, etc.). The overall model architecture is depicted in Figure 1. Let χ be the set of categories in the data, with |χ| denoting the set cardinality, i.e., the number of different categories. The classifier is trained to predict the class label distribution from activations ŷ_s, approximating the true label y ∈ χ from a subspace z_s of the latent code for an input x. Therefore, it serves as a measurement of the class-related information contained in that subset of the code. The discriminator works in an equivalent way on the complementary subspace of the latent code, z_m, predicting the same label. The difference lies in how the loss terms are used in updating the model parameters.
The weights of the individual network components, parameterized by Φ, ψ, φ, and θ for the encoder, classifier, discriminator, and decoder, respectively, are optimized with different objective functions during joint model training. The latent loss L_S for a sample x is computed as the Kullback-Leibler divergence between the encoder output N_x (see Equation (2)) and the standard multivariate Gaussian prior with zero mean and diagonal covariance matrix with unit variances, following [20] but with the additional weighting factor β from [24] (see also Equation (8)). The reconstruction loss L_R, for an image (region) x ∈ R^{h×w×c} of width w, height h and number of channels c, is based on the sum of absolute differences over the image domain. The supervised softmax classification loss functions are L_C and L_D, respectively, and are minimized by updating the classifier and discriminator weights ψ and φ. The joint encoder/decoder loss function combines the losses defined in Equations (3) and (5)-(7) into a single grand loss, as a weighted sum of the reconstruction loss (L_R), latent loss (L_S), classification loss (L_C), and discriminator loss (L_D). Note that the discriminator loss is used in an adversarial manner such that Φ is optimized to maximize L_D, but minimize L_C. The terminology of adversarial training and of a discriminator was introduced by Goodfellow et al. [25] and is by now strongly associated with generative adversarial networks (GANs). Despite this not being a GAN-based approach, we use the terms as the components have similar properties and purpose. The discriminator acts as a measurement of model performance, as low accuracy after training marks low content of stain-related information in z_m, and its loss is used to update the encoder weights Φ, similar to how the generative model is trained in a GAN approach.
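The sign convention of the adversarial combination can be sketched as below. The weight names and default values are illustrative assumptions; the paper specifies a weighted sum but its exact coefficients are not reproduced here:

```python
def encoder_decoder_loss(l_rec, l_lat, l_cls, l_disc,
                         w_rec=1.0, w_lat=1.0, w_cls=1.0, w_disc=1.0):
    """Grand loss for the encoder/decoder update. The discriminator term
    enters with a negative sign, so minimizing this loss maximizes L_D
    (adversarial: the encoder tries to strip stain information from z_m)
    while L_C is minimized as usual (stain information is kept in z_s)."""
    return (w_rec * l_rec + w_lat * l_lat
            + w_cls * l_cls - w_disc * l_disc)
```

The classifier and discriminator themselves are updated separately, each minimizing its own softmax cross-entropy.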
By restricting the label-related information to a specific part of the latent code, z_s, the remaining part z_m becomes domain-invariant, as it contains structural information which persists across domains. A similar approach is pursued in [26,27]. Disentangling the latent space allows downstream applications to choose precisely which features to include for the task at hand; this is already done implicitly by the classification network.

Network Architecture
Residual layers with skip connections, as proposed by [28], are used for the encoder to provide stronger gradients and to increase the network depth in order to learn higher level features. Spatial down-sampling is achieved by maximum pooling layers, while the ResNet blocks have a stride of 1 and use the implementation from [29]. The full encoder is shown in Figure 2. The sampler activations are used as the mean µ_x and the lower triangle Σ_x^LT of the covariance matrix, for x having a multivariate normal distribution z ∼ N(µ_x, Σ_x), with Σ_x = (Σ_x^LT)′ Σ_x^LT, the prime indicating matrix transposition. For an input sample x, the encoder and sampler thus output the parameters µ_x and Σ_x^LT defining N_x. The activation functions used are the identity mapping f(x) = x and the exponential function f(x) = e^x for the mean and covariance components, respectively.
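Given the factorization Σ_x = (Σ_x^LT)′ Σ_x^LT, a reparameterized sample with the desired covariance can be drawn as sketched below (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def sample_latent(mu, cov_lt, rng):
    """Reparameterized draw z ~ N(mu, Sigma), with Sigma = cov_lt' @ cov_lt
    and cov_lt the lower-triangular sampler activation. Since
    cov(cov_lt.T @ eps) = cov_lt.T @ cov_lt = Sigma for eps ~ N(0, I),
    the sample has the desired covariance while gradients can flow
    through mu and cov_lt in the learned model."""
    eps = rng.standard_normal(mu.shape[0])
    return mu + cov_lt.T @ eps
```

With a zero covariance factor the sampler degenerates to the deterministic mean, which is the usual sanity check for the reparameterization trick.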
Following standard autoencoder conventions, the decoder approximately reconstructs the input x by a series of deconvolution and up-sampling operations without residual skip connections. A kernel size of 2 is used for the first deconvolution to increase the spatial resolution like pooling layers, see Figure 3.
Both the classifier and the discriminator are single layer dense networks with |χ| weights, since the latent representation is already an abstract high-level representation of the raw data.
The final model used for the experiments was trained on 10,000,000 image patches of size 64 × 64 pixels for approximately 2,500,000 steps with a batch size of 256. The encoder comprises 6 ResNetV2 blocks with 2 units each, followed by activation, batch normalization and pooling layers, while the decoder has 8 transposed convolution, activation and batch normalization layers, as well as 6 unpooling layers in order to restore the initial input shape.

Distances in the Latent Space
Once a model is trained, the latent representation z of any input image x can be obtained. Then, for any practical application, a distance (or similarity) function needs to be defined which should satisfy the metric axioms and, ideally, capture the intended distance between the corresponding inputs x_1, x_2. The stochastic process involved in obtaining z violates the first axiom; therefore, we use a distance over the probability distributions.
Since the latent space representation of an input image x is given by a multivariate Gaussian distribution, the distances between these representations cannot be computed as Euclidean distances. Let P = N(µ_P, Σ_P) and Q = N(µ_Q, Σ_Q) be the two multivariate Gaussians corresponding to the latent vectors z_1 and z_2. In our case, we tested two distances:

• Bhattacharyya distance (BD):

D_B(P, Q) = (1/8)(µ_P − µ_Q)′ Σ^{−1} (µ_P − µ_Q) + (1/2) ln( det Σ / sqrt(det Σ_P det Σ_Q) ), with Σ = (Σ_P + Σ_Q)/2; (10)

• Symmetrized Kullback-Leibler divergence (SKLD):

D_SKL(P, Q) = D_KL(P ∥ Q) + D_KL(Q ∥ P), (11)

where D_KL is the Kullback-Leibler divergence as used in the latent loss from Equation (3). Equation (11) further simplifies in the case of diagonal covariance matrices for the two distributions.
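For diagonal covariance matrices, both distances reduce to closed forms over the per-dimension means and variances. A minimal sketch (function names are ours):

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between diagonal Gaussians
    N(mu1, diag(var1)) and N(mu2, diag(var2))."""
    var = 0.5 * (var1 + var2)           # averaged covariance
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    # log-det ratio term, per dimension for diagonal covariances
    term2 = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return term1 + term2

def kl(mu1, var1, mu2, var2):
    """KL(P || Q) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def skld(mu1, var1, mu2, var2):
    """Symmetrized Kullback-Leibler divergence."""
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)
```

Both functions return 0 for identical distributions and are symmetric in their arguments, as required for a distance.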
To measure the similarity of two patches, we therefore compute the Bhattacharyya distance given in Equation (10) or the symmetrized KL divergence from Equation (11) of the probability distributions defined by the encoder output for the given input patches. We leverage the disentanglement properties of the model by choosing only the dimensions which capture the morphological information, instead of the whole latent space, in order to measure the similarity in a stain-independent manner. Therefore, only the probability space over the dimensions defining z_m is used for the distance computation. Visualizations of the whole latent space and the subspaces using principal component analysis (PCA) and t-stochastic neighbour embedding (tSNE) are shown in Figures 4 and 5, respectively.

Bag of Visual Words
The use of visual codebooks as a basis for summarizing images has seen many applications since it was initially proposed [30], being successfully used in the analysis of digital pathology images as well [31,32]. Generally, the method requires constructing a dictionary of visual words (a codebook) that is used as a set of templates for summarizing the images. An image is recoded in terms of this codebook by assigning to each local neighbourhood (usually a small rectangular region) a code representing the closest template from the codebook. After recoding, one can simply use the histograms of such codes as a summary of the image. For the construction of the codebook, k-means clustering and Gaussian Mixture Models (GMMs) are the most common choices [30], and are typically used with either standard bag-of-visual-words or other aggregators. Jégou et al. [33] give a comprehensive comparison of various design choices.
In our case, the latent representation of the input images forbids the use of k-means or GMMs, since these methods implicitly assume a Euclidean space for their input data. We therefore resorted to using the symmetrized Kullback-Leibler divergence (Equation (11)) with the DBSCAN clustering algorithm [34]. Since the algorithm does not identify centres of the clusters but rather core samples characterizing the clusters, we used the latter as the codewords in the dictionary.
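A minimal sketch of this codebook construction with scikit-learn's DBSCAN on a precomputed SKLD matrix; the eps and min_samples values are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def build_codebook(dist_matrix, eps=2.0, min_samples=5):
    """Cluster latent codes with DBSCAN over a precomputed, symmetric
    SKLD distance matrix and return the indices of the core samples,
    which serve as the codewords of the dictionary."""
    db = DBSCAN(eps=eps, min_samples=min_samples, metric="precomputed")
    labels = db.fit_predict(dist_matrix)   # -1 marks noise points
    return db.core_sample_indices_, labels
```

Using metric="precomputed" lets DBSCAN accept any non-Euclidean distance, which is what makes it a valid substitute for k-means here.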

Figure 5. PCA of the stain (left) and morphology (right) subspaces. Colouring indicates the label (staining) and the data was subjected to a sphering (whitening) transformation for better visualization.


Statistical Analyses
For comparing the results of patch matching in the image registration scenario, we used the Wilcoxon rank sum test for pairs of measurements (errors measured in pixels) with a predefined significance level α = 0.05. When analysing the results of image summarization in the intra-tumour heterogeneity scenario, we used hierarchical clustering with Ward linkage for clustering the histograms of codewords (with the Earth-mover distance between histograms). All statistical analyses were performed in R version 3.6.2.
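The analyses were run in R; an equivalent rank-sum check in Python via SciPy might look like the following (function name ours):

```python
from scipy.stats import ranksums

def compare_errors(err_a, err_b, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test on two samples of matching
    errors (in pixels). Returns the p-value and whether the difference
    is significant at the given level."""
    stat, p = ranksums(err_a, err_b)
    return p, p < alpha
```

The rank-sum test is appropriate here because the per-landmark errors are not assumed to be normally distributed.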

Results
We implemented the method described above in Python, using the TensorFlow 1.12 library. The code and a number of trained models are available from https://github.com/hechth/dpath_gae.

Latent Space Disentanglement
The disentanglement properties of the trained model were assessed visually by inspecting graphical representations of the learnt latent space, such as PCA and tSNE projections. The embedding of the whole latent space using tSNE (Figure 4) shows clear clustering of the data according to morphology and staining.
The subspaces z_s and z_m were visualized using PCA (Figure 5) with label-related colouring to emphasize the distribution of identically stained images. The subspace used for stain prediction by the classifier shows a tight label-related coupling, while the structural embedding in the complementary part indicates only loose stain-related clustering, with morphologically similar image patches grouped together.

Image Registration Scenario
We compared the performance of our method with patch-based adaptations of (i) sum of absolute differences and (ii) mutual information. For a given set of corresponding landmarks in two images, we computed the similarity of patches extracted at the landmark positions in the source image with all positions in a 64 × 64 window around the corresponding landmark in the target image. The score was calculated as the distance between the position with the best similarity value and the actual landmark position in the target image. This score was averaged over all landmarks for eight different image pairs coming from the ANHIR 2019 challenge dataset. Each image pair had between 60 and 90 landmarks, 676 in total.
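The per-landmark score described above can be sketched as follows (a minimal illustration; the function name and array convention are ours):

```python
import numpy as np

def best_match_error(score_window, true_pos):
    """Given a (64, 64) array of similarity distances for every candidate
    position in the search window around a landmark, and the window
    coordinates of the true landmark, return the Euclidean distance
    (in pixels) between the best-scoring position (minimum distance)
    and the landmark."""
    best = np.unravel_index(np.argmin(score_window), score_window.shape)
    return float(np.hypot(best[0] - true_pos[0], best[1] - true_pos[1]))
```

Averaging this value over all landmarks of an image pair yields the score reported in Table 1.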
In Figure 6, boxplots of the distances between the best matching position and the hand-marked landmark are shown for each of the considered metrics. Summaries of these distances are given in Table 1. Only in two cases were the differences between any of the new metrics and mutual information statistically significant (p < 0.05, Wilcoxon rank sum test): in one case (03_S2_S5), MI was superior to the two proposed distances, while in the second case (05_S2_S6), MI was inferior. Thus, the two new distances seem to perform similarly to MI, while outperforming the raw pixel-based metric (SAD). At the same time, the stability of the new metrics seems to be slightly better than that of the other two, as shown by the lower standard deviation of the corresponding errors (Table 1).
We also explored the applicability of the novel feature extraction method and similarity metric in two different image registration algorithms. We developed a deformable image registration method based on ITK, akin to the methods presented in [35]. The second approach matches ORB key-points extracted using OpenCV using our similarity metric. Neither of the approaches yielded viable results, as can be seen in Figure 7. Both methods are open for further development.

Figure 6. Comparison of four distances for patch-matching. Eight different stain pairs (denoted Si_Sj) were tested with four distance metrics: symmetrized Kullback-Leibler divergence (SKLD), Bhattacharyya distance (BD), sum of absolute differences (SAD) and mutual information (MI). Only two cases showed statistically significant differences: 03_S2_S5 and 05_S2_S6.

Intra-Tumour Heterogeneity Scenario
For the second set of experiments, we constructed a visual codebook using the latent representation of patches. Since we were interested in regions with higher content (in terms of cells), we excluded (before building the codebook) all close-to-constant image patches (standard deviation of the average over channels below 0.25). The resulting codebook had n = 98 codewords; the image patches corresponding to the codewords in the latent space are shown in Figure 8. It is immediately apparent that the selected codewords originate from slides with different types of staining, and some can be easily labelled as representing epithelial cells, immune cells, or mucosa regions.
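The per-image summarization can be sketched as below: each patch's latent code is assigned to its nearest codeword under the chosen latent-space distance, and the normalized histogram of assignments is the heterogeneity descriptor (names and signature are ours):

```python
import numpy as np

def bovw_histogram(patch_codes, distance_fn, codewords):
    """Assign each latent code to its nearest codeword under distance_fn
    and return the normalized bag-of-visual-words histogram."""
    counts = np.zeros(len(codewords))
    for code in patch_codes:
        nearest = min(range(len(codewords)),
                      key=lambda i: distance_fn(code, codewords[i]))
        counts[nearest] += 1
    return counts / counts.sum()
```

In the paper's setting, distance_fn would be the SKLD over the z_m subspace; any distance over the latent representation can be plugged in.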

For each image (30 images, three different stains in 10 series), we computed their representation based on the visual codebook and summarized it in terms of histograms which could be seen as a descriptor of the heterogeneity (of the tissue).
We wanted to assess the efficiency of these simple summaries in capturing the image content, so we clustered the corresponding histograms. The resulting dendrogram is shown in Figure 9. The results are mixed: for some series (e.g., 03, 04, 05, or 07) the three stains cluster closely, while for others (e.g., 01) the different stains are further apart. Inspecting these cases, we noticed that the tissue was torn and the section truncated, clearly impacting the matching between the slides negatively (Figure 10).


Discussion
Having features that capture the textural/morphological aspects independent of the staining would allow learning models across different immune stainings. Such models can be applied to image registration or image retrieval problems as well as to more explorative scenarios such as characterization of tissue heterogeneity. In the proof-of-concept applications we presented here we used pathology sections from colon cancer cases. However, the method is by no means bound to any specific pathology.
We trained an autoencoder to produce a disentangled representation in the latent space with the purpose of separating the colour/stain information from the structure information. The obtained representation was tested in scenarios inspired by real-world applications: cross-stain registration of images and comparison of intra-tumour heterogeneity. We stress the fact that these are proof-of-concept analyses with the purpose of studying the possibility of achieving stain invariant feature extraction.
In the first scenario, the results indicate that the newly proposed approach is able to extract relevant information for comparing patches across stains, with performance similar to the mutual information distance. The novel method has several advantages over mutual information: the use of probability-based similarity measures (SKLD, BD) makes the approach invariant to identical perturbations of the compared data; the model uses data-driven features for similarity estimation, explicitly separating feature selection from similarity estimation; and the inference step is computationally efficient.
In the second scenario, we used the same latent representation as before for learning a visual codebook. Since the latent representation is parametrized as a multivariate Gaussian distribution, the metric used for clustering such representations must be chosen carefully. In our case, we used the SKLD with diagonal Gaussians and a clustering algorithm able to accommodate non-Euclidean distances (DBSCAN). The resulting codebook was used for summarizing the images in terms of histograms (of codewords), as simple descriptors of intra-tumoural heterogeneity. Then, using hierarchical clustering, we showed that most of the series (consecutive sections from the same block) cluster together. There were also some cases which appeared to be distant in the hierarchical clustering; technical issues, such as tissue tearing and fragmentation, certainly contribute to these results (Figure 10). A further application of this approach, not detailed here, is semantic image retrieval: retrieving slide images with a similar tissue morphology mix, across different immune stains. At its core, this application could rely on an image summarization based on multi-stain visual dictionaries, an extended version of the one developed here.
The latent representation was learned from a set of images in an application-agnostic manner, with no tuning for the downstream usage. In a more realistic setting, the representation and the subsequent distance (or visual codebook) would need to be adapted and optimized. Comparing pathology sections across stains is feasible with the approach described above; however, better summarization would be needed for capturing the tissue heterogeneity. The results presented here were obtained on a rather limited data set. Clearly, scaling up both the training set and the test set(s) is needed for achieving high confidence in the novel representation.

Conclusions
We proposed a disentangled autoencoder for learning a cross-stain representation of whole slide images. The learnt latent representation was used in two realistic applications and its performance was deemed satisfactory and promising for a basic model. There are a number of alternative uses for the presented approach that were not discussed here, e.g., virtual staining of the slides (altering the latent representation in the subspace corresponding to the stain) or stain normalization (with structure separated from stain, a standard stain reconstruction could be obtained). It is also clear that, for optimal performance, the architecture (and the models provided) need further fine-tuning to the intended application.