Article

Interpretable Deep Prototype-Based Neural Networks: Can a 1 Look like a 0?

by Esteban García-Cuesta *, Daniel Manrique and Radu Constantin Ionescu

Departamento de Inteligencia Artificial, ETSIINF, Universidad Politécnica de Madrid, 28040 Madrid, Spain

* Author to whom correspondence should be addressed.
Electronics 2025, 14(18), 3584; https://doi.org/10.3390/electronics14183584
Submission received: 25 July 2025 / Revised: 3 September 2025 / Accepted: 5 September 2025 / Published: 10 September 2025
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)

Abstract

Prototype-Based Networks (PBNs) are inherently interpretable architectures that facilitate understanding of model outputs by analyzing the activation of specific neurons—referred to as prototypes—during the forward pass. The learned prototypes serve as transformations of the input space into a latent representation that more effectively encapsulates the main characteristics shared across data samples, thereby enhancing classification performance. Crucially, these prototypes can be decoded and projected back into the original input space, providing direct interpretability of the features learned by the network. While this characteristic marks a meaningful advancement toward the realization of fully interpretable artificial intelligence systems, our findings reveal that prototype representations can be deliberately or inadvertently manipulated without compromising the superficial appearance of explainability. In this study, we conduct a series of empirical investigations that demonstrate this phenomenon, framing it as a structural paradox potentially intrinsic to the architecture or its design, which may represent a significant robustness challenge for explainable AI methodologies.

1. Introduction

Model interpretability is a critical requirement in the development of machine learning models intended to represent aspects of real-world phenomena. Numerous techniques and studies have sought to enhance interpretability, primarily through post hoc analyses or by employing surrogate models that are inherently more interpretable. Despite these efforts, a fundamental challenge persists: establishing a clear correspondence between the knowledge acquired by the model and observable elements of reality [1]. In an effort to address this issue, a Prototype Deep Neural Network (PDNN) was introduced in [2]. Prototype learning is a subclass of intrinsically interpretable methods, derived from the classical paradigm of case-based reasoning known as prototype classification. In this context, a prototype is not constrained to a single observation from the training set, but rather may constitute a generalized representation formed from a combination of multiple samples in the learned latent space [3]. The prototype learning framework typically incorporates an autoencoder in conjunction with a prototype classifier layer. The autoencoder generates a compressed latent representation of the input data, while the prototype layer identifies representative points within this space that both resemble specific inputs and are indicative of particular classes. Each learned prototype can subsequently be visualized through the decoder, enabling interpretability of the learned features, and can be traced back to the output softmax layer, thereby providing insight into class associations. Moreover, this architectural design supports a form of case-based reasoning by identifying the training sample(s) most similar to a given prototype, further enhancing the transparency and accountability of the model’s decision-making process.
In this context, while the interpretability of the method has received considerable attention, the reliability and validity of the learned prototypes have often been overlooked. For relatively simple datasets, such as handwritten digits [4], prototype validation is typically performed by reconstructing and visualizing the prototypes in the input space, followed by a qualitative assessment of whether the prototypes visually resemble the intended classes. However, this raises a critical question: can we confidently assert that a prototype which appears to represent, for instance, the digit “1”, is indeed interpreted as such by the Deep Neural Network (DNN)? In this study, we address this question through a series of experiments, including the development of a novel Deep Prototype-Based Network. Our results demonstrate that visual similarity does not necessarily align with the model’s internal representation, highlighting a potential gap between human interpretability and model-driven classification.

2. State of the Art

Deep networks can reason through learned prototypes while remaining end-to-end trainable [2]. Prototype-based neural networks offer a hybrid approach that combines the power of deep learning with the interpretability of case-based reasoning. The architecture of [2] integrates a convolutional autoencoder with a prototype-based classification layer, allowing the model to explain its predictions through comparisons with prototypical training examples. Building upon this foundational work, ProtoPNet was proposed in [5]; that model popularized the "this-looks-like-that" paradigm through patch-to-prototype visualizations, which has since been extended to include rejection capabilities in Semi-ProtoPNet [6] and Ps-ProtoPNet [7] and to combine models in Comb-ProtoPNet [8]. Its training methodology employs techniques such as prototype projection and class-specific weight regularization, achieving competitive performance across fine-grained classification tasks involving bird, car, and dermatological image datasets, while preserving interpretable, image-level justifications.
Two primary strategies have been proposed to overcome the representational limitations inherent in single-layer prototype-based models. First, ProtoTree [9] introduces a differentiable decision tree architecture in which the internal nodes correspond to learned prototypes. This structure enables an input to traverse only $O(\log K)$ nodes, producing compact and interpretable rule chains. Second, explicit prototype hierarchies [10] organize low-level visual parts into increasingly abstract semantic concepts, facilitating multi-granular reasoning and enhancing robustness to out-of-distribution inputs.
Hybrid approaches have also been developed to enable reasoning beyond raw pixel representations. Concept Bottleneck Models (CBMs) [11] structure the prediction pipeline around human-interpretable attributes, allowing users to intervene and correct misclassified concepts at inference time. In a complementary direction, the Deep k-Nearest Neighbours (DkNN) algorithm [12] enhances arbitrary classifiers by incorporating layer-wise nearest neighbor retrieval, thereby quantifying model uncertainty through non-conformity scores. Both methods underscore the utility of leveraging intermediate representations, particularly similarity structures within hidden layers, without compromising predictive accuracy, a principle that underpins the multilayer prototype architecture proposed in this work. Additionally, Capsule Networks [13] implement dynamic routing mechanisms to capture part-to-whole relationships, while part-based R-CNNs [14] localize discriminative object regions.
The Deep Prototype Network (DPN) introduced in this work adopts a compositional reasoning framework akin to prior hierarchical or part-based models, but implements this structure through a deliberately layered architecture rather than relying on post hoc clustering of prototypes. This design choice is motivated by the principle that successive prototype layers should progressively capture increasingly abstract semantic concepts or larger spatial extents, thereby enabling richer part-to-whole reasoning directly within the network’s forward pass.

3. Deep Prototype-Based Network Architecture

Figure 1 illustrates the proposed Deep Prototype-Based Network (DPBN) architecture, which extends the notation, cost terms, and design philosophy of [2] but stacks several prototype layers in a feedforward chain rather than using a single one. Each layer therefore has an architectural complexity similar to that of [2] (including the convolutional neural network). The proposed DPBN performs a back-and-forth data flow that couples representational capacity with prototype-based interpretability.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathcal{X} \times \{1, \dots, K\}$, with $\mathcal{X} \subseteq \mathbb{R}^p$, denote a finite training set sampled from an underlying distribution $P_{XY}$. All spaces are Euclidean and, therefore, equipped with the canonical inner product $\langle \cdot , \cdot \rangle$ and the induced norm $\|\cdot\|_2$. Uppercase calligraphic symbols denote sets or spaces, and superscripts in parentheses index the depth $\ell \in \{1, \dots, L\}$. The set of all network parameters is denoted as $\Theta = \big\{ \theta_f^{(1)}, \{\theta_e^{(\ell)}, \theta_d^{(\ell)}\}_{\ell=2}^{L}, \{P_j^{(\ell)}\}, W \big\}$.

3.1. The Forward Encoding Flow

Figure 1 shows that the forward flow begins in the input space, followed by the first encoder, which maps the input data to a latent space. Then, the transformation network f comprises L-1 stacked triples consisting of a prototype layer, the calculation of a similarity vector, and an encoder for the following layer. Finally, the prototype classifier network h takes as input the last prototype layer (L), calculates the similarity vector, and passes the results through a linear mapping with a subsequent softmax layer that produces the estimated classification distribution of an input.
The input space domain $\mathcal{X}$ is a subset of a $p$-dimensional real vector space. A sample is denoted by $x \in \mathcal{X}$. No further structure is assumed, except for the differentiability of all subsequent mappings defined on $\mathcal{X}$.
The first layer of the DPBN is the first encoder, which projects input data onto a lower-dimensional manifold:
$$f^{(1)} : \mathcal{X} \to \mathcal{Z}_1, \qquad \mathcal{Z}_1 = \mathbb{R}^{q_1}, \quad q_1 \ll p, \qquad f^{(1)}(x) = z^{(1)}.$$
The mapping $f^{(1)}$ is usually implemented using a convolutional neural network that ends with a fully connected layer of width $q_1$. The parameters associated with this mapping are collectively denoted by $\theta_f^{(1)}$ and form part of the general parameter set.
The next layer is the first prototype layer $\mathcal{P}^{(1)}$,
$$\mathcal{P}^{(1)} = \big\{ P_j^{(1)} \big\}_{j=1}^{M_1} \subset \mathcal{Z}_1,$$
which comprises a finite set of prototypes. The value $M_1 \in \mathbb{N}$ is a hyperparameter fixed a priori, while the prototypes themselves are learned during training. Given a latent representation $z^{(1)}$, its similarity to every prototype is computed as the squared Euclidean distance, and the results are collected in the similarity vector
$$p^{(1)}\big(z^{(1)}\big) = \Big( d_1^{(1)}\big(z^{(1)}\big), \dots, d_{M_1}^{(1)}\big(z^{(1)}\big) \Big) \in \mathbb{R}^{M_1}, \qquad d_j^{(1)}\big(z^{(1)}\big) = \big\| z^{(1)} - P_j^{(1)} \big\|_2^2,$$
which defines $M_1$ differentiable functions $d_j^{(1)} : \mathcal{Z}_1 \to \mathbb{R}_+$, thereby enabling the application of gradient-based optimization methods without requiring any modifications.
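For illustration, the following one-function NumPy sketch computes this similarity vector; the snippet is ours, not the authors' code, and the sizes $q_1 = 15$ and $M_1 = 20$ are simply the values used later in the experiments of Section 4.

```python
# Squared-Euclidean similarity vector p^(1)(z) for a single latent code.
import numpy as np

def similarity_vector(z, prototypes):
    """z: (q1,) latent code; prototypes: (M1, q1) prototype matrix P^(1)."""
    return np.sum((prototypes - z[None, :]) ** 2, axis=1)   # squared Euclidean distances, shape (M1,)

rng = np.random.default_rng(0)
print(similarity_vector(rng.normal(size=15), rng.normal(size=(20, 15))).shape)   # (20,)
```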
The next layer is the second encoder. The similarity vector $p^{(1)}(z^{(1)})$ computed in the previous layer, which does not carry any positional ordering, is compressed by a learned distance encoder to obtain the second-layer representation
$$e^{(2)} : \mathbb{R}^{M_1} \to \mathcal{Z}_2, \qquad \mathcal{Z}_2 = \mathbb{R}^{q_2},$$
$$z^{(2)} = e^{(2)}\big( p^{(1)}(z^{(1)}) \big).$$
The function $e^{(2)}$ is implemented as a multilayer perceptron, with its parameters collectively denoted by $\theta_e^{(2)}$. Using function composition, the representation can be written as $z^{(2)} = \big( e^{(2)} \circ p^{(1)} \circ f^{(1)} \big)(x)$.
Similarly to the first layer, for every subsequent layer $\ell \in \{2, \dots, L\}$, there is an encoder that generates the latent space $\mathcal{Z}_\ell$, followed by a prototype set $\mathcal{P}^{(\ell)} = \{P_j^{(\ell)}\}_{j=1}^{M_\ell} \subset \mathcal{Z}_\ell$, then a similarity vector $p^{(\ell)}(z^{(\ell)})$, and finally an encoder for the next layer $\ell + 1$:
$$p^{(\ell)}\big(z^{(\ell)}\big) = \Big( d_1^{(\ell)}\big(z^{(\ell)}\big), \dots, d_{M_\ell}^{(\ell)}\big(z^{(\ell)}\big) \Big) \in \mathbb{R}^{M_\ell},$$
$$e^{(\ell+1)} : \mathbb{R}^{M_\ell} \to \mathcal{Z}_{\ell+1}, \qquad z^{(\ell+1)} = e^{(\ell+1)}\big( p^{(\ell)}(z^{(\ell)}) \big).$$
We denote by $\Phi^{(\ell)}(x) = \big( e^{(\ell)} \circ p^{(\ell-1)} \circ e^{(\ell-1)} \circ \cdots \circ p^{(1)} \circ f^{(1)} \big)(x) = z^{(\ell)}$ the composite mapping that transforms $x$ into $z^{(\ell)}$.
To provide the final classification decision after passing through all layers of the DPBN, the final similarity vector $p^{(L)}(z^{(L)}) \in \mathbb{R}^{M_L}$ is fed into a linear map $W \in \mathbb{R}^{K \times M_L}$ to obtain
$$\eta(x) = W\, p^{(L)}\big(z^{(L)}\big) \in \mathbb{R}^{K}.$$
The subsequent softmax activation function produces the estimated class distribution $\hat{y}(x)$, where $\hat{c}(x) = \arg\max_k \hat{y}_k(x)$ is the predicted class.
A summary of the forward flow from the input vector x to the predicted class distribution is as follows:
$$x \xrightarrow{\,f^{(1)}\,} z^{(1)} \xrightarrow{\,\mathcal{P}^{(1)}\,} p^{(1)}(z^{(1)}) \xrightarrow{\,e^{(2)}\,} z^{(2)} \xrightarrow{\,\mathcal{P}^{(2)}\,} \cdots \xrightarrow{\,e^{(L)}\,} z^{(L)} \xrightarrow{\,\mathcal{P}^{(L)}\,} p^{(L)}(z^{(L)}) \xrightarrow{\,W + \mathrm{softmax}\,} \hat{y}.$$
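To make this chain concrete, the sketch below is a simplified PyTorch rendering of the forward (classification) path only; it is our own illustration rather than the authors' implementation, the fully connected stand-in for the convolutional encoder $f^{(1)}$ and the layer sizes are assumptions, and the decoders needed for the backward flow are omitted.

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    def __init__(self, n_prototypes, dim):
        super().__init__()
        self.prototypes = nn.Parameter(torch.rand(n_prototypes, dim))  # learned prototypes

    def forward(self, z):                                 # z: (B, dim)
        return torch.cdist(z, self.prototypes) ** 2       # similarity vector: squared distances, (B, M)

class DPBNForward(nn.Module):
    def __init__(self, sizes=((15, 20), (10, 15), (10, 15), (10, 15)), n_classes=10):
        super().__init__()
        q1, m1 = sizes[0]
        self.f1 = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, q1))  # stand-in for f^(1)
        self.protos = nn.ModuleList([PrototypeLayer(m1, q1)])
        self.encoders = nn.ModuleList()
        prev_m = m1
        for q, m in sizes[1:]:                            # distance encoders e^(l) and prototype sets P^(l)
            self.encoders.append(nn.Sequential(nn.Linear(prev_m, q), nn.ReLU(), nn.Linear(q, q)))
            self.protos.append(PrototypeLayer(m, q))
            prev_m = m
        self.W = nn.Linear(sizes[-1][1], n_classes, bias=False)        # linear map W

    def forward(self, x):
        z = self.f1(x)                                    # z^(1)
        p = self.protos[0](z)                             # p^(1)(z^(1))
        for enc, proto in zip(self.encoders, self.protos[1:]):
            z = enc(p)                                    # z^(l+1) = e^(l+1)(p^(l))
            p = proto(z)                                  # p^(l+1)(z^(l+1))
        return torch.softmax(self.W(p), dim=-1)           # estimated class distribution y_hat

y_hat = DPBNForward()(torch.rand(8, 1, 28, 28))
print(y_hat.shape)                                        # torch.Size([8, 10])
```

In practice, $f^{(1)}$ would be the convolutional encoder described in Section 4.2; only the classification path is shown here.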

3.2. The Backward Decoding Flow

What distinguishes Prototype-Based Networks, and in particular the proposed DPBN, from regular neural networks is the presence of prototypes that, through their reconstruction back into the input space, offer visual explanations for their predictions. However, this visualization is not straightforward in the case of the DPBN. The procedure to visualize deeper prototypes introduces a novel approach called the hypersphere intersection (HS-Int) algorithm, which transforms a reconstructed similarity vector at depth $\ell$ (denoted $RS^{(\ell)}$ in Figure 1, or $\tilde{p}^{(\ell)}$) into a reconstructed code $\tilde{z}^{(\ell)}$.
Figure 1 also illustrates the backward decoding flow, which reconstructs the deepest code $z^{(L)}$ back into the input space. This reconstruction process performs $L-1$ decoding and hypersphere-intersection pairs until the first reconstructed code $\tilde{z}^{(1)}$ is obtained. Finally, $\tilde{z}^{(1)}$ is decoded back into $\tilde{x}$.
The hypersphere intersection algorithm is based on previous results from algebraic and geodesy/GPS positioning [15,16,17], as well as nonlinear least-squares optimization (Levenberg–Marquardt) [18,19] and the Moore–Penrose generalized inverse matrix theory [20,21].
For any depth $\ell \geq 2$, the decoder produces the radius vector
$$\hat{p}^{(\ell-1)} = \Big( \hat{\rho}_1^{(\ell-1)}, \dots, \hat{\rho}_{M_{\ell-1}}^{(\ell-1)} \Big), \qquad \hat{\rho}_j^{(\ell-1)} \geq 0,$$
which, together with the prototype set $\mathcal{P}^{(\ell-1)} = \{P_j^{(\ell-1)}\}_{j=1}^{M_{\ell-1}} \subset \mathcal{Z}_{\ell-1}$, defines $M_{\ell-1}$ hyperspheres in the ambient space $\mathcal{Z}_{\ell-1}$ as follows:
$$S_j = \big\{ z \in \mathcal{Z}_{\ell-1} : \| z - P_j \|_2^2 = \hat{\rho}_j^{\,2} \big\},$$
where $P_j = P_j^{(\ell-1)}$ and $\hat{\rho}_j^{\,2} = \big(\hat{\rho}_j^{(\ell-1)}\big)^2$ to simplify the notation. Thus, the objective becomes to recover a point $z \in \bigcap_{j=1}^{M_{\ell-1}} S_j$ or, if the intersection is empty, the solution to the minimization problem
$$z^{\ast} = \arg\min_{z \in \mathcal{Z}_{\ell-1}} F(z), \qquad F(z) = \sum_{j=1}^{M_{\ell-1}} \Big( \| z - P_j \|_2^2 - \hat{\rho}_j^{\,2} \Big)^2.$$
We draw on previous studies in this field to design the algorithm in three steps. The first is to perform linear reduction by applying Bancroft’s algebraic difference trick [15] to obtain the linear system:
$$A z = b, \qquad A \in \mathbb{R}^{(M_{\ell-1} - 1) \times q_{\ell-1}}, \qquad b \in \mathbb{R}^{M_{\ell-1} - 1}.$$
Its least-squares solution is $z_p = A^{+} b$, where $A^{+}$ is the Moore–Penrose pseudoinverse. Every other solution differs from $z_p$ by an element of the null space of $A$:
$$z = z_p + N \alpha, \qquad N \in \mathbb{R}^{q_{\ell-1} \times d}, \quad \alpha \in \mathbb{R}^{d}, \quad d = q_{\ell-1} - \operatorname{rank}(A),$$
where $\operatorname{rank}(A)$ is the rank of matrix $A$ and $N$ stacks an orthonormal basis of $\ker A$, the kernel of matrix $A$, so $d$ equals the number of remaining degrees of freedom and $\alpha$ holds their coordinates. If $d = 0$, the linear part has isolated at most one candidate; otherwise, the quadratic constraint of the next step must fix $\alpha$ (or show that there is no exact intersection).
The second step inserts the above parameterization $z = z_p + N\alpha$ into the first sphere equation $\| z - P_1 \|_2^2 = \hat{\rho}_1^{\,2}$, which yields
$$q(\alpha) = \| z_p + N\alpha - P_1 \|_2^2 - \hat{\rho}_1^{\,2} = 0.$$
If $d = 0$, no free parameters remain and $q$ reduces to a constant: if that constant is zero, the unique linear candidate $z_p$ is an exact intersection; otherwise, no exact solution exists.
If $d = 1$, then $q(\alpha)$ is an ordinary quadratic in one variable. Its roots have closed forms, tabulated in [17], producing up to two exact points, which we collect in $\mathcal{Z}_{\text{exact}}$.
Lastly, if $d \geq 2$, the same equation defines at most a $(d-1)$-dimensional quadric in the $\alpha$ space. Therefore, we launch a Levenberg–Marquardt search [19] from $\alpha = 0$ to find a feasible $\alpha$ (or certify that none exists), following the least-squares recipe in [17].
Finally, any $\alpha$ that satisfies $q(\alpha) = 0$ is mapped through $z = z_p + N\alpha$ and added to $\mathcal{Z}_{\text{exact}}$ in the third step. However, if no exact intersection is found ($\mathcal{Z}_{\text{exact}} = \emptyset$), we minimize the nonlinear least-squares cost
$$F(z) = \frac{1}{2} \sum_{j=1}^{M_{\ell-1}} \Big( \| z - P_j \|_2^2 - \hat{\rho}_j^{\,2} \Big)^2.$$
The optimization process uses the Levenberg–Marquardt-damped Gauss–Newton scheme [18,19], which smoothly interpolates between gradient descent and the Gauss–Newton step via an adaptive damping factor. After convergence, we obtain $z^{\ast}$. Local uniqueness is verified with the Jacobian rank criterion of Abel and Chaffee [16]. The algorithm finally returns $\tilde{z}^{(\ell-1)} = z^{\ast}$ together with the uniqueness flag.
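A compact NumPy/SciPy sketch of this three-step procedure follows; it is our illustrative reconstruction under the assumptions stated above, not the authors' implementation, it uses scipy.optimize.least_squares as the Levenberg–Marquardt solver, and it omits the Abel and Chaffee uniqueness certification.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import least_squares

def hs_int(prototypes, radii, tol=1e-6):
    """prototypes: (M, q) sphere centers; radii: (M,) nonnegative radii."""
    P, r2 = np.asarray(prototypes, float), np.asarray(radii, float) ** 2

    # Step 1: subtract the first sphere equation from the others -> linear system A z = b.
    A = 2.0 * (P[1:] - P[0])
    b = np.sum(P[1:] ** 2, axis=1) - np.sum(P[0] ** 2) - r2[1:] + r2[0]
    z_p = np.linalg.pinv(A) @ b            # least-squares particular solution
    N = null_space(A)                      # orthonormal basis of ker(A)
    d = N.shape[1]                         # remaining degrees of freedom

    # Step 2: enforce the first sphere constraint q(alpha) = 0.
    w = z_p - P[0]
    if d == 0:
        if abs(w @ w - r2[0]) < tol:       # q is a constant: exact iff it vanishes
            return z_p, True
    elif d == 1:
        v = N[:, 0]                        # ordinary quadratic in one variable
        roots = np.roots([v @ v, 2.0 * (v @ w), w @ w - r2[0]])
        real = roots[np.isreal(roots)].real
        if real.size:                      # up to two exact points; keep one
            return z_p + v * real[0], True
    else:                                  # d >= 2: Levenberg-Marquardt search in alpha space
        res_a = lambda a: np.sum((z_p + N @ a - P) ** 2, axis=1) - r2
        sol = least_squares(res_a, x0=np.zeros(d), method="lm")
        if np.max(np.abs(sol.fun)) < tol:
            return z_p + N @ sol.x, True

    # Step 3: no exact intersection -> minimize the summed squared residuals F(z).
    resid = lambda z: np.sum((z[None, :] - P) ** 2, axis=1) - r2
    sol = least_squares(resid, x0=z_p, method="lm")
    return sol.x, False

# Toy check: recover a point lying on all spheres around random prototypes.
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 10))              # e.g., 20 prototypes in a 10-dimensional latent space
z_true = rng.normal(size=10)
z_rec, exact = hs_int(P, np.linalg.norm(P - z_true, axis=1))
print(exact, np.allclose(z_rec, z_true, atol=1e-4))        # True True
```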
Once the HS-Int algorithm has been described, we can define the decoding cascade, which starts from $z^{(L)}$, applies a decoding step
$$\hat{p}^{(\ell-1)} = d^{(\ell)}\big( z^{(\ell)} \big),$$
and then applies the HS-Int algorithm
$$\tilde{z}^{(\ell-1)} = \text{HS-Int}\big( \mathcal{P}^{(\ell-1)}, \hat{p}^{(\ell-1)} \big)$$
for $\ell = L, L-1, \dots, 2$. Note that the decoding step transforms a code from a latent space into a reconstructed similarity vector. The HS-Int algorithm then takes the reconstructed similarity vector and the set of prototypes from the corresponding prototype layer to generate the reconstructed code for the previous layer. Finally, $\tilde{x} = d^{(1)}\big( \tilde{z}^{(1)} \big) \in \mathcal{X}$ is the global reconstruction against which the reconstruction term in (2) is measured. This cascade enforces explicit geometric consistency between successive latent spaces, endowing every latent point with a deterministic pre-image in the input domain.
The reconstruction pathway can be summarized as
$$z^{(L)} \xrightarrow{\,d^{(L)}\,} \hat{p}^{(L-1)} \xrightarrow{\,\text{HS-Int}\,} \tilde{z}^{(L-1)} \xrightarrow{\,d^{(L-1)}\,} \cdots \longrightarrow \tilde{z}^{(1)} \xrightarrow{\,d^{(1)}\,} \tilde{x}.$$
The proposed DPBN provides expressive capacity and interpretability. Every depth $\ell \geq 2$ applies the composition $e^{(\ell)} \circ p^{(\ell-1)}$, which is Lipschitz continuous when $e^{(\ell)}$ employs bounded activation functions, endowing the network with a hierarchical, piecewise-smooth warping of the latent metric. Yet interpretability is preserved: each prototype $P_j^{(\ell)}$ is deterministically mapped to the input domain by the reconstruction chain, which yields a visual or otherwise perceptible exemplar representing that prototype at its innate semantic granularity.

3.3. Guided Prototype Learning

Together with the pipelines for classification and decoding, it is crucial to define the joint learning objective of the network architecture.
Let $\mathcal{L}_{\mathrm{CE}}(\hat{y}, y) = - y^{\top} \log \hat{y}$ be the cross-entropy between the prediction $\hat{y}$ and the ground truth $y$. With $\{ z_i^{(\ell)} \}_{i=1}^{B}$ referring to the latent codes of a mini-batch of size $B$, we define the layer-wise prototype coverages as
$$R_1^{(\ell)} = \frac{1}{M_\ell} \sum_{j=1}^{M_\ell} \min_{i = 1, \dots, B} \big\| P_j^{(\ell)} - z_i^{(\ell)} \big\|_2^2, \qquad R_2^{(\ell)} = \frac{1}{B} \sum_{i=1}^{B} \min_{j = 1, \dots, M_\ell} \big\| z_i^{(\ell)} - P_j^{(\ell)} \big\|_2^2.$$
Similarly, with a mini-batch of size $B$, we establish the loss functions of the autoencoders for the images and the distances as
$$\mathcal{L}_{\mathrm{AE}}^{\mathrm{img}} = \frac{1}{B} \sum_{i=1}^{B} \big\| \tilde{x}_i - x_i \big\|_2^2, \qquad \mathcal{L}_{\mathrm{AE}}^{\mathrm{dist}} = \frac{1}{B} \sum_{i=1}^{B} \sum_{\ell=2}^{L} \big\| \tilde{d}_i^{(\ell)} - d_i^{(\ell)} \big\|_2^2.$$
As such, for positive scalars acting as regularization weights $\lambda_{RI}$, $\lambda_{RD}$, $\lambda_{R_1}^{(\ell)}$, and $\lambda_{R_2}^{(\ell)}$, the total loss reads
$$\mathcal{L} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{\mathrm{CE}}\big( \hat{y}(x_i), y_i \big) + \lambda_{RI}\, \mathcal{L}_{\mathrm{AE}}^{\mathrm{img}} + \lambda_{RD}\, \mathcal{L}_{\mathrm{AE}}^{\mathrm{dist}} + \sum_{\ell=1}^{L} \Big( \lambda_{R_1}^{(\ell)} R_1^{(\ell)} + \lambda_{R_2}^{(\ell)} R_2^{(\ell)} \Big).$$
All model parameters $\Theta$ are trained by stochastic gradient descent on (2), with gradients obtained via automatic differentiation through the analytic Jacobian of HS-Int whenever the exact intersection is unique; otherwise, the sub-gradient of the best-fit residual is used.
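As a hedged illustration, the snippet below assembles these terms into a single objective in PyTorch; the tensor layout, helper names, and the toy call are our assumptions and not the authors' training code.

```python
import torch
import torch.nn.functional as F

def coverage_terms(prototypes, z):
    """R1: every prototype close to some latent code; R2: every code close to some prototype."""
    d2 = torch.cdist(z, prototypes) ** 2               # (B, M) squared distances
    return d2.min(dim=0).values.mean(), d2.min(dim=1).values.mean()

def total_loss(logits, y, x_rec, x, dist_recs, dists, protos_per_layer, zs,
               lam_ri=0.5, lam_rd=0.5, lam_r1=0.5, lam_r2=0.5):
    loss = F.cross_entropy(logits, y)                                   # classification term
    loss = loss + lam_ri * (x_rec - x).flatten(1).pow(2).sum(1).mean()  # image autoencoder
    for d_rec, d in zip(dist_recs, dists):                              # distance autoencoders, l >= 2
        loss = loss + lam_rd * (d_rec - d).pow(2).sum(1).mean()
    for protos, z in zip(protos_per_layer, zs):                         # coverage at every depth
        r1, r2 = coverage_terms(protos, z)
        loss = loss + lam_r1 * r1 + lam_r2 * r2
    return loss

# Toy call with random tensors and a single prototype layer.
B, K, q, M = 8, 10, 15, 20
loss = total_loss(torch.randn(B, K), torch.randint(0, K, (B,)),
                  torch.rand(B, 1, 28, 28), torch.rand(B, 1, 28, 28),
                  [], [], [torch.rand(M, q)], [torch.randn(B, q)])
print(float(loss))
```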

4. Experiments and Results

In this section, we conduct a systematic evaluation of the proposed Deep Prototype-Based Network, complemented by a series of targeted architectural variants. This analysis aims to demonstrate the model’s capabilities while also addressing a key challenge: the tension between visually interpretable prototype reasoning and the internal mechanisms of class assignment within the network.
We design three experiments. The first is a baseline experiment to assess performance on the MNIST dataset [4] and to examine the prototype representations obtained from the proposed DPBN architecture (see Figure 1). The second experiment is a model distillation study in which we investigate whether the prototypes learned by the deepest layer of a four-layer DPBN can be effectively transferred to a significantly shallower model, without compromising either predictive performance or interpretability. The third experiment addresses a binary classification task designed to demonstrate that the network can produce accurate predictions and visualizations consistent with the class "0", while its internal explanatory mechanisms align more closely with class "1".
For each experimental configuration, we report both classification accuracy and a novel validation metric introduced in this study—the Normalized Negative Entropy (NNE) score. Quantitative findings are further supported by qualitative analyses, comprising visualizations of reconstructed prototypes and learned weight matrices. Collectively, the results elucidate the impact of each architectural choice on the reliability, interpretability, and overall performance of DPBNs.
All the experiments are conducted using the MNIST [4] benchmark dataset. The original training set, comprising 60,000 images, is randomly partitioned into 90% for training and 10% for validation. The official test set of 10,000 images is held out for evaluation. All images are resized to 28 × 28 pixels and normalized to the [0, 1] range using min-max scaling.

4.1. NNE Score

Prototype-Based Networks typically conclude with a linear classifier, where the weight matrix W R K × M L maps prototype–sample distances to class logits. To assess the interpretability of this final classification layer, we introduce the NNE score. This metric interprets the prototype-to-class weight matrix W as an information-theoretic channel. After negating W, a row-wise softmax is applied to obtain a class probability distribution for each prototype. The Shannon entropy of each distribution is then computed, inverted, normalized by log K , and averaged across all prototypes. The resulting NNE score lies within the interval [ 0 , 1 ] , attaining a maximum value of 1 only when each prototype exhibits a one-hot (i.e., perfectly class-specific) voting pattern.
In contrast to geometric alignment measures—such as the B-Cos transform [22], which enhances the fidelity of saliency maps by maximizing cosine similarity between inputs and classifier weights, or the support-vector-based alignment strategy employed by WASUP [23]—the NNE metric exclusively captures the distributional sharpness of the classification head, providing a complementary perspective on model interpretability.
Because large negative distances should yield a high vote, we first negate the weight matrix and apply a row-wise softmax:
$$P = \sigma(-W) \in [0, 1]^{M_L \times K}, \qquad \sum_{k=1}^{K} P_{ik} = 1 \;\; \forall i.$$
Each row p i can now be read as a categorical distribution over classes for prototype i. As such, for a row p i , we can compute its Shannon entropy
$$H(p_i) = - \sum_{k=1}^{K} p_{ik} \log p_{ik},$$
and normalize it by the maximum possible entropy log K to obtain a confidence score
$$\tilde{H}(p_i) = 1 - \frac{H(p_i)}{\log K}, \qquad \tilde{H}(p_i) \in [0, 1].$$
$\tilde{H}(p_i) = 1$ when $p_i$ is a one-hot vector (the prototype decisively represents one class), and $\tilde{H}(p_i) = 0$ when $p_i$ is uniform (the prototype conveys no class information).
Finally, we average over all $M_L$ prototypes:
$$\mathrm{NNE}(W) = \frac{1}{M_L} \sum_{i=1}^{M_L} \tilde{H}(p_i).$$
$\mathrm{NNE} \in [0, 1]$ is therefore higher for more interpretable weight matrices. We validated the metric with the two extreme cases shown in Table 1.
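A minimal NumPy rendering of the score follows (our own implementation of the formulas above, not the authors' code); it takes $W$ in prototype-by-class orientation and reproduces the two extreme cases of Table 1, using a scaled identity so that the softmax rows are effectively one-hot.

```python
import numpy as np

def nne(W):
    """W: (M_L, K) prototype-to-class weights, one row per prototype."""
    logits = -W                                                # negate: small distance -> strong vote
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                          # row-wise softmax
    H = -np.sum(np.where(P > 0, P * np.log(P), 0.0), axis=1)   # Shannon entropy per prototype
    return float(np.mean(1.0 - H / np.log(W.shape[1])))        # normalize by log K, then average

K = 10
print(nne(-1000.0 * np.eye(K)))    # near one-hot voting rows -> approx. 1.0
print(nne(np.zeros((K, K))))       # uniform rows             -> approx. 0.0
```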
The NNE score complements structural sparsity metrics such as the global/local size and Hoyer–Square norms in ProtoS-ViT [24] and the "transparent-head complexity" of Shallow-ProtoPNet [25] by quantifying decisiveness even when the head is dense. Whereas neuron-level Class-Selectivity Indices normalize entropy over activations [26], NNE is the first to apply an entropy normalization directly to prototype voting weights, making it agnostic to input distribution and layer depth. Recent analyses of classifier weight spectra, enabled by bilinear multilayer perceptrons (MLPs) [27], underscore the value of interpreting the classification head as an information-theoretic channel. In this context, the NNE score provides a concise, task-size-invariant metric that aligns with and supports this perspective.
The proposed metric will be employed throughout this work to evaluate the quality of the classification weight matrix across all experiments, serving as a means to assess and compare the reliability and class specificity of the learned prototypes.

4.2. Experiment 1: Baseline Architecture

The objective of this experiment is twofold: to evaluate the performance of the proposed architecture on the MNIST dataset and to analyze the structure of the resulting prototype representations. As an initial step, we consider the canonical DPBN, which closely follows the configuration introduced by [2], with minor modifications to facilitate subsequent ablation studies. The input pipeline preserves the original 28 × 28 luminance pixel values of the MNIST digits, thereby avoiding information loss during preprocessing.
Within the network, layer 1 comprises a compact convolutional encoder–decoder module that projects each input image into a 40-dimensional feature vector and reconstructs it back into pixel space. The encoder employs a fixed stride pattern of (2, 2, 2, 2) and a sequence of convolutional filters (32, 32, 32, 10), progressively reducing the spatial resolution from 28 × 28 to 2 × 2 over four downsampling stages. This architectural design ensures that each spatial location in the final feature map corresponds to a roughly square receptive field in the input image, a property leveraged in later experiments to interpret prototype activations in a spatially meaningful manner.
The flattened feature map is subsequently projected into a latent representation of dimension $q_1 = 15$ via a fully connected layer. The choice of $q_1 < 40$ is deliberate, as it constrains the image autoencoder to learn a non-trivial, low-dimensional manifold rather than approximating the identity mapping. This serves as an implicit regularization mechanism for the initial prototype search space. Within this space, $M_1 = 20$ prototypes are allocated, resulting in a prototype-to-latent ratio of $M_1 / q_1 \approx 1.33$. Empirical observations from preliminary experiments indicated that this ratio is sufficient to support a near one-to-one correspondence between digit classes and prototypes, while avoiding the introduction of redundant representations.
The subsequent layers (layers 2 through 4) share an identical structural design, differing from the initial layer primarily in that their encoders are implemented as MLPs operating on similarity vectors rather than raw image data. Specifically, each distance vector produced by layer $\ell$ is projected into a latent space of dimension $q_{\ell+1} = 10$, then decoded back to its original dimensionality $M_\ell$, and subsequently passed to the next prototype layer of equal size. In each layer, the prototypes $P_j^{(\ell)}$ are initialized from a uniform distribution $\mathcal{U}(0, 1)$ and jointly optimized alongside the corresponding autoencoder weights. Importantly, we enforce the condition $M_\ell \geq q_\ell$ at every depth to ensure that the resulting distance matrices are overcomplete, a necessary property for the hypersphere intersection solver used to reconstruct prototype representations across layers.
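A plausible PyTorch sketch of the layer-1 encoder under the stated configuration is shown below; the kernel size of 3 and padding of 1 are our assumptions, since only the strides and filter counts are specified in the text.

```python
import torch
import torch.nn as nn

encoder_l1 = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 28 -> 14
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 14 -> 7
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 7  -> 4
    nn.Conv2d(32, 10, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # 4  -> 2
    nn.Flatten(),                                                      # 10 * 2 * 2 = 40 features
    nn.Linear(40, 15),                                                 # latent code z^(1), q1 = 15
)

print(encoder_l1(torch.rand(8, 1, 28, 28)).shape)                      # torch.Size([8, 15])
```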
The regularization coefficients were set to $\lambda_{RI} = \lambda_{RD} = \lambda_{R_1} = \lambda_{R_2} = 0.5$ based on a coarse log-grid search aimed at balancing classification accuracy with reconstruction fidelity. Model training was performed using the Adam optimizer with a fixed learning rate of $1 \times 10^{-4}$. An early stopping criterion with a patience of 100 epochs was employed, terminating training when validation accuracy failed to improve for 100 consecutive epochs. On the MNIST dataset, the model converged to a test accuracy of 98.05% after approximately 850 epochs, thereby establishing a strong performance upper bound for the ablation studies that follow.
Interpretability, however, diminishes with increasing network depth. Figure 2 visually compares reconstructed prototypes from all four layers. Prototypes in the first layer exhibit clearly defined glyphs, each distinctly associated with a specific digit class. This observation confirms that the autoencoder, in conjunction with the $R_1$/$R_2$ regularization terms, effectively aligns prototypes with the underlying data manifold. By layer 2, the prototypes lose fine-grained details and become more abstract, although they still retain rough class-specific contours. In layers 3 and 4, interpretability degrades substantially: the reconstructed shapes converge into nearly indistinguishable forms, differentiated only by subtle intensity variations and exhibiting only weakly class-informative patterns. Despite this loss in human interpretability, the softmax-normalized weight matrix yields an NNE score of 0.26, indicating that the classifier maintains a coherent and class-specific internal representation even as the prototypes themselves become less visually discernible. This observation is particularly significant, as it indicates that the final-layer prototypes retain sufficient information to support accurate classification while simultaneously integrating multiple visual patterns. This behavior is a consequence of the decreasing dimensionality across successive layers, which compels the network to fuse lower-level features. Accordingly, the process can be interpreted as a hierarchical compositional mechanism that preserves discriminative attributes across layers.

4.3. Experiment 2: Model Distillation

In this experiment, we investigate whether the prototypes learned by the deepest layer of a four-layer DPBN could be effectively transferred to a significantly shallower model without compromising either predictive accuracy or interpretability. Specifically, the prototype layer extracted from the fourth layer of the reference DPBN was copied into a single-layer DPBN, where it remained fixed during training. Only the autoencoder and the classification head were subsequently trained. Qualitative comparisons between the original layer 4 prototypes and those reconstructed by the distilled model, along with the associated weight matrices W1 and W2, are shown in Figure 3 and Figure 4.
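The transfer step can be sketched as follows; this is a hypothetical helper whose attribute names follow the earlier forward-pass sketch (Section 3.1), and the two models must share the prototype dimensionality, so adapt it to the actual module layout.

```python
import torch

def transfer_and_freeze(deep_model, shallow_model, lr=1e-4):
    with torch.no_grad():
        # copy the deepest prototype matrix into the single-layer model
        shallow_model.protos[0].prototypes.copy_(deep_model.protos[-1].prototypes)
    shallow_model.protos[0].prototypes.requires_grad_(False)   # keep the transferred prototypes fixed
    trainable = [p for p in shallow_model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)                  # train the autoencoder and head only
```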
Despite the substantial reduction in network depth, the distilled model achieved an accuracy statistically indistinguishable from that of the baseline. Moreover, its weight matrix maintained coherence, with an NNE of 0.18. To assess whether this performance could be attributed to the information encoded in the transferred prototypes, we repeated the experiment using randomly initialized prototype vectors, which were likewise frozen during training. Contrary to our initial hypothesis, the network again attained comparable accuracy and produced sharply defined reconstructed prototypes. This suggests that the raw numerical values of the prototypes, in isolation, contribute little semantic content. Rather, interpretability appears to arise from the dynamics of the autoencoder and the spatial configuration of the prototypes within the latent space. Notably, the geometry of this space retains semantic structure: prototypes corresponding to visually similar digits, such as 8 and 9, are positioned closely, whereas those representing semantically distant digits, such as 2 and 9, are more widely separated. It is also worth noting that incorrect associations appear in the weight matrix (Figure 4), where prototype 12 (row 11) is assigned to class ‘0’, whereas it can be visually interpreted as an ‘8’ or ‘9’. We postulate that this latent space organization underlies the richer structure observed in the weight matrix when training begins with learned or trainable prototypes as opposed to random ones. These findings motivated the follow-up experiment described in the subsequent section, in which the network was trained end-to-end using randomly initialized, frozen prototypes. This design aimed to isolate the contribution of latent space geometry from the influence of prototype initialization.

4.4. Experiment 3: Can a 1 Look like a 0?

In this experiment, we investigate whether the conclusions drawn in the previous multiclass setting generalize to the simpler binary task of distinguishing the digit "0" from all other handwritten digits. To construct the training corpus, all images corresponding to class "0" from the MNIST training split were retained, while samples from the remaining classes were randomly downsampled so that the alternative class (denoted "other") contained an equal number of instances. The resulting dataset was thus balanced, with 80% allocated for training, 10% for validation, and the remaining 10% for testing. All target labels were transformed into binary form using the indicator function $y \mapsto \mathbb{1}[y \neq 0]$.
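A hedged sketch of this corpus construction is given below, assuming the torchvision MNIST loader and an arbitrary random seed; the actual preprocessing pipeline may differ.

```python
import numpy as np
from torchvision import datasets

mnist = datasets.MNIST(root="./data", train=True, download=True)
labels = mnist.targets.numpy()
y_bin = (labels != 0).astype(np.int64)                     # indicator: "0" -> class 0, any other digit -> class 1
rng = np.random.default_rng(0)

zeros = np.where(labels == 0)[0]
others = rng.choice(np.where(labels != 0)[0], size=len(zeros), replace=False)
idx = rng.permutation(np.concatenate([zeros, others]))     # balanced, shuffled index set

n = len(idx)
train_idx, val_idx, test_idx = np.split(idx, [int(0.8 * n), int(0.9 * n)])
print(len(train_idx), len(val_idx), len(test_idx), y_bin[train_idx].mean())   # roughly 0.5 positives
```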
We retained the deep prototype architecture of the previous experiments but specified a four-layer prototype hierarchy $(15, 20) \rightarrow (10, 15) \rightarrow (10, 15) \rightarrow (10, 4)$, where each pair denotes the latent dimensionality and the number of prototypes. The network was trained for 370 epochs with Adam (learning rate $10^{-4}$) and early stopping after 100 epochs without validation improvement. The highest validation accuracy achieved was 99.49%, with a final test accuracy of 99.48%. Despite this strong predictive performance, the interpretability metrics were less encouraging. Specifically, the softmax-normalized weight matrix of the final classification layer exhibited an NNE of 0.14, indicating a relatively flat association between prototypes and class labels. Furthermore, two of the four prototypes reconstructed from the deepest layer resembled noisy, amorphous "clouds" (see Figure 5) rather than well-defined digit patterns, and the weight matrix (see Figure 6) deviated substantially from a one-hot configuration. Regarding class purity, or faithfulness, Figure 6 shows a faithfulness of 1/4: all prototypes but one are assigned to class 1, and only prototype 2 is correctly assigned to class 0. This example is representative of the results achieved with other instances, which have been tested under human supervision.
To examine whether the latent geometry captured by the deep model could be made more explicit, we distilled the last-layer prototypes into a shallow DPBN comprising a single layer, following the same procedure as in Experiment 2. The prototypes were held fixed during training, while only the autoencoder and classification weights were updated. Training converged after 210 epochs, achieving an accuracy of 99.83%, marginally surpassing that of the original deep model. These results support the hypothesis that a well-structured set of prototypes can facilitate more efficient representation learning, even when the prototypes themselves lack clear visual interpretability.
A detailed examination of the distilled model revealed a noteworthy behavior. Prototype #4, whose reconstruction clearly depicts a digit "0" (see Figure 7), is associated by the weight matrix with the "others" class with a probability of 66%. An analysis of the twelve test images whose latent encodings are closest to this prototype (Figure 8) confirms that all correspond to well-formed instances of the digit "0", each correctly classified as class "0". Consequently, the prototype provides clear visual evidence of the digit "0" while functionally contributing to the decision for the opposing class. Because the mapping between prototypes and classes is learned independently of the visual characteristics of the prototypes, users lacking access to the weight matrix may be misled into believing that the model's predictions for the "others" class are based on archetypal non-zero digits, when in fact the decisive signal is anchored in a visually canonical zero.
The binary classification setting thus amplifies a concern previously suggested by the multiclass experiments: prototype visualizations alone are insufficient to establish model trustworthiness. A prototype may function as a decoy, simultaneously presenting a compelling, human-interpretable exemplar while driving predictions for an unexpected class. Given that the distance vectors input to the linear classification head are high-dimensional, manual verification of class–prototype associations is impractical, and summary metrics such as NNE reflect only global sharpness rather than individual inconsistencies.

5. Conclusions

In this study, we introduced a novel prototype-based neural network architecture incorporating multiple prototype layers. We evaluated its performance and interpretability using the MNIST dataset, demonstrating its representational capabilities and highlighting critical limitations concerning interpretability. The results demonstrate the network's capacity to integrate and represent patterns across successive prototype layers. However, we observed a progressive decline in the visual interpretability of prototypes at deeper layers, complicating human understanding of the learned representations. Importantly, this decline in visual clarity did not compromise the model's predictive performance, which remained consistently high, indicating that the final prototype layer retained strong discriminative capacity despite reduced interpretability. A distilled shallow model, constructed by reusing the final prototype layer from the original DPBN as a fixed component, achieved a slight improvement in classification accuracy (99.83%). However, it exposed a critical flaw: prototypes visually representing "0" contributed internally to decisions for the "1" class. Hence, visualizations of prototypes alone are insufficient for model transparency and reliable interpretability; alignment between visual and functional behavior is required. Accordingly, we argue that prototype-based explanations within the "this-looks-like-that" paradigm require additional safeguards, either through training constraints that enforce alignment between visual and functional semantics, through design adjustments (such as the number of prototypes), or via post hoc auditing tools capable of identifying contradictory associations. A comprehensive investigation into such reliability guarantees is reserved for future work.

Author Contributions

Conceptualization, E.G.-C.; methodology, E.G.-C. and D.M.; software, R.C.I.; validation, E.G.-C., D.M. and R.C.I.; formal analysis, E.G.-C. and D.M.; investigation, E.G.-C., R.C.I., and D.M.; data curation, R.C.I.; writing—original draft preparation, E.G.-C. and R.C.I.; writing—review and editing, E.G.-C. and D.M.; visualization, R.C.I.; supervision, E.G.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85.
  2. Li, O.; Liu, H.; Chen, C.; Rudin, C. Deep Learning for Case-Based Reasoning through Prototypes: A Neural Network that Explains Its Predictions. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
  3. Ras, G.; Xie, N.; Van Gerven, M.; Doran, D. Explainable deep learning: A field guide for the uninitiated. J. Artif. Intell. Res. 2022, 73, 329–396.
  4. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
  5. Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J. This Looks Like That: Deep Learning for Interpretable Image Recognition. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.
  6. Stefenon, S.F.; Singh, G.; Yow, K.C.; Cimatti, A. Semi-ProtoPNet Deep Neural Network for the Classification of Defective Power Grid Distribution Structures. Sensors 2022, 22, 4859.
  7. Singh, G.; Yow, K.C. Object or Background: An Interpretable Deep Learning Model for COVID-19 Detection from CT-Scan Images. Diagnostics 2021, 11, 1732.
  8. Singh, G. One and one make eleven: An interpretable neural network for image recognition. Knowl.-Based Syst. 2023, 279, 110926.
  9. Nauta, M.; van Bree, R.; Seifert, C. Neural Prototype Trees for Interpretable Fine-Grained Image Recognition. In Proceedings of the CVPR, Nashville, TN, USA, 19–25 June 2021.
  10. Gulshad, S.; Long, T.; van Noord, N. Hierarchical Explanations for Video Action Recognition. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 18–22 June 2023; pp. 3703–3708.
  11. Koh, P.W.; Nguyen, T.; Tang, Y.S.; Mussmann, S.; Pierson, E.; Kim, B.; Liang, P. Concept Bottleneck Models. In Proceedings of the ICML, Virtual Event, 13–18 July 2020.
  12. Papernot, N.; McDaniel, P. Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018.
  13. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the NeurIPS, Long Beach, CA, USA, 4–9 December 2017.
  14. Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for Fine-grained Category Detection. In Proceedings of the ECCV, Zurich, Switzerland, 6–12 September 2014.
  15. Bancroft, S. An Algebraic Solution of the GPS Equations. IEEE Trans. Aerosp. Electron. Syst. 1985, AES-21, 56–59.
  16. Abel, J.; Chaffee, J. Existence and Uniqueness Analysis for the GPS Equations. IEEE Trans. Aerosp. Electron. Syst. 1991, 27, 952–956.
  17. Norrdine, A. An Algebraic Solution to the Multilateration Problem. In Proceedings of the International Conference on Indoor Positioning and Indoor Navigation (IPIN), Sydney, Australia, 13–15 November 2012; pp. 1–6.
  18. Levenberg, K. A Method for the Solution of Certain Non-Linear Problems in Least Squares. Q. Appl. Math. 1944, 2, 164–168.
  19. Marquardt, D.W. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. J. Soc. Ind. Appl. Math. 1963, 11, 431–441.
  20. Moore, E.H. On the Reciprocal of the General Algebraic Matrix. Bull. Am. Math. Soc. 1920, 26, 394–395.
  21. Penrose, R. A Generalized Inverse for Matrices. Proc. Camb. Philos. Soc. 1955, 51, 406–413.
  22. Bohle, M.; Singh, N.; Fritz, M.; Schiele, B. B-Cos Alignment for Inherently Interpretable CNNs and Vision Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4504–4518.
  23. Wolf, T.N.; Kavak, E.; Bongratz, F.; Wachinger, C. SIC: Similarity-Based Interpretable Image Classification with Neural Networks. arXiv 2025, arXiv:2501.17328.
  24. Turbé, H.; Bjelogrlic, M.; Mengaldo, G.; Lovis, C. ProtoS-ViT: Visual foundation models for sparse self-explainable classifications. arXiv 2024, arXiv:2406.10025.
  25. Singh, G.; Frizzo Stefenon, S.; Yow, K.C. The shallowest transparent and interpretable deep neural network for image recognition. Sci. Rep. 2025, 15, 13940.
  26. Leavitt, M.L.; Morcos, A. Selectivity considered harmful: Evaluating the causal impact of class selectivity in DNNs. arXiv 2020, arXiv:2003.01262.
  27. Pearce, M.T.; Dooms, T.; Rigg, A.; Oramas, J.M.; Sharkey, L. Bilinear MLPs enable weight-based mechanistic interpretability. arXiv 2025, arXiv:2410.08417.
Figure 1. Deep Prototype-Based Network architecture.
Figure 2. Reconstructions of prototypes reveal that the uppermost panel (layer 1) exhibits digit-specific structures, while deeper layers (from top to bottom) tend to converge toward patterns that are less interpretable by humans due to their combination.
Figure 3. Prototypes derived from the final layer in Experiment 2, presented alongside the corresponding weight matrix.
Figure 4. Prototypes obtained from those transferred from a 4-layer DPBN in Experiment 2, shown together with the corresponding weight matrix. The prototypes are indexed by row and column from 0 to 14.
Figure 5. Reconstructed prototypes from the last layer showing a low degree of interpretability.
Figure 6. Softmax-normalized weight matrix of the Deep Prototype Network. Rows correspond to prototypes, columns to the two classes. Prototype #4 (bottom row) is visually a digit “0’’ yet assigns the bulk of its weight to the other class.
Figure 7. Reconstructed deepest-layer prototypes after training. The last prototype (right-most) is a clear “0’’ despite its functional association with the other class.
Figure 8. Twelve test images nearest to Prototype #4 in latent space, all correctly labeled as class “0’’.
Table 1. Extreme matrices and their corresponding normalized negative entropy (NNE) values.
Matrix | NNE
Identity $I_K$ (perfectly sharp) | 1.000
Uniform $\frac{1}{K}\mathbf{1}$ (fully diffuse) | 0.000
