Transfer Entropy and O-Information to Detect Grokking in Tensor Network Multi-Class Classification Problems
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This is an interesting study which I will not summarize. The Introduction, Results and Discussion are well written. The major issue with the manuscript is that the methods central to the manuscript, tensor networks and matrix product states (MPSs), are not described in sufficient detail in the Materials and Methods section for a reader who is not familiar with this approach to understand without going to a number of other cited papers. This issue is true for the relevant schematic figures (e.g., Figure 2 and the illustration shown on the top of p. 7) as well as details associated with Equations (1)-(15). For example, Figure 2 should be explained more fully (e.g., what do the pink dots represent?) so that each feature of the diagrams and diagrammatic equations is defined and made more understandable to a reader that is not familiar with MPSs. In Eq. (2), the a_j and s_j are not defined in the text. Eq. (4) does not make sense, as f_W^l(x) is an N-qubit quantum state and y_ω^l is a number (i.e., 1 if sample ω belongs to class l and 0 otherwise), and these two quantities are encouraged to be equal to each other via minimization of that equation. The output of the classifier, max_component |f_W^l(x)|, should replace f_W^l(x) in Eq. (4). What does the two-site gradient descent scheme mean? What do the pink and red colored dots correspond to in the diagram shown on the top of p. 7? What values do s_j and a_j take in Eq. (5)? Is the Σ_Z^{l,j} on line 261 the same as the SVD Σ described on line 199? This is all part of one major comment. In my view, Sections 2.3-2.5 should be re-written in such a way that all of this is addressed, the methods are made much more clear, and all the variables and diagrams are explicitly defined for a reader not familiar with MPSs. This could be accomplished by adding to the Appendix so that the paper is more self-contained.
It would be interesting for the authors to comment in the Discussion on how their approach scales with data size and on how competitive this approach is with state-of-the-art deep learning methods applied to the same data. Even a brief discussion would be helpful in putting their approach and study in fuller context.
Author Response
Comments 1: This is an interesting study which I will not summarize. The Introduction, Results and Discussion are well written. The major issue with the manuscript is that the methods central to the manuscript, tensor networks and matrix product states (MPSs), are not described in sufficient detail in the Materials and Methods section for a reader who is not familiar with this approach to understand without going to a number of other cited papers. This issue is true for the relevant schematic figures (e.g., Figure 2 and the illustration shown on the top of p. 7) as well as details associated with Equations (1)-(15). For example, Figure 2 should be explained more fully (e.g., what do the pink dots represent?) so that each feature of the diagrams and diagrammatic equations is defined and made more understandable to a reader that is not familiar with MPSs. In Eq. (2), the a_j and s_j are not defined in the text. Eq. (4) does not make sense, as f_W^l(x) is an N-qubit quantum state and y_ω^l is a number (i.e., 1 if sample ω belongs to class l and 0 otherwise), and these two quantities are encouraged to be equal to each other via minimization of that equation. The output of the classifier, max_component |f_W^l(x)|, should replace f_W^l(x) in Eq. (4). What does the two-site gradient descent scheme mean? What do the pink and red colored dots correspond to in the diagram shown on the top of p. 7? What values do s_j and a_j take in Eq. (5)? Is the Σ_Z^{l,j} on line 261 the same as the SVD Σ described on line 199? This is all part of one major comment. In my view, Sections 2.3-2.5 should be re-written in such a way that all of this is addressed, the methods are made much more clear, and all the variables and diagrams are explicitly defined for a reader not familiar with MPSs. This could be accomplished by adding to the Appendix so that the paper is more self-contained.
Response 1: A new appendix provides a detailed introduction to the conversion of states into MPS format for readers not familiar with tensor networks. Two new panels in Fig. 2 explain the diagrammatic notation of the pink and red colored dots, as recalled on lines 221-223. The SVD notation on line 199 is now modified to avoid ambiguities. In Eq. (4) the predictor provides probabilistic scores and is a three-component vector, according to its diagrammatic representation in Fig. 2(b). The cost function was introduced in Ref. [23].
Comments 2: It would be interesting for the authors to comment in the Discussion on how their approach scales with data size and on how competitive this approach is with state-of-the-art deep learning methods applied to the same data. Even a brief discussion would be helpful in putting their approach and study in fuller context.
Response 2: Two new paragraphs address the requested comparison (lines 526-530) and the scaling outlook (lines 573-575).
Reviewer 2 Report
Comments and Suggestions for Authors
In this manuscript, ‘Transfer entropy and O-information to detect grokking in tensor network multi-class classification problems’, the authors propose a quantum-enhanced machine learning approach, study the training dynamics of matrix product state (MPS) classifiers, and apply it to three-class problems.
In detail, they use both MNIST and hyperspectral satellite imagery as representative datasets. They investigate the phenomenon of grokking and employ information-theoretic tools to gain deeper insights. In particular, they use transfer entropy and entanglement entropy to reveal causal dependencies between label-specific quantum masks, and use O-information to capture the shift from synergistic to redundant correlations among class outputs. The results show that: 1) grokking in the MNIST task coincides with a sharp entanglement transition and a peak in redundant information, and 2) the overfitted hyperspectral model retains synergistic, disordered behavior.
Combining quantum methods and machine learning/AI is an interesting and important area for the future of computation and information processing. I believe the derivations, equations and figures should be correct in general. Therefore, I will approve its acceptance in principle after the authors consider the following points:
1) In the abstract and introduction, what do the two abbreviations (MNIST and PRISMA) stand for? For readers not familiar with machine learning, the full names should be given when they first appear.
2) Are the systems or datasets you considered open or closed systems? If they are open systems, are the traditional definitions of entropy (entanglement entropy, transfer entropy, higher-order mutual entropy) still appropriate in your cases? At the very least, there may be some limitations when applying them.
3) Because entropy-related quantities are used, such as transfer entropy, entanglement entropy, higher-order mutual entropy, etc., I suggest you introduce a little more about entropy, e.g., from von Neumann entropy, Shannon entropy, and Rényi entropy to the generalized non-Hermitian Rényi entropy (Non-Hermitian Generalization of Rényi Entropy. Entropy, 2022, 24(11): 1563) and its applications (Non-Hermitian Quantum Rényi Entropy Dynamics in Anyonic-PT Symmetric Systems. Symmetry, 2024, 16(5): 584). Two to three sentences are enough. This point is also relevant to the second point above.
Author Response
Comments 1: In the abstract and introduction, what do the two abbreviations (MNIST and PRISMA) stand for? For readers not familiar with machine learning, the full names should be given when they first appear.
Response 1: In the new version we introduce the acronyms in the Introduction, at lines 63 and 70.
Comments 2: Are the systems or datasets you considered open or closed systems? If they are open systems, are the traditional definitions of entropy (entanglement entropy, transfer entropy, higher-order mutual entropy) still appropriate in your cases? At the very least, there may be some limitations when applying them.
Comments 3: Because entropy-related quantities are used, such as transfer entropy, entanglement entropy, higher-order mutual entropy, etc., I suggest you introduce a little more about entropy, e.g., from von Neumann entropy, Shannon entropy, and Rényi entropy to the generalized non-Hermitian Rényi entropy (Non-Hermitian Generalization of Rényi Entropy. Entropy, 2022, 24(11): 1563) and its applications (Non-Hermitian Quantum Rényi Entropy Dynamics in Anyonic-PT Symmetric Systems. Symmetry, 2024, 16(5): 584). Two to three sentences are enough. This point is also relevant to the second point above.
Response 2-3: A new paragraph at lines 236-241 addresses these points and includes the additional references.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors propose an MPS-based quantum machine learning model for three-class classification problems and investigate the training dynamics of this model. By comparing the MNIST dataset with a hyperspectral satellite imagery dataset, they explore how the model generalizes, how internal correlations are structured, and how the information structure evolves through training. Through these analyses, the authors demonstrate that their formalism for classification scores helps in understanding the internal representations of network restructuring and grokking transitions. The manuscript is relatively well written; however, there are several parts that require clarification and elaboration. It is necessary to revise the manuscript in light of the following points.
- Figure 2 on page 5
The explanation of Figure 2 is insufficient. The authors should provide a more detailed explanation of this figure in connection with Equation (2).
- Line 199
The authors express the “bipartition of the one-dimensional system” as 𝑀=𝑈Σ𝑉†. A more detailed explanation of each component—𝑈, Σ, and 𝑉†—is required.
- Line 217
In the figure, the authors combine Λ and 𝑉† into a single matrix denoted as 𝐵. This should be explained in more detail.
- Equation (10) on page 8
A more comprehensive explanation of Equation (10) is necessary. For example, the meaning of the symbols 𝑝 and 𝜎_𝑍 is not provided and should be clarified.
- Line 259
The expression (<𝜎_Z^{l,j}(t)>, ... , <𝜎_Z^{l,j}(t-𝜏+1)>) appears in this line. However, it only shows the first and last terms. The authors should include at least the second term to clarify the structure. For instance, if the intended expression is (<𝜎_Z^{l,j}(t)>, <𝜎_Z^{l,j}(t-1)>, ..., <𝜎_Z^{l,j}(t-𝜏+1)>), it should be written explicitly to avoid confusion for the readers.
- The authors discuss “quantum-enhanced machine learning” in this paper. It would be beneficial to include a discussion comparing the performance of their approach with a classical (non-quantum) counterpart, representing any specific improvements gained through the quantum method.
- Minor Revisions
(1) Between lines 348–350, the authors define label 0, label 1, and label 2. However, it might be more appropriate to name them label 1, label 2, and label 3 for clarity (not mandatory).
(2) It is recommended to add "MNIST" to the list of abbreviations on line 606.
Author Response
Comments 1: The explanation of Figure 2 is insufficient. The authors should provide a more detailed explanation of this figure in connection with Equation (2).
Comments 2: The authors express the “bipartition of the one-dimensional system” as M = UΣV†. A more detailed explanation of each component—U, Σ, and V†—is required.
Response 1-2: A new appendix provides a detailed introduction to the conversion of states into MPS format, and we include two new panels in Fig. 2.
Comments 3: In the figure, the authors combine Λ and V† into a single matrix denoted as B. This should be explained in more detail.
Response 3: A formal definition of tensor B is now introduced in line 217.
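As a generic illustration of this step (a minimal NumPy sketch, not the implementation used in the manuscript), the singular values from the SVD bipartition M = UΣV† can be absorbed into V† to form the single right tensor B:

import numpy as np

# Generic sketch of the SVD bipartition M = U Σ V† and of absorbing Σ into V†
# to define B = Σ V† (so that M = U B); not taken from the manuscript's code.
M = np.random.rand(6, 8)                      # matrix obtained by grouping left/right indices
U, S, Vh = np.linalg.svd(M, full_matrices=False)
B = np.diag(S) @ Vh                           # B = Σ V†
assert np.allclose(M, U @ B)                  # exact before any truncation

# Keeping only the chi largest singular values yields the bond-dimension-chi approximation.
chi = 4
M_chi = U[:, :chi] @ np.diag(S[:chi]) @ Vh[:chi, :]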
Comments 4: A more comprehensive explanation of Equation (10) is necessary. For example, the meaning of the symbols p and σ_Z is not provided and should be clarified.
Response 4: The use of joint and conditional probability densities is presented at lines 267-268, while the symbol for the local magnetization expectation value is introduced at line 247.
Comments 5: The expression (<σ_Z^{l,j}(t)>, ... , <σ_Z^{l,j}(t-τ+1)>) appears in this line. However, it only shows the first and last terms. The authors should include at least the second term to clarify the structure. For instance, if the intended expression is (<σ_Z^{l,j}(t)>, <σ_Z^{l,j}(t-1)>, ..., <σ_Z^{l,j}(t-τ+1)>), it should be written explicitly to avoid confusion for the readers.
Response 5: A revised expression is now introduced in line 269.
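As a toy illustration of the delay-vector structure being clarified here (a hypothetical sketch, not the manuscript's code), the history vectors can be assembled from a scalar time series as follows:

import numpy as np

# Hypothetical sketch: from a time series m[t], e.g. the expectation value <σ_Z^{l,j}(t)>,
# build the lag vectors ( m[t], m[t-1], ..., m[t-τ+1] ) used in transfer-entropy estimates.
def lag_vectors(m, tau):
    return np.stack([m[t - tau + 1:t + 1][::-1] for t in range(tau - 1, len(m))])

m = np.random.rand(100)     # toy series of local magnetization expectation values
X = lag_vectors(m, tau=3)   # shape (98, 3); each row is (m[t], m[t-1], m[t-2])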
Comments 6: The authors discuss “quantum-enhanced machine learning” in this paper. It would be beneficial to include a discussion comparing the performance of their approach with a classical (non-quantum) counterpart, representing any specific improvements gained through the quantum method.
Response 6: A new paragraph in the Discussion section (lines 526-530) addresses the literature focused on the requested comparison. We emphasize that the use of the term "enhanced" refers to the improved interpretability of the proposed scheme.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed many but not all of my concerns/issues. The addition of the new Appendix A is particularly helpful. The remaining issue is a major one. It is associated with Eqs. (4) and (6).
Regarding Eq. (4), let's say one is able to train the model perfectly and the cost function C(W) = 0. This, in principle, is possible, but rarely, if ever, achieved for any given model and labeled data. Then f_W^l, which is a quantum state, would be equal to y_ω^l, which is a number: 1 if sample ω belongs to class l and 0 otherwise. This is impossible and makes no sense to me. If it is correct, then it needs to be explained much more clearly, as it appears to be mathematically and conceptually incorrect. The correct equation should be given by replacing f_W^l(x) with something like max_(component, label) |f_W^l(x)|. Specifically, the authors claim the predicted class label is found by the "component showing the highest absolute value |f_W^l(x)|". This is the predicted output of the model. This output of the model is what should appear in the cost function Eq. (4) instead of the quantum state f_W^l(x). This is standard for cost functions.
Regarding Eq. (6), because a derivative of the cost function shown in Eq. (4) is being taken with respect to the coefficients of f_W^l(x) shown in Eq. (5), we run into a related issue. A quantum state is put on the same footing as a number, y_ω^l. A tensor product of Φ(x_ω) with f_W^l(x_ω) makes sense. A tensor product of Φ(x_ω) with a number, y_ω^l, does not make sense.
This needs to be corrected or clarified in the text. If I'm missing something, please explain this clearly to me, including mathematical arguments to make your case. It would be helpful to include this in the text too, as I believe other readers would be confused as well.
A minor suggestion is to define bond dimension χ when it is introduced on line 182.
Author Response
We sincerely thank the referee for pointing out the notation inconsistency, which is now addressed by a proper definition of y_ω^l as a two-dimensional vector in the discussion of Eq. (4). With this definition, the tensor product in Eq. (6) is also well posed.
The specific form of the mean squared error in Eq. (4) is introduced in Ref. [23], corresponding to its Eq. (6) (according to the equation numbering of arXiv:1605.05775v2). This definition of the optimized cost function imposes that we compare the normalized amplitude scores with the known labels, with a final choice of the predicted label by adopting the maximum of the components' absolute values.
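For the reader's convenience, the quadratic cost under discussion has, schematically and in the notation of this exchange, the form

C(W) = (1/2) ∑_ω ∑_l ( f_W^l(x_ω) - y_ω^l )^2 ,

where f_W^l(x_ω) is the l-th component of the amplitude score for sample ω and y_ω^l = 1 if sample ω belongs to class l and 0 otherwise; the exact definition should be taken from Ref. [23].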
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have adequately revised the manuscript by taking into account the suggestions I proposed, and they have also provided appropriate responses. As a result, the content of the manuscript has been significantly improved. Based on these considerations, I recommend that the manuscript be accepted for publication in Technologies.
Author Response
We sincerely thank the referee for the positive evaluation of our work.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
We are actually getting very close to resolving this issue. But it still remains an issue in my view. The authors' rebuttal is as follows:
"The specific form of the mean squared error in Eq. (4) is introduced in Ref. [23], corresponding to its Eq. (6) (according to the equation numbering of arXiv:1605.05775v2). This definition of the optimized cost function imposes that we compare the normalized amplitude scores with the known labels, with a final choice of the predicted label by adopting the maximum of the components' absolute values."
As you write above, you "compare the normalized amplitude scores with the known labels, with a final choice of the predicted label by adopting the maximum of the components' absolute values". All I continue to write is that this is not explicitly in Eq. (4). It's a further implied procedure. I looked up the highly cited preprint that you cite above, which is your reference [23]. You are indeed using their definition and effective notation. Their Eq. (6) is your Eq. (4). Interestingly, I also looked up your reference [33], which is quoted at the end of the statements further defining Eq. (4). The authors of reference [33] are doing exactly what I'm arguing you should do: explicitly (rather than implicitly) define the mapping from f_W^l(x_ω) to your classifier, which should be compared to y_ω^l or the label vector as you now have it. If you look at Section A of your reference [33], "Machine learning with tree tensor networks, CP rank constraints, and tensor dropout", you'll see that they define things as you have all the way up to the definition of f(x) (both your and their Eq. (3)). However, they then make their loss function explicit and properly defined by taking the Born rule of f(x), an unnormalized quantum state of dimension L, to define the probability of x belonging to class l: p(l|x) = |f_l(x)|^2/||f(x)||^2. The class label associated with x is found by taking argmax_l p(l|x). Importantly, the Born rule effectively maps the learned state f(x) into a classifier p(l|x). Their loss function is the negative log likelihood L = - 1/|D| ∑_{x,l} ln p(l|x). The only difference with your cost function is that you're using a mean squared error cost or loss function. But you still need to map your f(x) to your classifier |f_W^l(x_ω)| in Eq. (4) as the authors of [33] did for their loss function. I think it's worth noting that the authors of the preprint were able to do what they wanted as it was not peer reviewed. They could define their cost function implicitly, while the authors of [33] had to go through peer review.
Author Response
We sincerely thank the referee and we fully agree on the issue concerning the definition of the cost function, which constrains the gradient descent procedure and indeed deserves further investigation. The discrepancies between Refs. [23] and [33] need to be explored in more detail. For this reason, we have added the following remark below Eq. (4): ‘It should be noted that in the definition of the cost function there exists a difference between Refs. [23] and [33], namely the use of normalized amplitude scores as opposed to Born-rule conditional class probabilities. A more detailed investigation of these discrepancies is required in future research.’
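Schematically, and in the notation used above, the two conventions named in this remark differ as follows: Ref. [23] compares the amplitude scores f_W^l(x_ω) directly with the label vectors y_ω^l through a quadratic cost, while Ref. [33] first maps f(x) to Born-rule class probabilities p(l|x) = |f_l(x)|^2/||f(x)||^2 and trains with the negative log likelihood L = - 1/|D| ∑_{x,l} ln p(l|x); in both cases the predicted label is the argmax over l.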
Round 4
Reviewer 1 Report
Comments and Suggestions for Authors
Given that the authors are closely following the formalism of E. M. Stoudenmire et al., "Supervised Learning with Tensor Networks", and Stoudenmire et al. don't use kets in their definition of φ^{s_j}(x_j) or f^l(x_n) as the authors do in their "equivalent" Eqs. (1) and (3), I suggest adding sentences along the lines of the following after "The predicted label is determined by the component showing the highest absolute value |f_W^l(x)| [33].":
"Operationally, we convert the kets shown in Eq. (4) and subsequent equations below to vectors that match the dimensions of y_ω^l when minimizing the cost function, similar to [23,X], whose formalism we closely follow. Notably, the authors of [23,X] define their feature maps in terms of vectors while we use ket notation as shown in Eq. (1)."
I would slightly modify your next sentence, which you just added, to start "Additionally, it should be noted..." and, instead of just [23], add reference X [23,X], where X here and above refers to the Stoudenmire et al. NeurIPS publication of the preprint. It's noteworthy that they were asked to remove mention of the qubit in the NeurIPS paper after their Eq. (3) and instead replaced that text with a statement that the [cos, sin] form of the feature map "is motivated by “spin” vectors encountered in quantum systems". This may have been for a more general audience that includes computer scientists. The important point is that you should make clear that you're comparing what appears to be a complex quantum state written in terms of kets to simple label vectors in the cost function Eq. (4) and that a further operation is needed to make that explicit. The text I've added attempts to resolve this. For convenience, the BibTeX reference [X] is pasted below:
@inproceedings{NIPS2016_5314b967,
author = {Stoudenmire, Edwin and Schwab, David J},
booktitle = {Advances in Neural Information Processing Systems},
editor = {D. Lee and M. Sugiyama and U. Luxburg and I. Guyon and R. Garnett},
pages = {},
publisher = {Curran Associates, Inc.},
title = {Supervised Learning with Tensor Networks},
url = {https://proceedings.neurips.cc/paper_files/paper/2016/file/5314b9674c86e3f9d1ba25ef9bb32895-Paper.pdf},
volume = {29},
year = {2016}
}
Author Response
We sincerely thank the referee for the suggested paragraph, which is inserted in the final version of the paper. We appreciate the additional reference, which is included as well.
Round 5
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed all my concerns.
