Dimensionality reduction with variational encoders based on subsystem purification

Efficient methods for encoding and compression are likely to pave the way toward efficient trainability on higher-dimensional Hilbert spaces, overcoming issues of barren plateaus. Here we propose an alternative approach to variational autoencoders for reducing the dimensionality of states represented in higher-dimensional Hilbert spaces. To this end we build a variational autoencoder circuit that takes a dataset as input and optimizes the parameters of a Parameterized Quantum Circuit (PQC) ansatz to produce an output state that can be represented as a tensor product of two subsystems, by maximizing the subsystem purity $\operatorname{tr}(\rho^2)$ (equivalently, minimizing the subsystem entropy). The output of this circuit is passed through a series of controlled swap gates and measurements to produce a state with half the number of qubits that retains the features of the starting state, in the same spirit as dimensionality-reduction techniques used in classical algorithms. The output thus obtained is used for supervised learning to verify the encoding procedure developed here. We use the Bars and Stripes (BAS) dataset on an 8x8 grid to create efficiently encoded states and report a classification accuracy of 95% on it. The demonstrated example thus provides a proof of principle that the method can reduce states represented in large Hilbert spaces while maintaining the features required by any subsequent machine learning algorithm.


Introduction
Variational quantum algorithms in the NISQ era [15] provide a promising route toward developing useful algorithms, allowing states in higher-dimensional spaces to be optimized by tuning only a polynomial number of parameters. The most prominent variational techniques include the Variational Quantum Eigensolver (VQE) [14], the Quantum Approximate Optimization Algorithm (QAOA) [9], and other methods inspired by classical machine learning. We refer the reader to [18] for an exhaustive survey of quantum machine learning, with applications in chemistry [17,26,11], physics [19], supervised image classification [6], and optimization [20]. Within the context of optimization and machine learning in general, the major problems that need to be addressed include encoding classical data, finding a sufficiently expressive ansatz (expressibility) [21], efficiently computing gradients (trainability) [7], and generalizability [1]. These problems are interlinked and thus generally not treated independently.
As we move beyond the NISQ era toward deep Parameterized Quantum Circuits (PQCs), one of the major trainability problems that needs addressing is that of vanishing gradients, referred to as barren plateaus [13]. These may be an effect of working with a large number of qubits [13], an overly expressive circuit ansatz [10], noise [23], or the use of global cost functions during learning [3]. Efficient procedures for reducing the dimensionality of input quantum state representations would pave a path toward efficient encoding schemes whose outputs can be fed into other machine learning algorithms, since cost functions on higher-dimensional spaces with expressive ansätze are less likely to be trainable. To this end we develop machine learning techniques that produce compact representations of a given input quantum state.
Within the classical machine learning community, autoencoders have been used effectively to develop low-dimensional representations of samples generated from a given probability distribution [12]. Inspired by these techniques, work on quantum autoencoders [16,22] has allowed compact representations to be built against a fixed reference (garbage) state. It is not clear that such a tensor product structure with a fixed reference state is always possible, or that it retains the maximum possible information. Here we show that if one relaxes the requirement of a fixed reference state, a better compact representation can be generated that can then be post-processed for classification. We develop techniques to create subsystem purifications for a given set of inputs, followed by superpositions of these purifications indexed by the subsystem number. This representation is then used for classification, achieved by applying variational methods over parameterized quantum circuits restricted to the compact representation, and we demonstrate that the method learns. Applying an ansatz that creates subsystem purifications to the Bars and Stripes (BAS) dataset, we show that one can halve the number of qubits required to represent the data while achieving 95% classification accuracy. The demonstrated example serves as a proof of principle that the method can reduce states represented in large Hilbert spaces while maintaining the features required by any subsequent machine learning algorithm. The proposed scheme can be extended to problems with states in large Hilbert spaces where dimensionality reduction plays a key role in the trainability of the parameterized quantum circuit.

Method
Given an ensemble of input states, $E = \{|\psi_i\rangle\}$, the objective is to construct a low-dimensional representation of states sampled from this distribution $E$. Let $|\psi_i\rangle$ be a state over $n_A + n_B$ qubits. We design a protocol that creates an equivalent compact representation of $|\psi\rangle$ using $\max(n_A, n_B) + 1$ qubits. To simplify the discussion, let us assume $n_A = n_B$, so that the representation uses half the number of qubits (plus one ancilla). We do this in two stages.

Stage 1:
In the first stage we apply a unitary $U(\vec\theta)$ that decomposes the input into a tensor product over the two subsystems, $U(\vec\theta)|\psi\rangle = |\phi\rangle_A \otimes |\eta\rangle_B$. To produce such a tensor product structure we could minimize the entropy of either subsystem $A$ or $B$ until it reaches zero. Thus we could optimize the cost function

$$C_B(\vec\theta) = \Big\langle S\big(\operatorname{tr}_A\big[\,U(\vec\theta)\,|\psi\rangle\langle\psi|\,U^\dagger(\vec\theta)\,\big]\big)\Big\rangle_{\{|\psi\rangle\}}, \qquad (1)$$

where $\operatorname{tr}_A$ denotes the partial trace over the qubits of subsystem $A$, $\langle\,\cdot\,\rangle_{\{|\psi\rangle\}}$ denotes averaging over the ensemble $\{|\psi\rangle\}$, and $S(\rho) = -\operatorname{tr}(\rho \log \rho)$ is the von Neumann entropy of a given density matrix $\rho$. The cost function $C_B(\vec\theta)$ attains its maximum value of $\log(2^{n_B})$ when $\rho_B$ is maximally mixed, and equals $0$ when $\rho_B$ is a pure state. Fig. 1 shows a schematic representation of the ansatz used for $U(\vec\theta)$.
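To make Eq. (1) concrete, the following is a minimal NumPy sketch of the entropy cost evaluated on statevectors from a classical simulation; the qubit ordering (subsystem $A$ first) and all function names here are illustrative choices rather than part of the protocol.

```python
import numpy as np

def rho_B(psi, n_a, n_b):
    """Reduced density matrix of subsystem B: rho_B = tr_A |psi><psi|."""
    m = psi.reshape(2**n_a, 2**n_b)   # row index: A basis, column index: B basis
    return m.T @ m.conj()             # (rho_B)_{jk} = sum_a psi_{aj} psi*_{ak}

def entropy_cost(encoded_states, n_a, n_b, eps=1e-12):
    """Ensemble-averaged von Neumann entropy S(rho_B), as in Eq. (1)."""
    total = 0.0
    for psi in encoded_states:        # each psi is already U(theta)|psi_i>
        evals = np.linalg.eigvalsh(rho_B(psi, n_a, n_b))
        evals = evals[evals > eps]    # drop (near-)zero eigenvalues: 0*log(0) = 0
        total += -(evals * np.log(evals)).sum()
    return total / len(encoded_states)
```

The eigenvalue cutoff makes explicit the numerical instability discussed next: for a nearly pure $\rho_B$ the spectrum is nearly singular, which motivates the purity-based cost of Eq. (2).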
Variational quantum algorithms have been studied in the past for preparing thermal states by minimizing the free energy of the output state [24,4]. The main problem tackled in those works is computing the gradients of the entropy during training, which is difficult because there is no exact representation that computes the logarithm of a given density matrix efficiently. Furthermore, to avoid numerical instabilities in the entropy arising from the density matrix of a pure state being singular, we instead maximize the purity-based cost function

$$C_{AB}(\vec\theta) = \Big\langle \operatorname{tr}\big(\rho_B^2\big)\Big\rangle_{\{|\psi\rangle\}}, \qquad \rho_B = \operatorname{tr}_A\big[\,U(\vec\theta)\,|\psi\rangle\langle\psi|\,U^\dagger(\vec\theta)\,\big]. \qquad (2)$$

The parameters $\vec\theta$ are variationally optimized to obtain $\vec\theta^{\,*} = \operatorname{argmax}_{\vec\theta}\, C_{AB}(\vec\theta)$. If $C_{AB}(\vec\theta)$ reaches its optimal value of one, we can express $|\psi\rangle_{AB} = |\phi\rangle_A |\eta\rangle_B$, thus representing a state with $2^{n_A+n_B}$ degrees of freedom using only $2^{n_A} + 2^{n_B}$ degrees of freedom. Having expressed the input state as a tensor product of subsystems, we now move to Stage 2 of the algorithm.
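Under the same assumptions as the sketch above, the purity objective of Eq. (2) needs no eigendecomposition or logarithm:

```python
import numpy as np

def purity_cost(encoded_states, n_a, n_b):
    """Ensemble-averaged purity tr(rho_B^2), as in Eq. (2); equals 1 exactly
    when every encoded state factorizes as |phi>_A (x) |eta>_B."""
    total = 0.0
    for psi in encoded_states:
        m = psi.reshape(2**n_a, 2**n_b)
        rho_b = m.T @ m.conj()        # rho_B = tr_A |psi><psi|
        total += np.real(np.trace(rho_b @ rho_b))
    return total / len(encoded_states)
```

On hardware, $\operatorname{tr}(\rho_B^2)$ is typically estimated with a destructive swap test between two copies of the state, which is the primitive assumed in the runtime analysis below.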
Stage 2: Note that the above representation still makes use of $2n$ qubits to capture the features of $|\psi\rangle$. We now show how this representation can be compressed to $n + 1$ qubits: using one additional ancillary qubit, amplitude amplification, and projective measurements, one can create the state $|0\rangle|\phi\rangle + |1\rangle|\eta\rangle$ (up to normalization) starting from $\frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)\,|\phi\rangle\,|\eta\rangle$. To do this we apply a CSWAP (controlled swap/Fredkin) gate acting on the qubits of subsystems $A$ and $B$, giving $|0\rangle|\phi\rangle|\eta\rangle + |1\rangle|\eta\rangle|\phi\rangle$ (up to normalization). If $|\eta\rangle$ and $|\phi\rangle$ are not orthogonal states, then there exists at least one element $|g\rangle$ of the computational basis with a nonzero coefficient in both states. The second subsystem is then measured in the computational basis; without loss of generality, assume the measurement collapses onto $|g\rangle$, giving rise to $\frac{1}{\sqrt{1+c^2}}\big(|0\rangle|\phi\rangle + c e^{i\alpha}|1\rangle|\eta\rangle\big) \otimes |g\rangle$, where $c$ and $\alpha$ are real numbers. The factor $c e^{i\alpha}$ is generated by the relative difference between the coefficients of $|g\rangle$ in the two states. To ensure that the state collapses onto a specific garbage state $|g\rangle$, we choose $|g\rangle$ to be the basis projection with the maximum probability. The factor $c e^{i\alpha}$ can be absorbed as a global normalization if the ancilla register is instead prepared in the state $c e^{i\alpha}|0\rangle + |1\rangle$. This ensures that the output of this stage is $|0\rangle|\phi\rangle + |1\rangle|\eta\rangle$. To extend this description to the case where $|\phi\rangle$ and $|\eta\rangle$ are orthogonal, one simply applies a transformation controlled on the ancilla register to break this condition. Fig. 3 shows a schematic representation of the main steps involved in creating the superposition, with the ancilla register used as an index to the subsystem outputs of Stage 1.
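A statevector-level sketch of Stage 2, written under the assumption that Stage 1 has already produced an exact product $|\phi\rangle|\eta\rangle$; the measurement is modeled by projecting the second register onto its most probable outcome $|g\rangle$, as prescribed above, and the function name is illustrative.

```python
import numpy as np

def stage2_index_superposition(phi, eta):
    """From (|0>+|1>)/sqrt(2) (x) |phi>|eta>, apply CSWAP on registers A and B,
    measure register B (modeled as projecting onto the most probable outcome
    |g>), and return the indexed state ~ |0>|phi> + c e^{i alpha} |1>|eta>."""
    d = len(phi)
    branch0 = np.kron(phi, eta).reshape(d, d)  # ancilla |0>: registers |phi>|eta>
    branch1 = np.kron(eta, phi).reshape(d, d)  # ancilla |1>: CSWAP gives |eta>|phi>
    # probability of each outcome g on register B, summed over both branches
    p_g = 0.5 * ((np.abs(branch0)**2).sum(axis=0) + (np.abs(branch1)**2).sum(axis=0))
    g = int(np.argmax(p_g))                    # collapse onto the likeliest |g>
    out = np.concatenate([branch0[:, g], branch1[:, g]])  # <g|_B on each branch
    return out / np.linalg.norm(out), g
```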

Output:
Thus we have successfully converted the input state $|\psi\rangle$ into $|0\rangle|\phi\rangle + |1\rangle|\eta\rangle$, as required. Note that this procedure is reversible and hence the representation is unique, preserving all the information content encoded in the input state $|\psi\rangle$. To see that it is reversible, one takes two copies of the output state $|0\rangle|\phi\rangle + |1\rangle|\eta\rangle$, measures the corresponding ancillas to project out $|\phi\rangle$ and $|\eta\rangle$, and then applies the inverse of $U(\vec\theta)$ to recover $|\psi\rangle$. The encoding scheme thus allows us to represent an input state $|\psi\rangle$ on $2n$ qubits using only $n + 1$ qubits. This procedure can be repeated iteratively as long as the output state vectors permit a further size reduction.
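The reversibility argument can be checked numerically; the sketch below assumes direct access to the statevector and to the unitary matrix `U` of the trained encoder, which on hardware would be replaced by the two-copy measurement procedure just described.

```python
import numpy as np

def recover_input(indexed_state, U, n_half):
    """Split c0|0>|phi> + c1|1>|eta> into |phi> and |eta>, then undo U(theta)."""
    d = 2 ** n_half
    phi = indexed_state[:d]                  # ancilla = |0> block
    eta = indexed_state[d:]                  # ancilla = |1> block
    phi = phi / np.linalg.norm(phi)
    eta = eta / np.linalg.norm(eta)
    return U.conj().T @ np.kron(phi, eta)    # U(theta)^dagger |phi>|eta> = |psi>
```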

Results
To demonstrate the working of the method described above, we pick a toy dataset of Bars and Stripes (BAS) images and build a compact representation of it. The BAS dataset we consider is a square grid in which either some columns are completely filled (bars) or some rows are completely filled (stripes) [2]. One can easily generate such a supervised dataset and observe that the distribution from which these images are sampled has a low-entropy characterization. We randomly sample 1000 data points from the 16x16 BAS dataset, which consists of 131068 data points, each represented using amplitude encoding on 8 qubits.
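A short sketch of how such a dataset can be generated and amplitude encoded; excluding the all-empty and all-full patterns reproduces the counts quoted here and below (131068 images at 16x16 and 508 at 8x8). The normalization convention is an assumption of this sketch.

```python
import itertools
import numpy as np

def bas_dataset(n):
    """All n x n Bars (filled columns, label +1) and Stripes (filled rows,
    label -1) images, excluding the all-empty and all-full patterns."""
    images, labels = [], []
    for bits in itertools.product([0, 1], repeat=n):
        if 0 < sum(bits) < n:
            bars = np.tile(np.array(bits), (n, 1))   # column j filled iff bits[j]
            images.append(bars);   labels.append(+1)
            images.append(bars.T); labels.append(-1) # rows filled instead
    return np.array(images), np.array(labels)

def amplitude_encode(img):
    """Flatten an image into a normalized statevector (8x8 -> 64 amplitudes
    -> 6 qubits; 16x16 -> 256 amplitudes -> 8 qubits)."""
    v = img.astype(float).ravel()
    return v / np.linalg.norm(v)

images, labels = bas_dataset(8)
print(len(images))   # 508, matching the dataset size quoted below
```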
Applying the protocol described above, we reduce the representation of each state into a tensor product of two subsystems of equal size. Figs. 4 and 6 show the learning of the optimal parameters $\vec\theta$ as the cost function converges. We use standard gradient descent [25] for the training. Note that $1 - \mathrm{Cost}$ drops to zero, implying that the representation thus created is exact, with a lossless transformation effected by $U(\vec\theta)$. For the 16x16 grid, the ansatz $U(\vec\theta)$ consists of $D = 5$ layers, while for the 8x8 grid it consists of $D = 3$ layers. At this point we apply a layer of controlled swap gates to reduce the 8-qubit representation of the 16x16 grid samples to 5 qubits and the 6-qubit representation of the 8x8 grid samples to 4 qubits.
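Since the optimizer details beyond standard gradient descent [25] are not spelled out here, the following minimal loop uses a central finite-difference gradient estimate as a stand-in; the learning rate, step size, and iteration count are illustrative.

```python
import numpy as np

def train(cost, thetas, lr=0.1, delta=1e-3, iters=200):
    """Minimize cost(theta) (e.g. 1 - C_AB(theta) from Eq. (2)) by plain
    gradient descent with central-difference gradient estimates."""
    thetas = thetas.copy()
    for _ in range(iters):
        grad = np.zeros_like(thetas)
        for k in range(thetas.size):
            tp = thetas.copy(); tp[k] += delta
            tm = thetas.copy(); tm[k] -= delta
            grad[k] = (cost(tp) - cost(tm)) / (2 * delta)
        thetas -= lr * grad
    return thetas
```

Applied to the Stage 1 objective, $\mathrm{cost} = 1 - C_{AB}(\vec\theta)$, this drives the quantity plotted in Figs. 4 and 6 toward zero.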
We now use this as the input for supervised classification. Approximately 80% of the encoded samples are used for training and the remaining 20% are kept for testing. An ansatz $V(\vec\theta)$ over the same number of qubits as the input samples is trained, with the sign of the expectation value of the Pauli-$Z$ operator used as the label differentiating bars from stripes: an input image is classified as a bars image if the expectation value is positive, and as a stripes image if it is negative. We minimize the sum of squared ($\ell_2$) errors over the dataset labels ($+1$ for bars and $-1$ for stripes), i.e.,

$$C(\vec\theta) = \sum_i \big( l_i - \langle\psi_i|\,V^\dagger(\vec\theta)\, Z\, V(\vec\theta)\,|\psi_i\rangle \big)^2, \qquad (3)$$

where the summation index $i$ labels the dataset, $l_i$ refers to the label of the corresponding sample, and $|\psi_i\rangle$ denotes the compact representation of the state produced by the encoding scheme above. For the 8x8 grid, a total of 508 bars and stripes images are produced, with half belonging to each category. We use 400 of these samples for training and 108 for testing. Fig. 8 shows the cost of optimizing the parameters of $V(\vec\theta)$ as a function of the number of iterations. We obtain 95% accuracy on the testing data, showing that the method used to generate the compact representation did not destroy the features of the input state.
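A sketch of the classification step of Eq. (3), assuming the readout is the Pauli-$Z$ expectation on one designated qubit of the circuit output; the choice of readout qubit and the name `apply_V` (standing for the action of $V(\vec\theta)$) are assumptions of this sketch.

```python
import numpy as np

def expval_z(psi, qubit, n):
    """<psi| Z_qubit |psi> = P(qubit=0) - P(qubit=1) on an n-qubit statevector."""
    probs = (np.abs(psi) ** 2).reshape([2] * n)
    return float(probs.take(0, axis=qubit).sum() - probs.take(1, axis=qubit).sum())

def classification_cost(states, labels, apply_V, thetas, n, readout=0):
    """Eq. (3): sum_i (l_i - <Z>_i)^2 with labels +1 (bars) and -1 (stripes)."""
    return sum((l - expval_z(apply_V(psi, thetas), readout, n)) ** 2
               for psi, l in zip(states, labels))

def predict(psi, apply_V, thetas, n, readout=0):
    """Positive <Z> -> bars (+1), negative -> stripes (-1)."""
    return 1 if expval_z(apply_V(psi, thetas), readout, n) > 0 else -1
```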

Runtime analysis of the encoding scheme
Here we analytically estimate the runtime of the protocol described above. Assume that the input ensemble of $N$ quantum states over $n$ qubits supports a compact representation, allowing us to use the above protocol to encode with half the number of qubits, and let the ansatz to be optimized consist of $d$ layers. Stage 1 then involves optimizing $2nd$ parameters over the $N$ samples. Using the destructive swap test to compute fidelity to within an error $\epsilon$ requires $O(1/\epsilon^2)$ samples, so Stage 1 requires evaluating $O(ndN/\epsilon^2)$ quantum circuits per iteration. Stage 2 involves projecting onto the basis state with the largest overlap. The overlap onto a given computational basis state $|g\rangle$ can be maximized using Grover's algorithm, with a worst-case runtime of $O(2^{n/2})$ steps and a query complexity of $O(1/\epsilon)$. Thus the overall runtime is bounded by $O\big(N(Tnd/\epsilon^2 + 2^{n/2}/\epsilon)\big)$, where $T$ is the number of iterations required in the Stage 1 optimization. In contrast, the runtime of a classical autoencoder to prepare a compact state is $O(NTd2^n)$. We show in Appendix A how, for certain cost functions, a specialized ansatz and carefully prepared index registers allow one to get around the exponential cost incurred in preparing the compact superposition state for machine learning tasks.

Conclusion
We have discussed a scheme that allows for a compact representation of states in higher-dimensional Hilbert spaces using half the number of qubits. The output thus created serves as a good starting state for any machine learning algorithm that may follow. The protocol is based on designing a quantum circuit that creates tensor product subsystems, and we demonstrate results on the Bars and Stripes dataset for 8x8 and 16x16 grids. We further use this output to create compact representations with half the number of qubits of the starting state. To show that this representation is a lossless encoding, we use it for supervised learning with variational circuits on the entire 8x8 grid dataset and obtain 95% accuracy on the testing dataset (consisting of 108 samples). Unlike quantum autoencoders, where the compact representation relies on being able to optimize against a fixed garbage state, here the relaxed restriction on the tensor product provides compact representations in cases where a fixed garbage state would not be feasible. Further investigation of what the entanglement between the subsystems reveals about the probability distribution from which the data is sampled could lead to other useful applications of this protocol. One might also be interested in carrying out machine learning with weighted quantum circuits that run on the subsystems independently and comparing their performance against the compact representations created here. One can also imagine using the low-entropy entangled states output by the Stage 1 protocol as input states for entanglement forging [8] and looking for useful applications there. We conclude by noting that efficient methods for encoding and compression are likely to pave the way toward efficient trainability on higher-dimensional Hilbert spaces, and this work serves as a step in that direction.

Figure 1: Ansatz used for the encoding circuit $U(\vec\theta)$ in Stage 1. The circuit consists of $D$ repeating layers, each comprising $R_y$ gates parameterized by one independent angle per qubit and a ladder of CNOT gates. The circuit is optimized over the dataset to generate equivalent states with a subsystem tensor product structure, so that $U(\vec\theta)|\psi\rangle = |\phi\rangle \otimes |\eta\rangle$. $|\phi\rangle$ is the first subsystem, which shall later be indexed by the ancilla state $|0\rangle$, and $|\eta\rangle$ is the second subsystem, which shall be indexed by $|1\rangle$.
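A minimal statevector implementation matching this description, assuming one $R_y$ per qubit per layer followed by a nearest-neighbour CNOT ladder; the exact ladder ordering is our reading of the figure. Each layer contributes one angle per qubit, so a $2n$-qubit encoder with $d$ layers carries the $2nd$ parameters counted in the runtime analysis.

```python
import numpy as np

def apply_ry(psi, q, theta, n):
    """R_y(theta) on qubit q of an n-qubit statevector."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    t = psi.reshape([2] * n)
    a, b = t.take(0, axis=q), t.take(1, axis=q)
    return np.stack([c * a - s * b, s * a + c * b], axis=q).reshape(-1)

def apply_cnot(psi, ctrl, tgt, n):
    """CNOT with control `ctrl` and target `tgt` (ctrl < tgt in the ladder)."""
    t = psi.reshape([2] * n).copy()
    sub = t.take(1, axis=ctrl)               # slice where control = 1
    axis = tgt if tgt < ctrl else tgt - 1    # target axis after slicing
    idx = [slice(None)] * n; idx[ctrl] = 1
    t[tuple(idx)] = np.flip(sub, axis=axis)  # flip the target bit
    return t.reshape(-1)

def encoder_U(psi, thetas, n, depth):
    """U(theta): depth layers of per-qubit R_y rotations and a CNOT ladder."""
    k = 0
    for _ in range(depth):
        for q in range(n):
            psi = apply_ry(psi, q, thetas[k], n); k += 1
        for q in range(n - 1):
            psi = apply_cnot(psi, q, q + 1, n)
    return psi
```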


Figure 3: A schematic representation of the steps involved in Stage 2 to prepare the superposition state, using an extra ancilla, from the product-state output of Stage 1. Controlled swap gates are used to generate $|0\rangle|\phi\rangle + |1\rangle|\eta\rangle$. Following this, the second subsystem is measured in the computational basis, imparting a relative phase and amplitude (not shown in the above representation).

Figure 4: Stage 1 training cost vs. iterations for the 16x16 grid. The unitary circuit thus trained creates equivalent tensor product representations using two equal subsystems of 4 qubits each. Note that $1 - \mathrm{Cost}$ eventually saturates at 0, allowing us to create pure-state product subsystems.

Figure 6: Stage 1 training cost vs. iterations for the 8x8 grid. The unitary circuit thus trained creates equivalent tensor product representations using two equal subsystems of 3 qubits each. Note that $1 - \mathrm{Cost}$ eventually saturates at 0, allowing us to create pure-state product subsystems.

Figure 8: Classification cost vs. iterations for the 8x8 grid. The figure shows the saturation of the classification cost of Eq. (3) after 13 iterations.

Figure 9: Classification cost vs. $\sum_i |\Delta\theta_i|$ for the 8x8 grid. The figure shows that the variation in the angles, as computed from the gradient of Eq. (3), is minimized as one approaches the saturation point. ($|\Delta\vec\theta|_1$ measures the 1-norm of the change in the angles computed from the gradients as the epochs increase.)