Fast Characterization of Input-Output Behavior of Non-Charge-Based Logic Devices by Machine Learning

Abstract: Non-charge-based logic devices are promising candidates for the replacement of conventional complementary metal-oxide semiconductor (CMOS) devices. These devices utilize magnetic properties to store or process information, making them power efficient. Traditionally, a large number of micromagnetic simulations are required to fully characterize the input-output behavior of these devices, which makes the process computationally expensive. Machine learning techniques have been shown to dramatically decrease the computational requirements of many complex problems. We use state-of-the-art data-efficient machine learning techniques to expedite the characterization of device behavior. Several intelligent sampling strategies are combined with machine learning (binary and multi-class) classification models. These techniques are applied to a magnetic logic device that utilizes direct exchange interaction between two distinct regions containing a bistable canted magnetization configuration. Three classifiers were developed with various adaptive sampling techniques in order to capture the input-output behavior of this device. By adopting an adaptive sampling strategy, it is shown that prediction accuracy can approach that of full grid sampling while using only a small training set of micromagnetic simulations. Comparing model predictions to a grid-based approach on two separate cases, the best performing machine learning model accurately predicts 99.92% of the dense test grid while utilizing only 2.36% of the training data.


Introduction
The scaling of conventional complementary metal-oxide semiconductors (CMOS) is reaching its limit [1] in accordance with Moore's prediction [2], introducing limitations and challenges to the semiconductor industry. As a result, various new concepts have emerged that aim to extend the semiconductor industry beyond CMOS technology [3,4]. Non-charge-based logic devices are one of the leading concepts [5] as these devices are power efficient and ultra-compact [6]. These devices can operate at high frequencies and offer new features such as non-volatility and low-voltage operation [5]. A number of such devices have been benchmarked in various publications for low-power applications [6–11].
With the need for solutions beyond CMOS, the research and development of novel non-charge-based logic devices has seen a great deal of interest in the past decade [3,4]. These logic devices rely on material properties to store information or perform logical operations. Nano-magnetic logic (NML), first introduced in [12,13], is a prominent concept in this category. This concept defines the state variable as magnetization direction (perpendicular magnetization), and information is processed through dipolar coupling between nano-magnets. This allows computation to take place without passing any electric current, so NML devices consume ultra-low power [6]. However, these devices possess certain limitations: their operating frequency is restricted to about 3 MHz and their physical size to around 200 nm × 200 nm [14]. In contrast to the NML concept, a novel logic scheme based on bistable canted magnetization states was proposed in [15]. This scheme utilizes direct exchange interaction between two canted regions to perform logic operations and proves to be fast and power-efficient in comparison to other spin-based logic schemes [8,14].
The performance of these devices and their ability to perform logic depend on various factors such as input field conditions and magnetization behavior. The key to characterizing the behavior of a new design is to identify the input conditions for which the logic device behaves as desired. Traditionally, this is carried out by running a wide range of micromagnetic simulations (full grid sampling) using a simulator (such as the Object-Oriented Micromagnetic Framework (OOMMF) or mumax [16,17]) for the micromagnetic system under study. The complexity of these devices increases with the number of logic structures, which makes the simulations severely computationally expensive. Hence, there is a need to characterize these devices with minimal computational requirements.
Data-efficient machine learning (DEML) techniques have proven useful at reducing the computational requirements of a variety of engineering problems [18–21]. These techniques can be applied to micromagnetic problems for various objectives. One of the key applications of DEML in micromagnetics could be the characterization of the behavior of a new design. The behavior of a device can be defined as operating if it performs a useful logical process and as not operating if it does not behave in the desired manner. In machine learning, such a problem is known as a classification problem, which aims to separate a set of inputs into distinct groups (or classes). This is achieved by training a classifier (or model) on a set of training data. Traditionally, this is a fixed data set. However, when data is expensive, the training data set can be generated via an adaptive sampling strategy. The adaptive algorithm starts with a small initial training set and adaptively enriches it by adding new samples from interesting regions of the design space. The trained classifier can then be used to predict labels for any new unlabeled data. The main advantage of DEML techniques over traditional practice is that full grid sampling is not required: a small training set is sufficient to characterize device behavior, which significantly expedites the characterization.
To apply novel DEML techniques to micromagnetic devices, we have considered a device that utilizes direct exchange interaction between two canted regions to perform logic operations [15]. Several state-of-the-art sampling strategies are combined with machine learning classification models. This paper evaluates the performance of the Explicit Design Space Decomposition (EDSD), Neighborhood-Voronoi (NV), Probability of Feasibility (PoF), and Entropy [18,22–24] sampling strategies. The performance of each technique is compared using three classifiers, Support Vector Machines (SVM), Gaussian process (GP), and Logistic regression (LR) [25–27], built on training data obtained by the adaptive sampling strategies. A preliminary analysis of the problem is presented in [28]. In the next section, the adaptive sampling algorithms and the classification procedure are discussed.

Classification Methods
An adaptive sampling algorithm is used to intelligently select new training data in the input space in a sequential way. The adaptive sampling process can be model-dependent or model-independent depending on the sampling criteria or information utilized in the sampling process.
In the context of this work, we have used various adaptive sampling schemes that perform exploration and/or exploitation in the design space (NV, EDSD, PoF, and Entropy [18,22–24]). These techniques are discussed in the following subsections.

Neighborhood-Voronoi
Neighborhood-Voronoi is derived from the LOLA-Voronoi algorithm [29]. It is a model-independent algorithm with two components, exploration (space-filling) and exploitation (refining boundaries), which are combined to identify the boundaries between different class labels in the input space. NV maintains a balance between the two components, which allows the identification of previously undiscovered regions in the input space. One of the key advantages of NV is that no intermediate model is required during the selection of new training samples, which makes it extremely efficient to execute.
Given a set of K points, the exploration component ensures that the input space is sampled as evenly as possible. A Voronoi tessellation partitions the input space into cells $C_k$, and the volume of each Voronoi cell is computed and assigned as a score $V(x_k)$ (Equation (1)). Cells with larger relative volumes correspond to sparsely sampled areas and receive a higher score.
The NV exploitation component refines the boundaries between different classes by favoring new samples in those regions. It computes a neighborhood of N points $\{x_k^n\}_{n=1}^{N}$ for each chosen point $x_k$. The neighbors should lie as close as possible to the chosen point while at the same time being far apart from each other. Once the neighborhood is constructed, the labels of all neighbors $L(x_k^n)$ are compared for any disagreement (mismatch). Any disagreement indicates a boundary region, and a higher score $W(x_k)$ is assigned to that Voronoi cell (Equation (1)). Finally, both scores are combined into $G(x)$ for each Voronoi cell and the cells are ranked (Equation (2)). The next sample location is then selected from the highest-ranked cell.
This is achieved by generating t random points in the ranked Voronoi cell for each $x_k$ and choosing the point that is farthest away from the existing samples (Equation (3)). The process is repeated until the input region is sufficiently covered.
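The NV selection step described above can be illustrated with a short Python sketch. It is not the SUMO toolbox implementation: the cell volumes are estimated by Monte Carlo, plain nearest neighbors stand in for the neighborhood construction, and the two scores are combined by a simple sum. The names `nv_scores` and `nv_next_sample` and all parameter defaults are hypothetical, and the inputs are assumed to be scaled to the unit hypercube.

```python
import numpy as np


def nv_scores(X, labels, n_neighbors=4, n_probes=5000, rng=None):
    """Score every Voronoi cell of the samples X (assumed scaled to [0, 1]^d)."""
    rng = np.random.default_rng(rng)
    n_samples, dim = X.shape
    # Exploration: Monte Carlo estimate of each cell's relative volume V(x_k).
    probes = rng.random((n_probes, dim))
    owner = np.linalg.norm(probes[:, None] - X[None], axis=2).argmin(axis=1)
    counts = np.bincount(owner, minlength=n_samples)
    V = counts / n_probes
    # Exploitation: W(x_k) = 1 if any neighbor carries a different label.
    # Assumption: plain nearest neighbors replace the NV neighborhood rule.
    pair = np.linalg.norm(X[:, None] - X[None], axis=2)
    np.fill_diagonal(pair, np.inf)
    nbrs = np.argsort(pair, axis=1)[:, :n_neighbors]
    W = (labels[nbrs] != labels[:, None]).any(axis=1).astype(float)
    # Assumption: the scores are combined by a sum; cells without probes are skipped.
    G = np.where(counts > 0, V + W, -np.inf)
    return V, W, G, probes, owner


def nv_next_sample(X, labels, **kwargs):
    """Pick the probe in the top-ranked cell that is farthest from all samples."""
    V, W, G, probes, owner = nv_scores(X, labels, **kwargs)
    best_cell = int(np.argmax(G))
    candidates = probes[owner == best_cell]
    d_min = np.linalg.norm(candidates[:, None] - X[None], axis=2).min(axis=1)
    return candidates[np.argmax(d_min)]
```

Because no classifier is fitted inside the loop, each selection costs only distance computations, which reflects why NV is cheap to execute.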

Explicit Design Space Decomposition
Explicit Design Space Decomposition [22] is a model-dependent technique that identifies boundaries between different classes. It requires intermediate models to be built during the selection of new samples. These models are explicitly used to define nonlinear boundaries or disjoint regions in the input space. These boundaries are treated as a limit state function (or optimization constraint) that is iteratively refined by adding new samples selected from regions where the misclassification probability is highest. The reconstruction of the classification boundaries continues until a convergence criterion is met. Typically, EDSD uses Support Vector Machines (SVMs) to construct the limit state function, since the SVM algorithm can efficiently handle discontinuities in the region. An SVM limit state function can be defined as:
$$s(x) = b + \sum_{i=1}^{N} \lambda_i y_i K(x_i, x)$$
where b is a scalar quantity, also referred to as the bias, $\lambda_i$ are the Lagrange multipliers, $y_i \in \{-1, +1\}$ are the class labels of the N training points $x_i$, and K is the SVM kernel function. The SVM limit state function (Equation (9)) can be used to classify any arbitrary point in the design space according to the sign of s.
The selection of new training points begins by generating initial training samples using Design of Experiments (DOE). In this work, Centroidal Voronoi Tessellation (CVT) [30] is used to generate the initial samples. The binary outputs for these samples are then calculated and an SVM limit state function is constructed. The SVM decision boundary is continuously refined by sampling new points that lie on the boundary and maximize the distance to the nearest training sample. The process continues until the convergence criterion is met. The complete algorithm is shown in Figure 1.
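A minimal Python sketch of one EDSD refinement step is given below, using scikit-learn rather than the CODECS toolbox. EDSD poses the point selection as a constrained optimization; here it is only approximated by drawing random candidates, keeping those close to the SVM decision boundary, and returning the one farthest from the existing training samples. The function name `edsd_next_sample`, the tolerance `tol`, the kernel settings, and the unit-hypercube scaling are all assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC


def edsd_next_sample(X, y, n_candidates=20000, tol=0.05, rng=None):
    """Approximate 'point on the boundary with maximal distance to the data'."""
    rng = np.random.default_rng(rng)
    # Intermediate model: an SVM limit state function s(x), boundary at s(x) = 0.
    svm = SVC(kernel="rbf", C=1e3, gamma="scale").fit(X, y)
    candidates = rng.random((n_candidates, X.shape[1]))
    s = svm.decision_function(candidates)
    near = np.abs(s) <= tol
    if not near.any():
        # Fallback: keep the 100 candidates closest to the boundary.
        near = np.zeros(n_candidates, dtype=bool)
        near[np.argsort(np.abs(s))[:100]] = True
    boundary = candidates[near]
    # Maximin choice: farthest boundary candidate from any training sample.
    d_min = np.linalg.norm(boundary[:, None] - X[None], axis=2).min(axis=1)
    return boundary[np.argmax(d_min)], svm
```

In use, the returned point would be labeled by a simulation, appended to the training set, and the step repeated until the boundary stops changing.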

Probability of Feasibility
Probability of Feasibility is a model-based approach for adaptive design. In a classification problem [31], it gives a probabilistic estimate of a probabilistic classifier. The PoF criterion selects new samples in the design space for which the prediction has a high probability of remaining below a certain limit or threshold ($g_{min}$). In this work, this probability is multiplied by the candidate variance $\sigma^2_{x,D}$ to include a component of exploration. Using the PoF criterion, a new point $x_{new}$ can be selected as:
$$x_{new} = \arg\max_{x} \; P\left(F(x) \le g_{min}\right) \sigma^2_{x,D} = \arg\max_{x} \; \Phi\left(\frac{g_{min} - \mu_{x,D}}{\sigma_{x,D}}\right) \sigma^2_{x,D}$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. The PoF criterion is mostly used with Gaussian process or Kriging models. In Equation (5), F(x) is a random variable with prediction mean $\mu_{x,D}$ and variance $\sigma^2_{x,D}$ at any point x, and D is the observed data.
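As a sketch of how such a criterion could be evaluated, the following Python snippet scores candidate points from a given posterior mean and variance, assuming the criterion is the product of the probability $\Phi((g_{min} - \mu)/\sigma)$ and the variance, as described in the text. The helper names `normal_cdf` and `pof_acquisition` and the default threshold `g_min=0.0` are illustrative and are not taken from GPflowOpt.

```python
import numpy as np
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal CDF, evaluated elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(z, dtype=float) / sqrt(2.0)))


def pof_acquisition(mu, var, g_min=0.0):
    """Probability of staying below g_min, weighted by the predictive variance."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    sigma = np.sqrt(np.maximum(var, 1e-12))  # guard against zero variance
    return normal_cdf((g_min - mu) / sigma) * var


# Three hypothetical candidates with equal variance: the lowest mean scores highest.
scores = pof_acquisition(mu=[-1.0, 0.0, 1.0], var=[1.0, 1.0, 1.0])
best = int(np.argmax(scores))
```

The variance factor suppresses candidates that are already well resolved, so a confidently feasible point with negligible variance can rank below an uncertain one.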

Entropy
Entropy is a model-dependent approach that can be interpreted as a measure of homogeneity (or uncertainty) in the data. Using Bayesian active learning, the information gain can be expressed in terms of predictive entropy, and parameter uncertainty can be minimized. Higher weight is given to samples that maximize the decrease in expected posterior entropy. In this work, we used the Bayesian Active Learning by Disagreement (BALD) algorithm [32], which computes entropies in the binary output space. Using BALD, the new point $x_{new}$ that minimizes the entropy is obtained from Equation (7), where $\theta$ is a latent parameter that controls the dependence between the input x and the output y, i.e., $p[y|x, \theta]$, with p being the posterior distribution. The BALD algorithm requires the posterior mean and variance to be computed. These posterior quantities can be easily computed for each x using a GP model.
For any x, the objective in Equation (7) is simplified to Equation (8).
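For illustration, a BALD-type score can be computed from the GP posterior mean and variance with the closed-form approximation commonly used for GP classification with a probit likelihood. This is a sketch under that assumption; the text does not confirm that it matches Equation (8) exactly. The score is the predictive entropy minus the expected entropy under the latent posterior, both in bits. The names `binary_entropy` and `bald_score` are hypothetical.

```python
import numpy as np
from math import erf, sqrt, pi, log


def normal_cdf(z):
    """Standard normal CDF, evaluated elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(z, dtype=float) / sqrt(2.0)))


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))


def bald_score(mu, var):
    """Approximate information gain about the latent function at each point."""
    mu = np.asarray(mu, dtype=float)
    var = np.maximum(np.asarray(var, dtype=float), 0.0)
    c2 = pi * log(2.0) / 2.0  # constant of the Gaussian approximation to h(Phi(f))
    predictive = binary_entropy(normal_cdf(mu / np.sqrt(var + 1.0)))
    expected = np.sqrt(c2 / (var + c2)) * np.exp(-mu**2 / (2.0 * (var + c2)))
    return predictive - expected


# Hypothetical candidates: uncertain near the boundary, uncertain far away, certain.
scores = bald_score(mu=[0.0, 3.0, 0.0], var=[4.0, 4.0, 0.01])
x_new_index = int(np.argmax(scores))
```

A point with zero latent variance scores zero even when its predicted probability is 0.5, which distinguishes this criterion from plain predictive-entropy sampling.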

Classifier Description
In this section, the classifiers used in this work (SVM, GP, and LR) are briefly discussed. For a detailed discussion, interested readers are referred to [25–27]. First, the SVM classifier can be given as:
$$s(x) = b + \sum_{i=1}^{N} \lambda_i y_i K(x_i, x)$$
where b is a scalar quantity, also referred to as the bias, $\lambda_i$ are the Lagrange multipliers, $y_i \in \{-1, +1\}$ are the class labels of the N training points $x_i$, and K is the SVM kernel function. A suitable selection of the kernel function is vital for the performance of the SVM model.
The Gaussian process model is widely used in regression problems owing to its well-defined posterior formulation and computation. For classification problems, the posterior quantities cannot be computed directly and a suitable approximation is required. Classification models aim to predict the class label ($y_i$) for given test inputs ($x_i$). In a binary case, the probability of classifying $x_i$ into one of the two classes is given by:
$$p(y_i = 1 \mid x_i, M) = \int p(y_i = 1 \mid f_i)\, p(f_i \mid x_i, M)\, df_i$$
where M is the training set and f is the latent mapping function, with $f_i = f(x_i)$. In many cases, the above expression is intractable and a suitable approximation (such as the Laplace approximation or expectation propagation) is required in order to obtain the prediction mean and variance.
Finally, in a binary case, the probability of classifying $x_i$ into one of the two classes using Logistic Regression is:
$$p(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{j} \beta_j x_{ij}\right)\right)}$$
where the coefficients $\beta_j$ can be obtained by maximum likelihood estimation.
In the next step, the SVM, GP, and LR classifiers are trained on the training data collected by each adaptive algorithm. The constructed classifiers are validated against labeled test data to assess model performance. In the final step, the performance of all considered adaptive sampling algorithms is compared across the different classifiers using various classification performance metrics [33]. The complete adaptive classification process is summarized in Figure 2. The initial samples are obtained by a Latin Hypercube Design (LHD) [34] and the corresponding output values (labels) are obtained by micromagnetic simulations. Next, a new sample (or batch of samples) is obtained via an adaptive design algorithm. The labels of the additional samples are computed by micromagnetic simulations for mode M1 and by sub-sampling from a dense grid for mode M2. The process is repeated until the stopping criterion is reached. The classifier is then constructed on the final training data set. Note that in Figure 2, the highlighted portion and dotted area correspond to the loop for model-dependent techniques only.
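The overall loop can be sketched in Python with SciPy and scikit-learn. The sketch makes several assumptions that are not part of this work: `toy_oracle` is a hypothetical analytic stand-in for a micromagnetic simulation, the adaptive step queries the candidate whose intermediate GP class probability is closest to 0.5 (a generic uncertainty criterion, not one of the four strategies above), and the classifier settings are library defaults. The 5 initial and 30 total samples mirror the budget of the binary case.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def toy_oracle(X):
    """Hypothetical stand-in for a micromagnetic simulation (returns 0/1 labels)."""
    return ((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2 < 0.09).astype(int)


def adaptive_training_set(oracle, n_init=5, budget=30, n_candidates=2000, seed=0):
    rng = np.random.default_rng(seed)
    X = qmc.LatinHypercube(d=2, seed=seed).random(n_init)  # initial LHD
    y = oracle(X)
    while len(X) < budget:
        cand = rng.random((n_candidates, 2))
        if np.unique(y).size < 2:
            # Only one class seen so far: explore with a maximin criterion.
            score = np.linalg.norm(cand[:, None] - X[None], axis=2).min(axis=1)
        else:
            # Model-dependent step: query where the intermediate GP is least sure.
            gp = GaussianProcessClassifier(random_state=seed).fit(X, y)
            score = -np.abs(gp.predict_proba(cand)[:, 1] - 0.5)
        x_new = cand[np.argmax(score)]
        X = np.vstack([X, x_new])
        y = np.append(y, oracle(x_new[None]))  # label the new sample
    return X, y


def compare_classifiers(X, y, X_test, y_test):
    """Train SVM, GP and LR on the final training set; return test accuracies."""
    models = {
        "SVM": SVC(kernel="rbf", gamma="scale"),
        "GP": GaussianProcessClassifier(random_state=0),
        "LR": LogisticRegression(),
    }
    return {name: float((m.fit(X, y).predict(X_test) == y_test).mean())
            for name, m in models.items()}


# Dense test grid on the unit square (the grid resolution is an arbitrary choice).
g = np.linspace(0.0, 1.0, 41)
X_test = np.array(np.meshgrid(g, g)).reshape(2, -1).T
X_train, y_train = adaptive_training_set(toy_oracle)
accuracy = compare_classifiers(X_train, y_train, X_test, toy_oracle(X_test))
```

For a model-independent strategy, the intermediate GP fit inside the loop would be dropped and the score computed from the samples and labels alone.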

Logic Device Description
The structure of the logic device [15] evaluated is shown in Figure 3a. The device is 2 nm thick, 20 nm wide, and 80 nm long. It consists of two regions: input (R1) and output (R2). R1 and R2 are interconnected through a magnetic bus and have in-plane and out-of-plane magnetic anisotropy along the y direction, respectively. To achieve a bistable canted magnetization, the lengths of the R1 and R2 regions are fixed at 20 nm. To avoid any strong exchange coupling between R1 and R2, the interconnect length is set to 40 nm. Four possible combinations of the R1/R2 states can be defined based on the bistability of the canted magnetic regions. In the absence of an external magnetic field, these states are defined as in Figure 3b. The regions R1 and R2 can have a magnetization state '0' or '1', described by $M_y/M_s \cong 0.2$ and $M_y/M_s \cong -0.2$, respectively. The device is triggered by the application of an external magnetic field, as shown in Figure 3a, and a logic operation is performed. The applied magnetic field is parameterized by its amplitude ($H_R$) and duration ($T_R$), and the device responds according to these values. The external field is applied at region R1 with the aim of controlling the response of region R2. Thereafter, via magnetic exchange interaction, the logic state (0/1) of region R1 is transmitted to region R2.
In order for the device to perform logic, two possible logic operation modes, M1 and M2, are defined; both are described in Table 1. Each entry 'XX' with X ∈ {0, 1} in Table 1 represents the logic state of the entire structure; for instance, '01' means that '0' and '1' are the logic states of regions R1 and R2, respectively. The four possible stable states of the structure's magnetization are '00', '01', '10', and '11', as shown in Figure 3b. Under the application of an input field (triggering field), the magnetization dynamics of the input and output regions of the structure (see Figure 3a) change. The switching behavior depends on the magnitude, direction, and duration of the input field. We have therefore considered two cases based on the direction of the applied input field; see Figure 4.

Figure 4. (a) Input field applied along the y-axis: corresponds to mode M1; (b) input field applied along the z-axis: corresponds to mode M2 (BUF and INV).
In this work, the field amplitude and duration are parameterized in the domains $0.5 \le H_R \le 8$ (in kA/m) and $0.1 \le T_R \le 0.5$ (in ns), respectively, for input field application along the y-axis. In the case of field application along the negative z-axis, the field amplitude is parameterized in the domain $-8 \le H_R \le -0.5$ (in kA/m), while $T_R$ is the same as in case (a). The initial and final states of the regions R1/R2 correspond to the states before and after the application of the external field. Based on the input and output states of the regions, it can be determined whether each region switched or not. In this way, interesting operating conditions can be extracted from all input field conditions. For any triggering field, the logic behavior of the structure in mode M1/M2 can be extracted by micromagnetic simulations.

Results
The adaptive sampling process starts with initial samples, which are based on LHD. The corresponding labels are obtained by micromagnetic simulations using OOMMF. To assess the performance of all techniques, a test set of 1271 samples is used. These samples are generated on a full test grid designed to sufficiently characterize device behavior. The results from the adaptive sampling strategies are also compared with an equivalent one-shot design generated by LHD on the various classifiers. The performance of all considered approaches is assessed by employing the following classification performance metrics, given for a binary case as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
The OOMMF micromagnetic solver [16] is used to perform all micromagnetic simulations, whereas GPflowOpt [35], an open-source Python-based package, is used to perform adaptive sampling based on the PoF and Entropy criteria. The NV samples are generated using the SUMO toolbox [31,36]. To generate the EDSD samples, the CODECS toolbox [37] is used.
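The binary metrics can be computed directly from the confusion counts, as in the short Python sketch below. It assumes the standard definitions of accuracy, precision, and recall; the function name `binary_metrics` and the choice of class 1 as the positive class are illustrative.

```python
import numpy as np


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and misclassification percentage."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "misclassified_percent": 100.0 * (fp + fn) / total,
    }


# Hypothetical example: one wrong prediction on a test set of 1271 points.
y_true = np.zeros(1271, dtype=int)
y_true[:400] = 1
y_pred = y_true.copy()
y_pred[0] = 0  # a single false negative
metrics = binary_metrics(y_true, y_pred)
```

On a test set of this size a single misclassified point corresponds to an accuracy of about 99.92%, so differences of a few hundredths of a percent between classifiers amount to only one or two test points.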

Input Field Along Y-Axis
A positive external field is applied along the +y-axis as shown in Figure 4a. In the absence of any external field, the initial magnetization state of the structure is '01'. We are interested in the final state of R2 after the application of the external field. After field application, the final magnetization state of the structure will be either feasible ('00', '10') or infeasible ('01', '11'). This presents a binary classification problem where the magnetization states '0' and '1' are represented by class 0 and class 1. The adaptive sampling algorithms start from 5 initial samples obtained by a space-filling LHD, with the exception of EDSD, where Centroidal Voronoi Tessellation (CVT) is used. The training data is extended adaptively by adding one sample at a time until a total training budget of 30 samples is reached.
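The labeling rule for this case can be written as a small Python sketch. It assumes that the logic state of a region is decoded from the sign of its normalized magnetization $M_y/M_s$ (approximately +0.2 for '0' and −0.2 for '1'); the sign threshold and the function names `decode_state` and `label_m1` are illustrative choices, not part of the simulation setup.

```python
def decode_state(my_over_ms):
    """Map the normalized magnetization M_y/M_s of a region to '0' or '1'."""
    # Assumption: the sign is used as the decision threshold.
    return "0" if my_over_ms > 0 else "1"


def label_m1(final_r1, final_r2):
    """Return (structure state, class label, feasible) for the +y field case."""
    state = decode_state(final_r1) + decode_state(final_r2)
    feasible = state in ("00", "10")  # R2 ended in state '0'
    class_label = int(state[1])       # class follows the final state of R2
    return state, class_label, feasible


# Hypothetical final magnetizations of R1 and R2 after the field pulse.
result = label_m1(-0.2, 0.2)
```

With the initial state '01', a feasible outcome therefore means that R2 switched from '1' to '0', regardless of the final state of R1.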
The training samples generated by the adaptive sampling techniques (EDSD, NV, PoF, and Entropy) and the one-shot design (LHD) are plotted in Figure 5. For reference, the true operating and no-operation regions are plotted in Figure 5a. While EDSD performed more exploitation around the class boundaries (Figure 5c), the NV samples maintained a balance between exploration and exploitation (Figure 5d). In the case of Entropy and PoF, the samples obtained are well defined along the class boundary (Figure 5e,f). In contrast, LHD results in a space-filling design (Figure 5b) and clearly neglects to sample around the class boundary. Overall, a good distribution of samples is observed for the EDSD, NV, Entropy, and PoF techniques. Table 2 summarizes the performance of all classifiers built on the different training samples. In the case of PoF, 99.92% classification accuracy is achieved, while the one-shot design resulted in a classifier accuracy of 97.4%. Note that with a total of 20 samples (5 initial + 15 adaptive) obtained using the PoF algorithm, 99% classification accuracy is achieved with a GP classifier, which is a significant improvement over the accuracy obtained using an LHD of 30 samples. Overall, NV, EDSD, Entropy, and PoF performed marginally better than the one-shot design. The sampling schemes EDSD, Entropy, and PoF sample well around the decision boundary, which is reflected in the higher precision and recall values. Moreover, EDSD performs exploitation uniformly around the class boundary. Table 2 also compares the number of observations misclassified by each classifier, per class, for the different sampling schemes. These results highlight how accurately a classifier can predict labels around the class boundaries on a test set. Overall, the GP and SVM classifiers built on PoF samples result in the fewest misclassified observations in all classes.
The misclassification error (in percent), computed as the percentage of all test observations misclassified by a classifier, is visualized in Figure 6a for all cases. PoF and EDSD resulted in the fewest misclassified observations (0.078% and 0.236%), while LHD and NV have the highest misclassification errors (2.59% and 2.36%) in the best scenario. The worst performance is reported by the LR classifier in all cases. Since LR is a linear model and the decision boundary is highly nonlinear, this was expected.
Figure 6. Total misclassified observations reported for each classifier type for test data (total percent): (a) external field along +y-axis and (b) external field along −z-axis.

Input Field Along Z-Axis
This case corresponds to field application along the negative z-axis, as shown in Figure 4b. Two state propagation behaviors are possible, i.e., normal state propagation (buffer mode: BUF) and inverted state propagation (inverter mode: INV), corresponding to the transitions in mode M2 (Table 1). In the absence of any external field, the initial state of the structure is '01' and '00' for the BUF and INV modes, respectively. The feasible and infeasible final states are ('00', '10') and ('01', '11') for the BUF mode, and ('01', '11') and ('00', '10') for the INV mode. This is a multi-class classification problem in which the class labels 0, 1, and 2 are assigned to the NOOP (no operation), BUF, and INV modes, respectively. This case is comparatively complex to classify, as some regions of the input space have a very small area (Figure 7a). The adaptive sampling starts from 20 samples arranged in a translational propagation Latin hypercube design (TPLHD), and new samples are added one at a time until a total of 100 training samples is reached.
The training samples obtained by running the various sampling techniques are plotted in Figure 7. For reference, the true BUF, INV, and NOOP regions are plotted in Figure 7a. It is observed that in the case of Entropy and PoF, the samples are densely selected around the edges of the feasible regions (class boundaries). Moreover, with the same initial design for PoF, Entropy, and NV, both Entropy and NV are able to identify all operating/NOOP regions (fully or partially), while PoF completely failed to identify the narrow BUF region. The EDSD algorithm is able to sample around all decision boundaries. However, it focused on local exploration around the decision boundaries while missing global exploration (Figure 7c). The NV algorithm is able to perform global exploration and exploitation and results in sparser samples than EDSD. The performance of the SVM, GP, and LR classifiers constructed on the different training sets is reported in Table 3 for all cases. A significant improvement is observed in the classification accuracy of models constructed on adaptive samples over LHD. Overall, the best accuracy is achieved by a GP classifier with the PoF criterion. It shows an improvement of 4.72%, 5%, and 5% in accuracy, precision, and recall, respectively, over the best performing classifier built on LHD. The number of observations misclassified by each classifier for each class is also reported in Table 3. The probability of misclassification is higher in the regions around the boundaries, so the misclassification error also indicates how accurately the boundaries are identified. In Figure 6b, the total number of misclassified observations is analyzed for all classifiers. It can be seen that the SVM and GP classifiers built on PoF, EDSD, and NV have similarly low misclassification errors, i.e., the regions around the boundary are well identified. Moreover, the GP classifier utilizing PoF shows an improvement of 4.72% over a GP model built on LHD.
The LR model performs poorly in all cases because of its linear behavior. These results are visualized in Figure 8 where true and learned class boundaries and misclassified observations are plotted for selected cases.

Conclusions
A novel application of Data-Efficient Machine Learning (DEML) techniques is presented to characterize the behavior of non-charge-based logic devices. The adaptive sampling techniques substantially reduce the number of simulations (samples) required to characterize the dependence of the logic behavior on the input field conditions. The performance of the various models and input-output-based adaptive sampling techniques is evaluated on classifiers built for binary and multi-class classification problems. The classification based on the adaptive sampling strategy significantly outperformed one-shot design and full grid sampling. In future work, the application of data-efficient machine learning techniques will be expanded to more challenging problems such as majority-based logic structures.