MEG Source Localization via Deep Learning

We present a deep learning solution to the problem of localizing magnetoencephalography (MEG) brain signals. The proposed deep model architectures are tuned to single and multiple time point MEG data, and can estimate varying numbers of dipole sources. Results from simulated MEG data on the cortical surface of a real human subject demonstrated improvements over the popular RAP-MUSIC localization algorithm in specific scenarios with varying SNR levels, inter-source correlation values, and numbers of sources. Importantly, the deep learning models were robust to forward model errors resulting from head translation and rotation, and reduced computation time to a fraction of a millisecond, paving the way to real-time MEG source localization.


I. INTRODUCTION
Accurate localization of functional brain activity holds promise to enable novel treatments and assistive technologies that our aging society critically needs. The aging of the world population has increased the prevalence of age-related health problems, such as physical injuries, mental disorders, and stroke, leading to severe consequences for patients, families, and the health care system. Emerging technologies can improve the quality of life of patients by i) providing effective neurorehabilitation, and ii) enabling independence in everyday tasks. The first challenge may be addressed by designing neuromodulatory interfacing systems that can enhance specific cognitive functions or treat specific psychiatric/neurological pathologies. Such systems could be driven by real-time brain activity to selectively modulate specific neurodynamics using approaches such as transcranial magnetic stimulation [1], [2] or focused ultrasound [3], [4]. The second challenge may be addressed by designing effective brain-machine interfaces (BMI). Common BMI control signals rely on primary sensory- or motor-related activation. However, these signals only reflect a limited range of cognitive processes. Higher-order cognitive signals, and specifically those from prefrontal cortex that encode goal-oriented tasks, could lead to more robust and intuitive BMI [5], [6].
Both neurorehabilitation and BMI approaches necessitate an effective and accurate way of measuring and localizing functional brain activity in real time. This can be achieved by electroencephalography (EEG) [7], [8] and MEG [9], [10], [11], two non-invasive electrophysiological techniques. EEG uses an array of electrodes placed on the scalp to record voltage fluctuations, whereas MEG uses sensitive magnetic detectors called superconducting quantum interference devices (SQUIDs) [12] to measure the same primary electrical currents that generate the electric potential distributions recorded in EEG. Since EEG and MEG capture the electromagnetic fields produced by neuronal currents, they provide a fast and direct index of neuronal activity. However, existing MEG/EEG source localization methods offer limited spatial resolution, confounding the origin of signals that could be used for neurorehabilitation or BMI, or are too slow to compute in real time.
Deep learning (DL) [13] offers a promising new approach to significantly improve source localization in real time. A growing number of works successfully employ DL to achieve state-of-the-art image quality for inverse imaging problems, such as X-ray computed tomography (CT) [14], [15], [16], magnetic resonance imaging (MRI) [17], [18], [19], positron emission tomography (PET) [20], [21], image super-resolution [22], [23], [24], photoacoustic tomography [25], synthetic aperture radar (SAR) image reconstruction [26], [27] and seismic tomography [28]. Here, we develop and present a novel DL solution to localize neural sources, and assess its accuracy and robustness with simulated MEG data using the sensor geometry of the whole-head Elekta Triux MEG system and a source space given by the cortical surface extracted from a structural MRI scan of a real human subject. While we focus on MEG, the same approaches are directly extendable to EEG, enabling a portable and affordable solution to source localization. The conceptual novelties of the proposed DL methods are: i) state-of-the-art localization accuracy of multiple sources from noisy MEG measurements, and ii) very fast computation time (fraction of a millisecond) for inference of the source locations.

II. BACKGROUND ON MEG SOURCE LOCALIZATION
Two noninvasive techniques, MEG and EEG, measure the electromagnetic signals emitted by the human brain and can provide a fast and direct index of neural activity suitable for real-time applications. The primary source of these electromagnetic signals is widely believed to be the electrical currents flowing through the apical dendrites of pyramidal neurons in the cerebral cortex. Clusters of thousands of synchronously activated pyramidal cortical neurons can be modeled as an equivalent current dipole (ECD). The current dipole is therefore the basic element used to represent neural activation in MEG and EEG localization methods.
In this section we briefly review the notation used to describe the measurement data, forward matrix, and sources, and formulate the problem of estimating current dipoles. Consider an array of M MEG or EEG sensors that measures data from a finite number Q of equivalent current dipole (ECD) sources emitting signals {s_q(t)}_{q=1}^{Q} at locations {p_q}_{q=1}^{Q}. Under these assumptions, the M × 1 vector of signals received by the array is given by

y(t) = Σ_{q=1}^{Q} l(p_q) s_q(t) + n(t),

where l(p_q) is the topography of the dipole at location p_q and n(t) is the additive noise. The topography l(p_q) is given by

l(p_q) = L(p_q) q,

where L(p_q) is the M × 3 forward matrix at location p_q and q is the 3 × 1 vector of the orientation of the ECD source.
Depending on the problem, the orientation q may be known, referred to as fixed-oriented dipole, or it may be unknown, referred to as freely-oriented dipole.
Assuming that the array is sampled N times at t_1, ..., t_N, the matrix Y of the sampled signals can be expressed as

Y = A(P) S + N,

where Y = [y(t_1), ..., y(t_N)] is the M × N matrix of the received signals, A(P) = [l(p_1), ..., l(p_Q)] is the M × Q mixing matrix of the topography vectors at the Q locations P = [p_1, ..., p_Q], S is the Q × N matrix of the sources, with rows s_q = [s_q(t_1), ..., s_q(t_N)], and N is the M × N matrix of noise samples. Mathematically, the localization problem can be cast as an optimization problem of computing the location and moment parameters of the set of dipoles whose field best matches the MEG/EEG measurements in a least-squares (LS) sense [29]. In this paper we focus on solutions that solve for a small, parsimonious set of dipoles and avoid the ill-posedness associated with imaging methods that yield distributed solutions, such as minimum-norm [9]. Solutions that estimate a small set of dipoles include the dipole fitting and scanning methods. Dipole fitting methods solve the optimization problem directly using techniques that include gradient descent, the Nelder-Mead simplex algorithm, multistart, genetic algorithms, and simulated annealing [30], [31], [32], [33]. However, these techniques remain unpopular because they converge to suboptimal local optima or are too computationally expensive.
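As a concrete illustration of the sampled data model, the following NumPy sketch builds Y = A(P)S + N with random stand-in topography vectors; the dimensions and source signals are illustrative, not the paper's actual forward model:

```python
import numpy as np

rng = np.random.default_rng(0)

M, Q, N = 306, 2, 16          # sensors, dipole sources, time samples (illustrative)

# A(P): M x Q mixing matrix of topography vectors (random stand-ins for l(p_q))
A = rng.standard_normal((M, Q))

# S: Q x N matrix of source time courses, here simple sinusoids
t = np.arange(N)
S = np.vstack([np.sin(2 * np.pi * 0.10 * t),
               np.sin(2 * np.pi * 0.20 * t)])

# Additive sensor noise
noise = 0.1 * rng.standard_normal((M, N))

# Y = A(P) S + N: M x N matrix of sampled sensor signals
Y = A @ S + noise
print(Y.shape)  # (306, 16)
```

In a real pipeline, the columns of A would come from the lead field of a head model rather than random draws.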
An alternative approach is scanning methods, which use adaptive spatial filters to search for optimal dipole positions throughout a discrete grid representing the source space [34]. Source locations are then determined as those for which a metric (localizer) exceeds a given threshold. While these approaches do not lead to true least-squares solutions, they can be used to initialize a local least-squares search. The most common scanning methods are beamformers [35], [36] and MUSIC [29], both widely used for bioelectromagnetic source localization, but they assume uncorrelated sources. When correlations are significant, they result in partial or complete cancellation of correlated (also referred to as synchronous) sources and distort the estimated time courses. Several multi-source extensions have been proposed for synchronous sources [37], [38], [39], [40], [41], [42], [43], [44], [45]; however, they require some a-priori information on the location of the synchronous sources, are limited to the localization of pairs of synchronous sources, or are limited in their performance.
One important division of the scanning methods is whether they are non-recursive or recursive. The original beamformer [35], [36] and MUSIC [29] methods are non-recursive and require the identification of the largest local maxima in the localizer function to find multiple dipoles. Some multi-source variants are also non-recursive (e.g., [46], [38], [40], [41]), and as a result they use brute-force optimization, assume that the approximate locations of the neuronal sources have been identified a priori, or still require the identification of the largest local maxima in the localizer function. To overcome these limitations, non-recursive methods have recursive counterparts, such as RAP-MUSIC [47], TRAP-MUSIC [48], Recursive Double-Scanning MUSIC [49], and RAP Beamformer [7]. The idea behind the recursive execution is that one finds the source locations iteratively, projecting out the topographies of the previously found dipoles before forming the localizer for the current step [47], [7]. In this way, one replaces the task of finding several local maxima with the easier task of finding the global maximum of the localizer at each iteration step. While recursive methods generally perform better than their non-recursive counterparts, they still suffer from several limitations, including limited performance, the need for high signal-to-noise ratio (SNR), non-linear optimization of source orientation angles and source amplitudes, and inaccurate estimation as correlation values increase. They are also computationally expensive due to the recursive estimation of sources.
III. BACKGROUND ON DEEP LEARNING
Inverse problems in signal and image processing were traditionally solved using analytical methods; however, recent DL [13] solutions, as exemplified in [50], [51], provide state-of-the-art results for numerous problems including X-ray computed tomography, magnetic resonance image reconstruction, natural image restoration (denoising, super-resolution, deblurring), synthetic aperture radar image reconstruction, and hyper-spectral unmixing, among others. In the following we review DL principles, which form the basis for the DL-based solutions to MEG source localization presented in the next section, including concepts such as network layers and activation functions, empirical risk minimization, gradient-based learning, and regularization.
DL is a powerful class of data-driven machine learning algorithms for supervised, unsupervised, reinforcement, and generative tasks. DL algorithms are built using Deep Neural Networks (DNNs), which are formed by a hierarchical composition of non-linear functions (layers). The main reason for the success of DL is the ability to train very high capacity (i.e., hypothesis space) networks using very large datasets, often leading to robust representation learning [52] and good generalization capabilities in numerous problem domains. Generalization is defined as the ability of an algorithm to perform well on unseen examples. In statistical learning terms, an algorithm A : X → Y is learned using a training dataset S = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ X is a data sample and y_i ∈ Y is the corresponding label (for example, source location coordinates). Let P(X, Y) be the true distribution of the data; then the expected risk is defined by

R(A) = E_{(x,y)∼P} [L(A(x), y)],

where L is a loss function that measures the misfit between the algorithm output and the data label. The goal of DL is to find an algorithm A within a given capacity (i.e., function space) that minimizes the expected risk; however, the expected risk cannot be computed since the true distribution is unavailable. Therefore, the empirical risk is minimized instead:

R_E(A) = (1/n) Σ_{i=1}^{n} L(A(x_i), y_i),

which approximates the statistical expectation with an empirical mean computed using the training dataset. The generalization gap is defined as the difference between the expected risk and the empirical risk: R(A) − R_E(A). By using large training datasets and high capacity algorithms, DL has been shown to achieve a low generalization gap, where an approximation of the expected risk is computed using the learned algorithm and a held-out testing dataset. In the following subsections we describe the main building blocks of DNNs, including the multi-layer perceptron and convolutional neural networks.

A. Multi-Layer Perceptron (MLP)
The elementary building block of the MLP is the perceptron, which computes a non-linear scalar function, termed activation, of an input x ∈ R^n, as follows:

a = f(w^T x + b),

where w is a vector of weights and b is a scalar bias. A common activation function is the Rectified Linear Unit (ReLU) [13], defined as

ReLU(z) = max(0, z),

and in this case the perceptron is given by

a = max(0, w^T x + b).

Other common activation functions are the LeakyReLU, sigmoid, and tanh (i.e., hyperbolic tangent). A single layer of perceptrons is composed of multiple perceptrons, all connected to the same input vector x, with a unique weight vector and bias per perceptron. A single layer of perceptrons can be formulated in matrix form as follows:

a = f(Wx + b),

where each row of the matrix W corresponds to the weights of one perceptron, and each element of the vector b corresponds to the bias of one perceptron. The MLP is composed of multiple layers of perceptrons, such that the output of each layer becomes the input to the next layer. Such a hierarchical composition of k non-linear functions is formulated as follows:

A(x; Θ) = f_k(W_k f_{k−1}(· · · f_1(W_1 x + b_1) · · ·) + b_k),

where θ_i = [W_i, b_i] are the parameters (i.e., weights and biases) of the i-th layer and Θ = [θ_1, θ_2, . . . , θ_k] is the set of all network parameters.
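The hierarchical composition above can be sketched directly; the following is a minimal NumPy forward pass of a two-layer MLP with ReLU activations (the weights are random stand-ins, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    # ReLU(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    """Hierarchical composition a_i = f(W_i a_{i-1} + b_i) over all layers."""
    a = x
    for W, b in params:
        a = relu(W @ a + b)
    return a

n_in, n_hidden, n_out = 8, 16, 3
params = [
    (rng.standard_normal((n_hidden, n_in)), np.zeros(n_hidden)),   # layer 1: (W_1, b_1)
    (rng.standard_normal((n_out, n_hidden)), np.zeros(n_out)),     # layer 2: (W_2, b_2)
]

x = rng.standard_normal(n_in)
y = mlp_forward(x, params)
print(y.shape)  # (3,)
```

Each row of W_i holds one perceptron's weights, matching the matrix formulation above; for regression outputs the final activation is often replaced by the identity.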
In the supervised learning framework, the parameters Θ are learned by minimizing the empirical risk, computed over the training dataset S. The empirical risk can be regularized in order to improve DNN generalization, by mitigating overfitting of the learned parameters to the training data. The regularized empirical risk is given by

J(Θ) = R_E(Θ) + α R(Θ),

where α ≥ 0 controls the weight of the regularization term, which is often chosen as Tikhonov regularization R(Θ) = ||Θ||_2^2 or L_1 regularization R(Θ) = ||Θ||_1, which promotes sparsity of the network parameters. The optimal set of parameters Θ* is obtained by solving

Θ* = argmin_Θ J(Θ),

where the minimization of the empirical risk is typically performed by iterative gradient-based algorithms, such as stochastic gradient descent (SGD):

Θ̂_{t+1} = Θ̂_t − λ ∇_Θ J(Θ̂_t),

where Θ̂_t is the estimate of Θ* at the t-th iteration, λ > 0 is the learning rate, and the approximated gradient ∇_Θ J(Θ) is computed by the back-propagation algorithm using a small random subset of examples from the training set S.
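As a toy illustration of the SGD update, the following sketch minimizes a Tikhonov-regularized least-squares risk, where the mini-batch gradient is available in closed form; all values are illustrative, and real DNN training computes the gradient by back-propagation instead:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear model y = w^T x with squared loss and L2 (Tikhonov) regularization
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
lam, alpha, batch = 0.05, 1e-4, 32   # learning rate, regularization weight, batch size

for t in range(500):
    idx = rng.choice(n, size=batch, replace=False)              # random mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch + 2 * alpha * w     # approximate grad J(w)
    w -= lam * grad                                             # SGD step: w <- w - lam * grad

print(np.linalg.norm(w - w_true))   # small residual error
```

The mini-batch sampling is exactly the "small random subset of examples" mentioned above; back-propagation generalizes the closed-form gradient to layered compositions.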

B. Convolutional Neural Networks
Convolutional Neural Networks (CNNs) were originally developed for processing input images, using the weight-sharing principle of a convolutional kernel that is convolved with the input data. The main motivation is to significantly reduce the number of required learnable parameters, as compared to processing a full image by perceptrons, namely, allocating one weight per pixel for each perceptron. A CNN is composed of one or more convolutional layers, where each layer is composed of one or more learnable kernels. For a 2D input I(i, j), a convolutional layer performs the convolution¹ between the input and the kernel(s):

C(i, j) = Σ_m Σ_n I(i + m, j + n) W(m, n),

where W(m, n) is the kernel and C(i, j) is the convolution result. A bias b is further added to each convolution result, and an activation function f(·) is applied, to obtain the feature map F(i, j), given by

F(i, j) = f(C(i, j) + b).

¹Some ML libraries implement the cross-correlation operation.
A convolutional layer with K kernels produces K feature maps, where 1D, 2D, or 3D kernels are commonly used. Convolutional layers are often immediately followed by subsampling layers, such as MaxPooling, which decimates information by picking the maximum value within a given array of values, or AveragePooling, which replaces a given array of values by their mean. CNNs are typically composed of a cascade of convolutional layers, optionally followed by fully-connected (FC) layers, depending on the required task. A frequently used architecture is the convolutional encoder-decoder, which learns a low-dimensional representation of the input (i.e., encoding), which is further utilized to reconstruct the output (i.e., decoding). Convolutional encoder-decoder architectures can be utilized for representation learning [52] in an unsupervised learning framework (i.e., the convolutional autoencoder [53]), or for complex output data reconstructions. A well-known convolutional encoder-decoder architecture is the U-Net [54], originally proposed for medical image segmentation. Another highly successful CNN architecture, termed ResNet, employs residual blocks with skip connections that enable the training of very deep networks [55].
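The feature-map computation described above can be sketched in NumPy for a single kernel, using the cross-correlation convention noted in the footnote; the input image and kernel values are illustrative:

```python
import numpy as np

def feature_map(I, W, b):
    """F(i, j) = f(sum_m sum_n I(i+m, j+n) W(m, n) + b), with f = ReLU ('valid' region)."""
    kh, kw = W.shape
    H, Wd = I.shape
    C = np.empty((H - kh + 1, Wd - kw + 1))
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            # elementwise product of the kernel with the patch under it
            C[i, j] = np.sum(I[i:i + kh, j:j + kw] * W)
    return np.maximum(0.0, C + b)   # add bias, apply ReLU activation

I = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input image
W = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel
F = feature_map(I, W, b=0.0)
print(F.shape)  # (4, 4)
```

A layer with K kernels would repeat this with K distinct (W, b) pairs, stacking the results into K feature maps; a MaxPooling step would then take block-wise maxima of F.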

IV. DEEP LEARNING FOR MEG SOURCE LOCALIZATION
In this section we present the proposed deep neural network (DNN) architectures and training data generation workflow.

A. MLP for Single-Snapshot Source Localization
MEG source localization is computed from sensor measurements using either a single snapshot (i.e., a single time sample) or multiple snapshots. The single-snapshot case is highly challenging for popular MEG localization algorithms, such as MUSIC [29], RAP-MUSIC [47], and RAP-Beamformer [7], all of which rely on the data covariance matrix. A single-snapshot estimate of the covariance matrix is often insufficient for good localization accuracy of multiple simultaneously active sources, especially at low and medium signal-to-noise ratios (SNR). Since the input in this case is a single MEG measurement vector, we implemented four-layer MLP-based architectures, where the input FC layer maps the M-dimensional snapshot vector to a higher dimensional vector and the output layer computes the source coordinates in 3D, as illustrated in Figure 2. We refer to this model as DeepMEG-MLP. We implemented three DeepMEG-MLP models, corresponding to Q = 1, 2, and 3 sources, as summarized in Table I.
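At the level of shapes, such a model can be sketched as follows; the hidden-layer widths here are hypothetical placeholders (the actual configurations are in Table I), and the weights are random stand-ins rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)

M, Q = 306, 2                          # sensors, assumed number of sources
widths = [M, 1024, 512, 256, 3 * Q]    # hypothetical widths for a four-layer MLP

# Random stand-in weights: input FC expands the snapshot, output FC gives coordinates
layers = [(rng.standard_normal((widths[i + 1], widths[i])) * 0.01,
           np.zeros(widths[i + 1]))
          for i in range(len(widths) - 1)]

def deepmeg_mlp(y_snapshot):
    a = y_snapshot
    for k, (W, b) in enumerate(layers):
        a = W @ a + b
        if k < len(layers) - 1:        # hidden layers non-linear; output layer linear
            a = np.maximum(0.0, a)
    return a.reshape(Q, 3)             # one (x, y, z) coordinate triplet per source

coords = deepmeg_mlp(rng.standard_normal(M))
print(coords.shape)  # (2, 3)
```

The key structural point is the output dimension 3Q: one model is trained per assumed source count, matching the three DeepMEG-MLP variants.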

B. CNN for Multiple-Snapshot Source Localization
For multiple consecutive MEG snapshots we implemented a CNN-based architecture with five layers, in which the first layer performs 1D convolutions on the input data, the resulting 1D feature maps are processed by three FC layers with sigmoid activations, and an output FC layer computes the source locations. We refer to this model as DeepMEG-CNN, illustrated in Figure 3. The 1D convolutional layer forms a bank of L = 32 space-time filters (which can also be interpreted as beamformers [56]). Each 1D temporal filter spans T = 5 time samples. A different 1D filter is applied to the time course of each of the M sensors with uniquely learned coefficients. We implemented three DeepMEG-CNN models, corresponding to Q = 1, 2, and 3 sources, as summarized in Table I.
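The first layer's per-sensor temporal filtering can be sketched as follows; the filter coefficients are random stand-ins (the trained layer learns unique coefficients per sensor and per filter bank):

```python
import numpy as np

rng = np.random.default_rng(4)

M, N, L, T = 306, 16, 32, 5    # sensors, snapshots, filters, taps (as in the text)

Y = rng.standard_normal((M, N))           # multiple-snapshot sensor data
filters = rng.standard_normal((L, M, T))  # a unique T-tap filter per sensor, per bank

def spacetime_filter(Y, filters):
    """Apply each bank's per-sensor 1D temporal filter to that sensor's time course."""
    L, M, T = filters.shape
    out = np.empty((L, M, Y.shape[1] - T + 1))
    for l in range(L):
        for m in range(M):
            out[l, m] = np.convolve(Y[m], filters[l, m], mode="valid")
    return out

F = spacetime_filter(Y, filters)
print(F.shape)  # (32, 306, 12)
```

Each of the L banks can be read as a learned space-time filter (or beamformer) whose outputs the subsequent FC layers combine into source coordinates.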

C. Data Generation Workflow
To train the deep network models and evaluate their performance on source localization, we need to know the ground truth of the underlying neural sources generating the MEG data. Since this information is unavailable in real MEG measurements of human participants, we performed simulations with an actual MEG sensor array and realistic anatomy and source configurations. Specifically, the sensor array was based on the whole-head Elekta Triux MEG system (306-channel probe unit with 204 planar gradiometer sensors and 102 magnetometer sensors) (Figure 4a). The geometry of the MEG source space was modeled with the cortical manifold extracted from a T1-weighted MRI structural scan of a real adult human subject using FreeSurfer [57]. This source configuration is consistent with the arrangement of pyramidal neurons, the principal source of MEG signals, in the cerebral cortex. Sources were restricted to approximately 15,000 grid points over the cortex (Figure 4b). The lead field matrix, which represents the forward mapping from the activated sources to the sensor array, was estimated using Brainstorm [58] based on an overlapping-spheres head model [59].
Simulated MEG sensor data was generated by first activating a few sources randomly selected on the cortical manifold with activation time courses s i (t). The time courses were modeled with 16 time points sampled as mixtures of sinusoidal signals. The corresponding sensor measurements were then obtained by multiplying each source with its respective topography vector l(p i ) (Figure 4c). Finally, Gaussian white noise was generated and added to the MEG sensors to model instrumentation noise at specific SNR levels. The SNR was defined as the ratio of the Frobenius norm of the signal-magnetic-field spatiotemporal matrix to that of the noise matrix for each trial as in [60].
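The noise-scaling step can be sketched as follows. The Frobenius-norm SNR definition follows the text, while the conversion of a dB figure to an amplitude ratio (20 log10) is our assumption; the topographies and time courses are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(5)

M, Q, N = 306, 2, 16
A = rng.standard_normal((M, Q))    # stand-ins for the topography vectors l(p_i)
t = np.arange(N)
freqs = rng.uniform(0.05, 0.2, (Q, 1))
S = np.sin(2 * np.pi * freqs * t)  # sinusoidal source time courses, 16 time points

signal = A @ S                     # noiseless spatiotemporal sensor matrix

def add_noise_at_snr(signal, snr_db, rng):
    """Scale white Gaussian noise so ||signal||_F / ||noise||_F hits the target SNR."""
    noise = rng.standard_normal(signal.shape)
    target_ratio = 10 ** (snr_db / 20)    # dB -> amplitude ratio (our assumption)
    noise *= np.linalg.norm(signal) / (target_ratio * np.linalg.norm(noise))
    return signal + noise

Y = add_noise_at_snr(signal, snr_db=10, rng=rng)
print(Y.shape)  # (306, 16)
```

Repeating this with freshly sampled source locations, frequencies, and noise realizations yields the labeled trials used for training and evaluation.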

A. Deep Network Training
The DeepMEG models were implemented in TensorFlow [61] and trained using the SGD algorithm (15) with a learning rate λ = 0.001 and a batch size of 32. The DeepMEG-MLP networks were trained with datasets of 1 million simulated snapshots, generated at a fixed SNR level; yet, as discussed in the following, these trained models operate well over a wide range of SNR levels. The DeepMEG-CNN networks were trained using data generated on the fly, with a total of 9.6 million multiple-snapshot signals per network. The DeepMEG-CNN models were trained with MEG sensor data at a fixed SNR level and random inter-source correlations, thus learning to localize sources over a wide range of inter-source correlation levels.

B. Localization Experiments
We assessed the performance of the deepMEG models using simulated data as described in the data generation workflow. To assess localization accuracy in different realistic scenarios, we conducted simulations with different SNR levels and intersource correlation values. We also varied the number of active sources to validate that localization is accurate even for multiple concurrently active sources.
During inference, we compared the performance of the deepMEG models against the popular scanning localization solution RAP-MUSIC [47]. All experiments were conducted with 1000 Monte-Carlo repetitions for each SNR and inter-source correlation value. In each scenario, we used the deepMEG and RAP-MUSIC models with the corresponding number of sources, which is assumed known by both methods.
1) Experiment 1: Performance of the DeepMEG-MLP model with single-snapshot data: We assessed the localization accuracy of the DeepMEG-MLP model against the RAP-MUSIC method for the case of two simultaneously active dipole sources. The DeepMEG-MLP model was trained with 10 dB SNR data, but inference used different SNR levels ranging from -10 dB to 20 dB (Figure 5). The DeepMEG model outperformed the RAP-MUSIC method at high SNR values, but had worse localization results at low SNR values (< 5 dB). As expected, both methods consistently improved their localization performance with increasing SNR.
2) Experiment 2: Performance of the DeepMEG-CNN model with multiple-snapshot data: We extended the above experiment to the case of multiple-snapshot data with T = 16 time samples and two or three sources with different inter-source correlation values. The DeepMEG-CNN model was trained with -15 dB SNR data, and inference used -15 dB, -12.5 dB, and -10 dB SNR. In the low SNR case (-15 dB), the DeepMEG-CNN consistently outperformed the RAP-MUSIC method, with the exception of high (0.9) correlation values, where RAP-MUSIC had slightly better accuracy (Figure 6a,b). As SNR increased to -12.5 dB, the DeepMEG model remained overall better than or comparable to RAP-MUSIC (Figure 6c,d). This advantage was lost at -10 dB SNR, where RAP-MUSIC had an advantage (Figure 6e,f).
3) Experiment 3: Robustness of DeepMEG to forward model errors: Here we assumed that the actual MEG forward model differs from the ideal forward model that was used for building the training set of the DeepMEG solution. The actual model was defined as Ã(P) = A(P) + ∆A(P), where ∆A(P) denotes the model error, which can be arbitrary. In this experiment we sampled the model error from a normal distribution with zero mean and variance proportional to the Frobenius norm of the ideal forward model A(P). Figure 7 presents the results with model error variance equal to 5%, 10%, and 20% of the Frobenius norm of A(P), compared to the performance without model error. The DeepMEG-CNN had robust localization accuracy in all cases.
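The perturbation used here can be sketched as follows; the per-entry normalization of the noise scale is our assumption about how "variance proportional to the Frobenius norm" is realized, and the mixing matrix values are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(6)

A = rng.standard_normal((306, 2))   # stand-in for the ideal mixing matrix A(P)

def perturb_forward(A, frac, rng):
    """Actual model A~ = A + dA, with dA ~ N(0, sigma^2) per entry and sigma chosen
    so the expected Frobenius norm of dA is `frac` of ||A||_F (our assumption)."""
    sigma = frac * np.linalg.norm(A) / np.sqrt(A.size)
    return A + sigma * rng.standard_normal(A.shape)

A_tilde = perturb_forward(A, frac=0.10, rng=rng)   # 10% model error
rel_err = np.linalg.norm(A_tilde - A) / np.linalg.norm(A)
print(rel_err)   # close to 0.10
```

Evaluating a trained model on data generated with A~ while it was trained on A reproduces the mismatch scenario of this experiment.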

C. Real-Time Source Localization
An important advantage of the DL approaches is that they have significantly reduced computational time, paving the way to real-time MEG source localization solutions. We conducted

VI. CONCLUSIONS
Fast and accurate solutions to MEG source localization are crucial for real-time brain imaging, and hold the potential to enable novel applications in neurorehabilitation and BMI. Current scanning solutions have low resolution and limit the number of dipoles and temporal rate of source localization due to high computational demands. In this article, we reviewed existing MEG source localization solutions and fundamental DL tools. Motivated by the recent success of DL in a growing number of inverse imaging problems, we proposed two DL architectures for the solution of the MEG inverse problem, the DeepMEG-MLP for single time point localization, and the DeepMEG-CNN for multiple time point localization.
We compared the performance of DeepMEG against the popular RAP-MUSIC localization algorithm and showed improvements in localization accuracy in a range of scenarios with variable SNR levels, inter-source correlation values, and numbers of sources. Importantly, the DeepMEG inference was computed in less than a millisecond and was thus orders of magnitude faster than RAP-MUSIC. Fast computation was possible due to the high optimization of modern DL tools, and even allows the rapid estimation of dipoles at near the 1 kHz sampling rate of existing MEG devices. This could facilitate the search for optimal indices of brain activity in neurofeedback and BMI tasks.
A key property of the DeepMEG model was its robustness to forward model errors. The localization performance of the model remained stable even when distortions were up to 20% of the Frobenius norm of the lead field matrix. This is critical for real-time applications where the forward matrix is not precisely known, or where movement of the subject introduces time-varying inaccuracies.
While the DeepMEG-MLP and DeepMEG-CNN architectures yielded promising localization results, future work is needed to explore different architectures, regularizations, loss functions, and other DL parameters that may further improve MEG source localization. This is particularly important in MEG and EEG source localization where DL tools have not yet been developed, with the exception of a hybrid dSPM-LSTM solution [62].