1. Introduction
The nervous system transmits signals and performs computation through the spiking activity of neurons. To understand how information is encoded at the level of large populations, it is essential to characterize the spatiotemporal “recipes” that emerge from interactions among many neurons.
Even without explicit cognitive tasks, the brain exhibits spontaneous activity. Across development, accumulated experience and internal constraints shape this ongoing activity, embedding a repertoire of latent patterns that can be rapidly recruited when external inputs arrive [1,2]. Yet precisely because spontaneous activity is rich and not time-locked to an external event, it has often been treated as “noise,” and its moment-to-moment evolution remains challenging to predict and compare across settings [1,3]. In this sense, spontaneous activity may carry a compressed trace of an individual’s past experiences, while remaining difficult to model because of its complexity.
Population dynamics are not arbitrary: they are constrained by macroscale and mesoscale wiring architecture. Large-scale anatomical and functional connectomics has sharpened the closely linked but nontrivial relationship between structure and activity in the cortex [4,5,6,7]. Complementary network-level analyses further quantify how structure–function coupling varies across space and time, reinforcing the view that spontaneous dynamics are shaped by an anatomical scaffold [6,7]. Moreover, effective interactions can sometimes be inferred from activity time series, implying that partially predictable dynamical motifs may be embedded in spontaneous fluctuations [8,9].
From a translational viewpoint, a central challenge is whether dynamical regularities learned under one recording preparation generalize to another. Compared with within-preparation settings emphasized in prior neural sequence modeling, transfer between in vitro and in vivo confronts a larger, compound shift: brain state and non-stationarity differ, noise and artifact structure (including spike-sorting biases) change, firing rates and sparsity regimes shift, and there is typically no neuron-to-neuron correspondence or paired trials to anchor the mapping. Nonetheless, in vitro preparations allow controlled manipulation and stable recording conditions, whereas in vivo recordings capture dynamics in an intact, behaving organism; quantifying links between these regimes could provide a reusable bridge for downstream analyses and inform coordinated experimental design [10,11].
Recent sequence models now capture population spiking at scale, suggesting that transferable dynamical motifs exist even without explicit stimuli [12]. Transformer-based neural-data models (e.g., NDT, STNDT, NDT2) further enable efficient parallel modeling of binned spike trains, and are often evaluated via reconstruction, forecasting, or decoding within a shared recording context (same preparation, subject/session, or closely related tasks) [13,14,15].
Cross-preparation transfer, however, remains comparatively under-quantified in a one-step-ahead, 1 ms binned prediction/generation setting, despite its potential to generalize beyond a single dataset or laboratory.
We summarize our contributions as follows:
We formulate and benchmark a bidirectional, one-step-ahead, 1 ms binned transfer task between in vitro and in vivo population spike trains, and we describe a standardized evaluation procedure for cross-preparation generation.
We present one concrete implementation using an autoregressive transformer for sparse binary events and compare Dice loss with Binary Focal Cross-Entropy (γ = 2.0) under extreme class imbalance.
We report both ROC- and PR-based metrics (ROC-AUC, Precision–Recall curves, and PR-AUC) for prediction/generation under extreme sparsity. Detailed hyperparameter sweeps and compute–accuracy trade-offs are reported in the Supplementary Materials.
We clarify data splitting/independence and the 128-neuron standardization procedure, and we provide Supplementary Materials for reproducibility and additional analyses.
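As a minimal sketch of the 1 ms binning and one-step-ahead windowing used throughout this task formulation (the helper names are ours and hypothetical; this is not the exact preprocessing pipeline of the study):

```python
import numpy as np

def bin_spikes(spike_times_s, n_neurons, duration_s, bin_ms=1.0):
    """Bin per-neuron spike times (in seconds) into a binary (T, N) matrix."""
    n_bins = int(round(duration_s * 1000.0 / bin_ms))
    x = np.zeros((n_bins, n_neurons), dtype=np.uint8)
    for i, times in enumerate(spike_times_s):
        idx = np.floor(np.asarray(times) * 1000.0 / bin_ms).astype(int)
        idx = idx[(idx >= 0) & (idx < n_bins)]
        x[idx, i] = 1  # binary: at most one event per 1 ms bin
    return x

def one_step_pairs(x, context):
    """Sliding windows: predict bin t from the preceding `context` bins."""
    inputs = np.stack([x[t - context:t] for t in range(context, len(x))])
    targets = x[context:]
    return inputs, targets
```

With a 1 ms bin width, each row of the matrix is a population vector of 0s and 1s, and the one-step-ahead task is to predict row t from rows t − context through t − 1.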
Figure 1 provides an overview of the datasets utilized in this study.
Figure 1a depicts the setup of an in vitro electrophysiological experiment. An acute brain slice is placed on an electrode array, perfused with artificial cerebrospinal fluid (ACSF), and continuously bubbled with oxygen-enriched gas while neural activity is recorded.
Figure 1b illustrates the setup for an in vivo electrophysiological experiment. In this case, electrodes are inserted into targeted brain regions of a living mouse to measure neural activity.
The recorded brain regions and the abbreviations used in this study are summarized in Table 1.
The in vitro data were obtained by the Shimono Lab, with recordings conducted from multiple regions of the left cerebral hemisphere. In contrast, the in vivo data were collected by the Dora Angelaki, Thomas Mrsic-Flogel, and Sonja Hofer laboratories, covering a wide range of brain regions, including the motor cortex, visual cortex, hippocampus, and amygdala.
Notably, for both the in vitro and in vivo collections, each dataset was recorded from a different mouse, so no individual animal contributes data to more than one dataset, ensuring independence across datasets.
3. Results
3.1. Evaluation and Comparison of Loss Functions During Training
In this study, we trained our model using in vitro data measured from slices of seven cortical regions in the left hemisphere, as well as six in vivo datasets recorded from either the cortex or hippocampus of the left hemisphere. To optimize the learning process, we employed Dice loss as the loss function. Across all training datasets, the area under the ROC curve (ROC-AUC) reached 0.92 ± 0.06 (Figure 3). Because spike prediction is extremely class-imbalanced, we report ROC-AUC alongside Precision–Recall curves and PR-AUC, and we interpret PR-based metrics as the primary indicators of minority-event (spike) quality under extreme sparsity; ROC-AUC is retained to summarize ranking performance. During training, the true-positive and true-negative rates approached ~90% on average, but these values should be interpreted in the context of the ROC/PR metrics. This level of accuracy was difficult to achieve with loss functions other than Dice loss. Moreover, spike (1) accuracy and PR-AUC also reached high levels, supporting the view that minority-event (spike) detection improved beyond what ROC-AUC alone indicates.
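For reference, both metrics can be computed from first principles. The following is a minimal sketch (function names are ours; production analyses would typically use a library such as scikit-learn): ROC-AUC via the rank-sum formulation, and PR-AUC as average precision.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random spike bin outscores a random silent bin
    (rank-sum formulation; ties receive their average rank)."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    order = s.argsort()
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    for v in np.unique(s):          # average ranks over tied scores
        m = s == v
        ranks[m] = ranks[m].mean()
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pr_auc(y_true, scores):
    """Average precision: mean of precision at each recovered spike."""
    y = np.asarray(y_true)[np.argsort(scores)[::-1]]
    hits = np.cumsum(y)
    prec = hits / np.arange(1, len(y) + 1)
    return prec[y == 1].mean()
```

Under extreme sparsity, ROC-AUC can remain high even when most flagged bins are false alarms, which is why the PR-based numbers are treated as primary here.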
Neural spike data present a unique challenge due to the extremely low frequency of the “1” state (spiking events). Consequently, predicting the occurrence of these essential “1” states is highly difficult. By employing Dice loss as the error function, we effectively corrected the imbalance between the occurrence frequencies of 0 and 1, demonstrating its significant utility in this context (Figure 3a–c).
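A minimal sketch of the two objectives compared in this section, assuming predicted probabilities p and binary targets y (the smoothing constant eps and the function names are our choices, not necessarily those of the actual implementation):

```python
import numpy as np

def dice_loss(p, y, eps=1.0):
    """Soft Dice loss: 1 - 2|P∩Y| / (|P| + |Y|). Because it is
    overlap-based, the abundant zeros do not dominate the objective."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    inter = (p * y).sum()
    return 1.0 - (2.0 * inter + eps) / (p.sum() + y.sum() + eps)

def focal_bce(p, y, gamma=2.0, eps=1e-7):
    """Binary focal cross-entropy (the γ = 2.0 baseline): down-weights
    well-classified bins by the factor (1 - p_t)^γ."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    pt = np.where(y == 1, p, 1 - p)
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))
```

Intuitively, Dice loss only improves when predicted mass overlaps actual spikes, whereas a per-bin cross-entropy (even with focal re-weighting) still averages over the overwhelming majority of silent bins.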
To further assess the effectiveness of Dice loss, we compared its performance with the commonly used Binary Focal Cross-Entropy loss (Figure 3), quantitatively evaluating predictive performance using the area under the ROC curve. With Binary Focal Cross-Entropy loss, statistical significance was obtained only in self-prediction tasks using in vitro data. When Dice loss was applied, however, the model achieved AUC values exceeding 0.9, demonstrating high prediction accuracy not only for in vitro data but also for in vivo data (Figure 3e,f).
These results indicate that Dice loss effectively overcomes the challenge posed by the frequency imbalance between 0 and 1, significantly improving learning performance and achieving high-precision training outcomes.
Taken together, the results for spike (1) accuracy and PR-AUC indicate that Dice loss improves minority-event (spike) detection beyond what ROC-AUC alone can capture.
3.2. Results of Generation
Our computational experiments demonstrated that training with Dice loss improved within-domain and cross-domain prediction/generation. We also include (i) training/validation loss curves for each loss function, (ii) depth/compute ablations, (iii) hyperparameter sensitivity analyses, and (iv) an explicit data-split table confirming the independence of animals/slices for in vivo to in vitro transfer. Representative time-series examples are shown in Figure 4c,d.
Figure 1e,f and Figure 3a illustrate progressive learning and strong generalization across datasets. Even in the lowest-scoring case, the ROC-AUC for in vitro and in vivo data generation was 0.70 ± 0.09, as summarized in Table 2 and Figure 4b.
To analyze these results more comprehensively, we visualized the ROC-AUC scores for all combinations as a color map (Figure 4e).
Figure 4f depicts the ROC-AUC color-map matrix as a network diagram, where the inverse of the connection strength is treated as a distance and the positions are optimized in three-dimensional space. From these visualizations (Figure 4e,f), we identified five key findings:
First, in vitro to in vitro generation showed favorable performance, with ROC-AUC scores around 0.93 between identical regions. While the magnitude of this performance was unexpected, relatively strong performance for this combination was anticipated.
Second, interestingly, in vivo to in vivo generation did not demonstrate particularly superior performance between identical regions. We did not find evidence for greater non-stationarity in the in vivo data based on an Augmented Dickey–Fuller test (p = 0.0083, rejecting the unit-root null); however, this result depends on our preprocessing choices and the time windows analyzed [27]. Our findings suggest that in vivo data exhibit stronger inter-regional influences compared to in vitro data, leading to variations in spike patterns.
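As an illustration of the stationarity check, a simplified Dickey–Fuller statistic (without the lag augmentation of the full ADF test reported above; a sketch, not the analysis code used here) regresses the first difference on the lagged level and examines the t-statistic of the slope:

```python
import numpy as np

def df_tstat(x):
    """t-statistic of β in Δx_t = α + β·x_{t-1} + ε (no lag augmentation);
    large negative values reject the unit-root (non-stationarity) null."""
    x = np.asarray(x, dtype=float)
    y, z = np.diff(x), x[:-1]
    X = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)       # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)       # OLS coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])
```

A stationary series (e.g., white noise) yields a strongly negative statistic, whereas a random walk yields one near zero; the full ADF test additionally includes lagged-difference terms and uses non-standard critical values.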
Third, in vivo to in vitro generation outperformed in vitro to in vivo generation. One possible interpretation is that the in vivo recordings span a broader range of brain-wide state changes extending beyond the recorded region, together with stronger nonstationary dynamics, whereas the in vitro recordings reflect a more constrained subset of activity, making the mapping from in vivo to in vitro relatively easier in our setting. At the same time, this asymmetry could also reflect differences in recorded brain regions, laboratory pipelines, ages, and noise structure rather than a simple complexity reduction alone. We therefore treat this result as an important empirical asymmetry and avoid overinterpreting it as a definitive biological hierarchy.
Fourth, the lateral preoptic area (LPO) data showed relatively strong cross-region predictability within our sample. This finding will be extensively discussed in the Discussion Section. In contrast to the lateral preoptic area, the cerebellum was less effective as a seed and more readily generated from other data in our sample; this may reflect simpler patterns in these recordings, but broader data would be needed to generalize.
Fifth, we observed that in vitro data tend to cluster together with other in vitro data. At the same time, regions related to the cortical motor area, regardless of whether they are in vitro or in vivo, are concentrated in the central part of the map. These characteristics support the idea that the ROC-AUC-based mapping meaningfully arranges the diversity of activity. Further insights can be gained by comparing this with Figure 1c,d.
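The network layout underlying these observations, in which inverse connection strength is treated as a distance, can be approximated with classical multidimensional scaling on the inverse-AUC dissimilarities. This is a sketch of the idea under that assumption, not the exact 3D position optimization used for the figure:

```python
import numpy as np

def mds_embed(auc, dim=3):
    """Classical MDS: treat 1/AUC as a dissimilarity, double-center the
    squared distances, and take the top eigenvectors as coordinates."""
    d = 1.0 / np.asarray(auc, dtype=float)
    d = 0.5 * (d + d.T)              # symmetrize the pairwise matrix
    np.fill_diagonal(d, 0.0)
    n = len(d)
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j      # Gram matrix recovered from distances
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:dim]
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0, None))
```

Datasets that generate each other well (high AUC, hence small 1/AUC) land close together, which is the behavior the clustering observations above rely on.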
3.3. Analysis of Information Learned by the Model
To gain insight into the predictive mechanisms of the transformer model, we analyzed the internal processing of information within the model. To maintain focus and avoid unnecessary complexity, we limited this analysis to the case of in vitro to in vivo predictions.
In translation tasks using language models, source–target alignment can be reflected in attention mechanisms [28], and analyses of multi-head self-attention have further characterized the specialized roles of individual heads [29]. Therefore, we began our analysis by examining the attention map (see the “attention map” subsection in Methods Section 2.2.3). Since our study deals with binary sequences (0s and 1s), we analyzed the relationship between the firing rate of the output signals and the weighted attention map. Specifically, to investigate how past information influences predictions, we examined how the average weight changes relative to the diagonal components of the attention map, which indicate time shifts from the present moment (Figure 5a). The results revealed a clear peak along the diagonal, suggesting that the model relies heavily on data from immediately preceding time points for its predictions. However, off-diagonal components were also observed, indicating that the model may assign supplementary attention to specific past moments.
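The lag profile described here can be sketched as follows, assuming a (T × T) causal attention matrix attn (the helper name is ours): profile[k] is the mean attention weight a query at time t assigns to the key at time t − k.

```python
import numpy as np

def lag_profile(attn, max_lag):
    """Mean attention weight as a function of time shift from the diagonal:
    profile[k] averages attn[t, t-k] over all valid query times t."""
    return np.array([np.diagonal(attn, offset=-k).mean()
                     for k in range(max_lag + 1)])
```

A peak at lag 0 (the diagonal) corresponds to the observation above that predictions rely most heavily on the immediately preceding time points.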
Next, we compared the attention map with the firing rates of input signals used during training. The results showed that the attention map correlated significantly with the query-side firing rate of the training data, whereas no significant correlation was found with the key-side firing rate (Figure 5b,c). To assess the critical role of the attention mechanism in this learning process, we analyzed performance changes when the attention mechanism was disabled. However, the results indicated that learning performance did not significantly deteriorate (Figure 5g, left two bars).
Given this outcome, we expanded our analysis beyond the attention map and introduced an importance measure based on gradient information, referred to as attention-weighted importance (see the “Gradient-based Importance and Attention-weighted Importance” Section 2.2.4 in Methods). Specifically, we compared attention-weighted importance with the firing rate of input signals during training. The results showed that while attention-weighted importance exhibited a significant positive correlation with the query-side training data, it demonstrated a significant negative correlation with the key-side training data. Furthermore, generated data showed no significant correlation with either the query or key side (Figure 5e,f).
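Since the exact formulation lives in Methods Section 2.2.4, the following is only one plausible reading of attention-weighted importance, offered as a sketch: attention rows are scaled by a per-query-time gradient magnitude, and the row (query) and column (key) means of the resulting map are then correlated with the firing rate. All function names and the synthetic check are hypothetical.

```python
import numpy as np

def attention_weighted_importance(attn, grad):
    """Weight each attention entry by the gradient magnitude of the loss
    w.r.t. the corresponding query-time input (hypothetical formulation)."""
    g = np.linalg.norm(grad, axis=-1)   # (T,) per-timestep gradient magnitude
    return attn * g[:, None]            # scale the query-side rows

def axis_correlations(awi, rate):
    """Pearson correlation of the query-axis (row) and key-axis (column)
    means of the importance map with the per-bin firing rate."""
    q = awi.mean(axis=1)
    k = awi.mean(axis=0)
    r = lambda a, b: np.corrcoef(a, b)[0, 1]
    return r(q, rate), r(k, rate)
```

Under this reading, the sign pattern reported above (query-side positive, key-side negative) would reflect how the gradient weighting redistributes attention relative to raw firing rates.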
In attention-weighted importance, the key axis (columns) primarily carries information about high-firing regions of the input signals, and the attention weights are determined by how the query side references this information. The query axis (rows), in contrast, is influenced by importance-based weighting, and the distribution of attention is readjusted by the effect of the loss function. Whereas the model previously focused primarily on high-firing regions, it was adjusted to also attend to low-firing regions (see Section 2).
In fact, changing the loss function to Dice loss significantly improved the model’s predictive performance (Figure 3e,f). Within attention-weighted importance, the query side prioritized referencing distinctive information from high-firing regions of the input signals during training, primarily utilizing current and immediately preceding information. Under the Dice loss objective, however, the query side’s tendency to reference key-side information changed, and it appeared to be adjusted so as to also direct attention to distinctive features in past low-firing regions. As a result, the interaction between the priority derived from the input signals during training and the adjustment imposed by the loss function may have contributed to the improvement in predictive performance.
Finally, we investigated the extent to which prediction performance deteriorated when the input data used for prediction were shuffled across different cells within the same time window (Figure 5g, right two bars). When shuffling was applied across cells, the model’s performance gradually declined as the window size extended further into the past. However, even when 95% of the data were shuffled, the model retained a significant level of predictive accuracy (Figure 5g, rightmost bar). These results demonstrate that the model can still make reliable predictions even when the input data available for prediction are very limited.
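The cross-cell shuffle control can be sketched as follows (the function name and signature are ours). Note that permuting identities within each time bin preserves the per-bin population spike count, so the control destroys cell-specific structure while leaving the aggregate rate intact:

```python
import numpy as np

def shuffle_across_cells(x, fraction, rng):
    """Within each time bin, randomly permute the identities of a given
    fraction of neurons in the binary (T, N) spike matrix."""
    x = x.copy()
    n = x.shape[1]
    k = int(round(fraction * n))
    for t in range(x.shape[0]):
        cells = rng.choice(n, size=k, replace=False)
        x[t, cells] = x[t, rng.permutation(cells)]  # identity shuffle only
    return x
```

Because only which neuron fired is scrambled, residual predictive accuracy under heavy shuffling suggests the model also exploits population-level temporal structure rather than cell identity alone.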
4. Discussion
The primary contribution of this paper is algorithmic: predictive generation of spike trains under preparation/domain shift. As a practical implication, we highlight the value of this framework for in vitro to in vivo transfer, where bridging across preparations remains challenging in neuroscience. We also emphasize that robust, sparsity-aware evaluation is essential for downstream applications, and we therefore report ROC-based metrics alongside Precision–Recall curves and PR-AUC. The present results should be interpreted as evidence that useful temporal structure can be modeled under domain shift, not as direct proof of preserved causal biological dynamics. Whether the framework also reflects features such as excitation/inhibition balance or oscillatory structure remains an important question for future validation.
Here, we discuss the key findings of this study, categorized into technical advancements in methodology and neuroscientific insights.
4.1. Technical Advancements: The Role of Loss Function and Transformer Model
As mentioned in Section 2.2.2, the primary model used a single transformer encoder layer. This design was chosen to retain a minimal-compute baseline and to reduce the risk of over-parameterization given the limited training data and the extreme sparsity of spikes. In the validation depth ablation (1–4 layers) under matched parameter budgets, the best depth varied by dataset, and we did not observe a monotonic improvement with depth; depth = 1 remained competitive while deeper models increased compute. We therefore keep depth = 1 as the primary setting and present the full depth sweep in the Supplementary Materials. We chose this discrete-time autoregressive transformer as a reproducible baseline for evaluating spike-event prediction under extreme sparsity. We also explored several alternative models under our computational constraints, but none outperformed the present framework in our setting. Preliminary continuous-time trials in our environment were substantially more memory-intensive, and extension toward continuous-time modeling remains an important future direction.
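A minimal sketch of such a single-layer causal predictor (untrained, random weights, single head; the actual model adds training, the Dice objective, and other details described in Methods) illustrates the core constraint: the causal mask guarantees that the probability emitted for bin t depends only on bins up to t.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyCausalPredictor:
    """Single-layer, single-head causal self-attention over binned spikes;
    outputs per-neuron spike probabilities for the next 1 ms bin."""
    def __init__(self, n_neurons, d_model=32, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.w_in = rng.normal(0, s, (n_neurons, d_model))
        self.w_q = rng.normal(0, s, (d_model, d_model))
        self.w_k = rng.normal(0, s, (d_model, d_model))
        self.w_v = rng.normal(0, s, (d_model, d_model))
        self.w_out = rng.normal(0, s, (d_model, n_neurons))

    def forward(self, x):                       # x: (T, n_neurons) binary
        h = x @ self.w_in
        q, k, v = h @ self.w_q, h @ self.w_k, h @ self.w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores[mask] = -np.inf                  # causal: no access to the future
        a = softmax(scores)
        return 1.0 / (1.0 + np.exp(-(a @ v) @ self.w_out))  # probabilities
```

The sigmoid output head (one probability per neuron per bin) is what the Dice and focal objectives discussed above are applied to.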
To our knowledge, this is the first study to combine these two elements specifically for the generation of neuronal spike trains. Despite the simplicity of the idea, no prior work has recognized the synergy between the transformer architecture and Dice loss in addressing the extreme class imbalance inherent to neural spike data. As shown in Figure 3e,f, this combination dramatically enhances predictive performance, outperforming the specific baselines we tested (Binary Focal Cross-Entropy and focal loss) on these datasets.
It enables not only accurate within-modality generation (in vitro to in vitro, in vivo to in vivo) but also robust, bidirectional cross-domain generation between in vitro and in vivo conditions.
To understand the mechanisms underlying this high-accuracy generation, we first analyzed the core component of the transformer model: the attention map. Our analysis revealed that the diagonal components of the attention map contributed significantly to model performance and that the query-side axis of the attention map exhibited a positive correlation with the firing rate of the input signals during training. To obtain a more comprehensive interpretation of the model’s behavior beyond the attention map, we also introduced a gradient-based importance measure, weighting the attention map with importance scores, referred to as attention-weighted importance. A comparison between attention-weighted importance and the firing rate of input signals revealed that while the query axis showed a significant positive correlation, the key axis exhibited a significant negative correlation.
Importance strongly reflects gradient information and is highly sensitive to the choice of loss function. This suggests that the performance improvement associated with loss function selection primarily influences the region between the input layer and the self-attention mechanism in the transformer model. This adjustment enables the model to incorporate long-term historical information, facilitating more refined learning rather than relying solely on firing rates.
While the results suggested a limited role for the attention map in predictive accuracy, it is important to acknowledge its potential contributions to training stability and learning speed.
4.2. Neuroscientific Insights: Distinctive Brain Regions
Despite achieving high-accuracy predictions and generation, certain data exhibited particularly noteworthy characteristics.
The first notable finding concerns the meaningful multi-region mapping shown in Figure 4f. This mapping clearly captures the expected spatial relationships among data points in multiple respects. For instance, the two data points measured from the secondary motor cortex are closely aligned and surrounded by data points associated with the motor cortex in both the in vitro and in vivo data. Additionally, the in vivo and in vitro data are spatially separated into two distinct clusters on the left and right. These semantically meaningful embeddings represent the relative relationships between data points, suggesting how activity transitions from one data point to another. In this study, we demonstrate cross-generation from brief spontaneous activity in both traditional in vitro data and the in vivo data provided by the International Brain Laboratory. Choosing the optimal embedding dimensionality is always a challenging problem. If one spatially maps neural-activity similarity by directly comparing it to actual spatial distances, both in vitro and in vivo points would naturally lie in a three-dimensional space. Hence, embedding the two datasets (in vitro and in vivo) in three to four dimensions is reasonably justified for their comparison in this work. However, should the number of datasets under comparison grow to three, four, five, or more, new justifications will be required to determine whether a three-dimensional visualization remains appropriate.
The second region of interest is the lateral preoptic area (LPO) in vivo. This region appeared to be among the stronger seeds for predicting other brain regions in our dataset. Note that the abbreviation “LPOR” used in Figure 4 and Table 1 refers to the left postrhinal area, a different region; the abbreviation for the lateral preoptic area in this context is “LLatPreopt”.
The LPO, a hypothalamic nucleus, is one of the most extensively connected subdivisions of the hypothalamus: in mice, it projects to and receives input from over 200 gray-matter regions, with intra-hypothalamic connections being especially prominent [30]. Among its major outputs are the lateral habenula (LHb), septal nuclei, ventral tegmental area (VTA), dorsal raphe nucleus, and rostromedial tegmental nucleus (RMTg) [30].
The LPO is involved in both reward-related processing and the regulation of sleep–wake states [31,32]. Since sleep and wakefulness are fundamental behavioral states that entail wide-ranging shifts in brain function, the LPO’s connectivity to arousal and motivational systems is likely to be important. Recent work has also shown functional coupling between the LPO and the reward system: stimulation of the LPO suppresses GABAergic neurons in the VTA while increasing the firing rate of dopaminergic neurons [31].
Taken together, these observations suggest that the LPO is well positioned to influence broad brain-state variables such as arousal, sleep–wake regulation, and reward-related signaling [30,31,32]. This provides one plausible interpretation for why the LPO emerged as a useful seed in our data-driven analysis, although the present study does not establish a specific mechanistic pathway.
4.3. Future Challenges: Expanding the Range of Applications
Based on these findings, two efficient strategies can be proposed.
First, the “proximity map” based on relative similarities between datasets, expressed as a network diagram in Figure 4f, is very important. The proximity map encodes relative relationships among datasets; when a measurement is sparse or unavailable, the framework can propose concrete substitutes: for example, selecting nearby datasets, mixing closely related datasets, or using the generative model to translate between conditions along the map’s geometry. These are approximations of missing conditions rather than de novo creation, and they require task-specific validation.
From the standpoint of the 3Rs, this capability is a foundational technology that can reduce redundant experiments when existing results have been independently reproduced, while also helping to prioritize follow-up studies. That said, decisions to forgo new experiments should be based on predefined criteria and ethical review, and this method can be regarded as a technology that provides evidence for such deliberations. In the future, as the number of nodes (datasets) in the network increases and network density grows within the informatics framework, the accuracy of generating non-existent data should steadily improve.
Second, prioritizing LPO measurements before expanding to other brain regions may enhance predictive accuracy and experimental efficiency, because the lateral preoptic area data provide good seeds for generating neural activity across many regions. This strategy aligns with the 3R principles (Replacement, Reduction, Refinement) in animal research and could improve efficiency in human neurophysiological studies. A complete explanation of why this region’s neural activity can serve as such a versatile seed for learning data (independent of the aforementioned proximity) remains unclear. As our understanding deepens, generation without requiring target data is expected to become increasingly feasible.
In the future, when considering the contribution of the LPO, some researchers in the life sciences may envision experiments involving optogenetic stimulation of this region. However, what is truly essential is to elucidate the “codes” utilized in the process of generating neural activity. In other words, it is crucial to uncover the information acquired by artificial neural networks through learning. To deepen our understanding of such phenomena, we expanded the interpretational scope of the transformer’s internal structure from attention maps to attention-weighted importance. Future challenges include further expanding this analysis and extracting and conducting detailed analysis of features from attention maps and importance that contribute to prediction generation.
In addition, it is important to improve methods for enhanced accuracy. A simple improvement would be adding positional encoding to the transformer model. As the computational method itself is scalable, expanding computational resources, such as computer memory, to increase the analyzable number of cells and the recording duration is also an important direction. While we performed mutual generation based on spontaneous activity, extending this to generate in vivo brain activity during stimulus presentation is another crucial direction, which could naturally be pursued by inputting in vivo spontaneous activity and stimulus information into a multi-modal AI model.