1. Introduction
Conventional Boolean networks are celebrated for interpretability and finite-state dynamics (attractors), but they are usually fixed rule systems. In contrast, artificial neural networks (ANNs) learn end-to-end via gradient descent. This work asks a simple question with practical ramifications:
Can a Probabilistic Boolean Network be trained to learn like an ANN while retaining logic-level interpretability?
Two challenges prevent a straightforward “yes.” First, structural choices in a PBN, like selecting which predictors to read and which logical operator to apply, are discrete and non-differentiable. Second, continuous targets (regression, policy values) require mapping Boolean computations to the reals.
Our answer is the Learning Probabilistic Boolean Network (LPBN): each Boolean unit stochastically selects two input bits and a logical operator (AND/OR/XOR/NAND) from categorical distributions whose logits are trainable parameters. The Boolean outputs feed simple differentiable heads (linear/softmax/policy). We optimize the heads with ordinary gradients and the structure with REINFORCE, using an exponential-moving-average baseline to reduce variance.
This work addresses a key limitation of traditional PBNs: the difficulty of optimizing their non-differentiable structure in data-driven settings. Our objective is twofold: (1) to introduce a differentiable framework that enables learning over PBN structures using policy-gradient methods, thereby improving their trainability, and (2) to extend the utility of PBNs beyond biological modeling into general-purpose interpretable learning systems. The proposed LPBN combines the logical interpretability of Boolean networks with the optimization capabilities of deep learning, offering a principled approach to structure learning in stochastic logic-based models.
This paper presents the following contributions:
A precise mathematical formulation for Learning Probabilistic Boolean Networks.
Unbiased structural gradients with practical variance reduction.
A universal approximation result over discretized inputs and discussion of approximation bias/variance.
A multi-paradigm evaluation vs. ANNs across classification, regression, clustering-via-representation, and RL.
Ablation studies (bins, operators, width) and interpretability analyses via extracted rules.
2. Related Work
2.1. Boolean and Probabilistic Boolean Networks
Kauffman’s Boolean networks (BNs) [
1] established discrete, rule-based models of gene regulation whose long-run dynamics concentrate on attractor sets (limit cycles and fixed points) that have been linked to cellular “phenotypes”. Probabilistic Boolean Networks (PBNs) generalize BNs by assigning a distribution over predictor functions per node; the induced stochastic switching makes the network a finite-state Markov chain whose stationary measure allocates probability mass to attractor basins [
2,
3,
4]. Analytical treatments connect network structure to steady-state behavior, enabling computation of attractor probabilities and residence times [
5]. This paper builds on that line by treating the selector distributions in a PBN as learnable objects, optimized from task loss—classification error, regression risk, or return—rather than being fixed from prior biological knowledge.
2.2. Intervention and Control in PBNs
Because PBNs define a Markov chain, optimal interventions [
4,
6,
7] can be framed as Markov decision/control [
8] problems over the state space or reduced abstractions. Early work examined structural interventions—altering predictor sets or their selection probabilities—to shift stationary behavior [
4]. Subsequent methods considered context-sensitive PBNs and derived intervention policies that minimize the long-run probability of undesirable states [
6] as well as state space reduction to make intervention tractable [
7]. Optimal-control formulations using dynamic programming further grounded the design of intervention policies [
8]. Compared to these control-theoretic approaches, our learning-centric treatment uses policy-gradient style updates to fit selector distributions directly to supervised and reinforcement objectives, while preserving the interpretability of discrete predictors and attractor structure.
2.3. Learning with Discrete Choices: Gradient Estimators and Relaxations
Training models with discrete latent decisions is classically approached with score-function estimators (REINFORCE; [
9]), often stabilized by baselines/advantages. Modern variance-reduction techniques, REBAR [
10] and RELAX [
11], and continuous relaxations, Gumbel-Softmax/Concrete [
12,
13], enable lower-variance or differentiable surrogates for categorical sampling. Straight-through estimators provide a pragmatic alternative when exact gradients are unavailable [
14]. Our LPBN training objective instantiates this toolbox directly: selector logits for (i) which input each node reads and (ii) which Boolean operator it applies are updated with advantage-weighted score-function gradients; optional temperature-controlled relaxations can replace hard sampling for end-to-end differentiability.
2.4. Neuro-Symbolic and Logical Learners
Several propositional/relational learners connect logic with gradient-based training. The Tsetlin Machine learns propositional clauses with bandit dynamics and has shown competitive accuracy on tabular/text tasks while remaining interpretable [
15]. Neural Logic Machines [
16] and Differentiable Proving [
17] explore differentiable reasoning over rules, whereas Logic Tensor Networks [
18,
19] integrate logical constraints within neural architectures. LPBNs are complementary: they maintain strictly Boolean internal computations (AND/OR/XOR/NAND over binarized features) but learn the composition by optimizing selector distributions. This keeps interpretability at the node/logic level while enabling multi-task training akin to neural networks.
2.5. Reinforcement Learning and Deep RL Foundations
Policy-gradient RL provides the theoretical underpinning for optimizing stochastic decision rules with respect to expected return [
20]. Benchmark environments such as CartPole from OpenAI Gym [
21] standardize evaluation of controller learning. In our DRL experiments, LPBNs act as stochastic policies over discrete logical features of a discretized observation, trained with advantage-weighted score-function updates; neural baselines use shallow or two-layer MLP policies trained with the same on-policy objective, isolating the effect of the representation (logical vs. continuous).
2.6. Clustering and Unsupervised Evaluation
For unsupervised structure learning, we follow established external and internal indices. The Adjusted Rand Index (ARI) measures agreement between predicted partitions and ground truth, corrected for chance [
22]. The Silhouette coefficient [
23] and Calinski–Harabasz index [
24] quantify cohesion-vs.-separation without labels. When we compare LPBN-derived features to MLP embeddings under k-means, ARI provides the primary, label-aware criterion, with Silhouette/CH as label-free diagnostics.
2.7. Final Remarks
Beyond classical interventions and attractor control, recent research has addressed several advanced aspects of PBN control. For example, minimum observability problems aim to identify the smallest subset of nodes whose states must be monitored to ensure system controllability [
25]. Robust set stability refers to conditions under which a set of network states remains stable under probabilistic perturbations and intervention noise [
26]. Related to this is the notion of set stabilization, which focuses on driving the system toward a desirable subset of states with bounded error under stochastic dynamics [
27]. These developments are especially relevant in practical applications such as gene therapy or therapeutic reprogramming, where interventions must remain effective under uncertainty. Our work, while not focused on the control problem directly, complements this literature by proposing a data-driven way to learn the probabilistic structure of Boolean dynamics, which may serve as a foundation for future interpretable control policies.
3. Materials and Methods
Given $x \in \mathbb{R}^d$, standardize each of the $d$ dimensions and partition each into $B$ bins. Concatenate the one-hot bin indicators to obtain $q(x) \in \{0,1\}^{dB}$, a Boolean substrate on which the network operates; this one-hot encoding serves as the input substrate analogous to the Boolean state vector of the standard PBN formalism. Equation (1) models the output of a Boolean node $x_i$ in a probabilistic Boolean network:
$$x_i \;=\; \mathbb{E}_{(a,b,o) \sim \pi_i}\big[\, o\big(q_a(x),\, q_b(x)\big) \big]. \qquad (1)$$
In traditional PBNs, each node is governed by a set of candidate Boolean functions, one of which is selected probabilistically at each update. Here, instead of fixed candidate functions, the stochasticity is embedded in the logic structure itself, defined by a distribution over logical triplets $(a, b, o)$. The expectation in Equation (1) thus captures the average output of node $x_i$ over the stochastic structure space, akin to the ensemble behavior of classical PBNs with randomly selected update functions. Discretization trades smoothness for interpretability and lets us compose exact logical operations.
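As a concrete illustration of the discretization step, the following sketch computes equal-width bin edges on the training split and one-hot encodes each feature's bin index. The function names and the equal-width choice over the observed min/max range are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def fit_bins(X_train, B=5):
    """Compute per-feature equal-width bin edges on the training split only."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    # B bins per feature -> B-1 interior edges; shape (d, B-1)
    return np.linspace(lo, hi, B + 1)[1:-1].T

def binarize(X, edges):
    """One-hot encode each feature's bin index, yielding q(x) in {0,1}^(d*B)."""
    n, d = X.shape
    B = edges.shape[1] + 1
    q = np.zeros((n, d * B), dtype=np.int8)
    for j in range(d):
        idx = np.digitize(X[:, j], edges[j])   # bin index in 0..B-1
        q[np.arange(n), j * B + idx] = 1       # exactly one active bit per feature
    return q
```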
3.1. Stochastic Logical Units
The LPBN has $N$ Boolean units. Unit $n$ selects two input positions and one operator,
$$a_n \sim \mathrm{Cat}(\alpha_n), \qquad b_n \sim \mathrm{Cat}(\beta_n), \qquad o_n \sim \mathrm{Cat}(\gamma_n),$$
where $a_n, b_n \in \{1, \dots, dB\}$ are input indexes into the substrate $q(x)$ and $o_n \in \mathcal{O} = \{\mathrm{AND}, \mathrm{OR}, \mathrm{XOR}, \mathrm{NAND}\}$ is a logic operator. Each node $n$ samples its two input connections independently; $\mathrm{Cat}(\alpha_n)$ and $\mathrm{Cat}(\beta_n)$ assign probabilities over the possible source positions in the preceding layer (the binarized substrate in a single-layer LPBN), and $\mathrm{Cat}(\gamma_n)$ is a categorical distribution over the fixed operator set $\mathcal{O}$. The combined structural policy for unit $n$ is
$$\pi_{\theta_n}(a_n, b_n, o_n) \;=\; \mathrm{Cat}(a_n; \alpha_n)\,\mathrm{Cat}(b_n; \beta_n)\,\mathrm{Cat}(o_n; \gamma_n),$$
with $\theta_n = (\alpha_n, \beta_n, \gamma_n)$ the logits of the three categoricals. Given the sampled triple $(a_n, b_n, o_n)$, the unit emits
$$h_n \;=\; o_n\big(q_{a_n}(x),\, q_{b_n}(x)\big) \in \{0,1\},$$
where $q_{a_n}(x)$ and $q_{b_n}(x)$ are the activations of the selected input bits. Collect $h = (h_1, \dots, h_N) \in \{0,1\}^N$.
Each unit learns what to look at (two discretized features) and how to combine them (logic). As training proceeds, the categorical posteriors sharpen, revealing a rule-like structure.
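A minimal sketch of one stochastic logical unit, assuming per-unit logit vectors and the four-operator set above (names are illustrative):

```python
import numpy as np

OPS = {
    0: lambda u, v: u & v,        # AND
    1: lambda u, v: u | v,        # OR
    2: lambda u, v: u ^ v,        # XOR
    3: lambda u, v: 1 - (u & v),  # NAND
}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def unit_forward(q, logits_a, logits_b, logits_o, rng):
    """Sample (a, b, o) from the unit's categorical selectors and emit one bit."""
    a = rng.choice(len(logits_a), p=softmax(logits_a))
    b = rng.choice(len(logits_b), p=softmax(logits_b))
    o = rng.choice(4, p=softmax(logits_o))
    return OPS[o](q[a], q[b]), (a, b, o)
```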
Output Heads
Binary classification: $\hat{y} = \sigma(w^\top h + b)$, cross-entropy loss.
Multiclass: $\hat{y} = \mathrm{softmax}(W h + c)$, negative log-likelihood.
Regression: $\hat{y} = w^\top h + b$ (the raw logits, identity link), mean squared error.
Here $\theta_n$ denotes the set of parameters that govern the structure distributions for node $n$, including the logits of $\mathrm{Cat}(\alpha_n)$, $\mathrm{Cat}(\beta_n)$, and $\mathrm{Cat}(\gamma_n)$.
3.2. Learning Rules
Continuous parameters ($w$, $b$, $W$, $c$) use standard gradients of the task objectives (e.g., SGD or Adam). Structural logits $\theta_n = (\alpha_n, \beta_n, \gamma_n)$ are trained with a score-function estimator. Let $F$ denote the scalar feedback ($F = -\mathcal{L}$ for supervised tasks; $F = R$, the return, for RL) and $\bar{b}$ a baseline. Each Boolean unit computes a binary value from its sampled logic structure and input activations, and the vector of unit outputs is $h \in \{0,1\}^N$. Then
$$\nabla_{\alpha_n}\, \mathbb{E}[F] \;=\; \mathbb{E}\big[(F - \bar{b})\, \nabla_{\alpha_n} \log \mathrm{Cat}(a_n; \alpha_n)\big],$$
and analogous expressions hold for $\beta_n$ and $\gamma_n$. These updates are applied per mini-batch (supervised) or per trajectory (RL).
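A minimal sketch of this update for a single categorical selector, assuming the feedback $F$ and an EMA baseline maintained across batches (learning rate and decay values are illustrative):

```python
import numpy as np

def reinforce_update(logits, sample, F, baseline, lr=0.05, rho=0.95):
    """One score-function step on a single categorical selector.

    logits   : 1-D array of unnormalized log-probabilities
    sample   : index that was drawn from Cat(softmax(logits))
    F        : scalar feedback (negative loss, or return in RL)
    baseline : running EMA of F (the control variate)
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()
    advantage = F - baseline
    # gradient of log Cat(sample; logits) w.r.t. logits is onehot(sample) - p
    grad_logp = -p
    grad_logp[sample] += 1.0
    logits = logits + lr * advantage * grad_logp    # ascend expected feedback
    baseline = rho * baseline + (1.0 - rho) * F     # update the EMA baseline
    return logits, baseline
```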
3.3. Variance Reduction and Stabilization
We model a Learning Probabilistic Boolean Network as a stochastic computation graph over discrete selectors. For an input $q(x) \in \{0,1\}^{dB}$, every node $n$ draws a selector triple $s_n = (a_n, b_n, o_n) \sim \pi_{\theta_n}$, where $a_n, b_n$ are indices of input literals and $o_n$ is a Boolean operator. With the (deterministic) Boolean map $h(x, S)$ and a linear readout $f_w$, we define a task loss $\ell\big(f_w(h(x, S)), y\big)$. The training objective is the selector-marginalized risk
$$J(\theta, w) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\; \mathbb{E}_{S \sim \pi_\theta}\big[\ell\big(f_w(h(x, S)), y\big)\big].$$
3.4. Unbiased Gradient for Discrete Selectors (REINFORCE)
Differentiating $J$ with respect to the selector parameters $\theta$ yields the score-function estimator of Williams (1992) [9]:
$$\nabla_\theta J \;=\; \mathbb{E}_{(x,y)}\; \mathbb{E}_{S \sim \pi_\theta}\big[\ell\big(f_w(h(x, S)), y\big)\, \nabla_\theta \log \pi_\theta(S)\big].$$
Replacing the intractable expectation by Monte Carlo samples $S^{(1)}, \dots, S^{(K)} \sim \pi_\theta$ gives an unbiased gradient estimate, but with high variance.
3.5. Control-Variate Baseline and “Advantage”
Let $F = \ell\big(f_w(h(x, S)), y\big)$ and $g = (F - \bar{b})\, \nabla_\theta \log \pi_\theta(S)$ with any baseline $\bar{b}$ independent of the sampled $S$. Because $\mathbb{E}_{S \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(S)] = 0$, we have $\mathbb{E}[g] = \nabla_\theta J$ (unbiased). This is the standard REINFORCE-with-baseline form used in both the supervised and RL training. For a shared scalar baseline, the variance-minimizing constant is
$$\bar{b}^{*} \;=\; \frac{\mathbb{E}\big[F\, \|\nabla_\theta \log \pi_\theta(S)\|^2\big]}{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(S)\|^2\big]},$$
and it is obtained by differentiating $\mathrm{Var}[g]$ with respect to $\bar{b}$. In practice, we approximate it online by an exponential moving average (EMA), $\bar{b}_t = \rho\, \bar{b}_{t-1} + (1 - \rho)\, F_t$ with a fixed decay $\rho$, which substantially reduces gradient variance without bias.
3.6. Entropy Regularization to Prevent Premature Collapse
LPBN selectors are categorical; early collapse onto a single option kills exploration and yields vanishing gradients on unchosen options. Following maximum-entropy RL, we add an entropy bonus with coefficient $\lambda > 0$,
$$J_\lambda(\theta, w) \;=\; J(\theta, w) \;-\; \lambda \sum_n H\big(\pi_{\theta_n}\big),$$
so that minimizing $J_\lambda$ keeps the selector distributions diffuse early on. Annealing $\lambda$ over training preserves exploration initially and allows confident selection later.
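A sketch of the entropy bonus on one selector, with the gradient of $H$ added to the logits update; the linear annealing schedule shown is illustrative, not the exact schedule used in the experiments.

```python
import numpy as np

def entropy_bonus_grad(logits, lam):
    """Gradient (w.r.t. logits) of lam * H(Cat(softmax(logits)))."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    H = -np.sum(p * np.log(p + 1e-12))
    # dH/dlogits_k = -p_k * (log p_k + H)
    return lam * (-p * (np.log(p + 1e-12) + H)), H

def anneal(lam0, step, total_steps):
    """Linearly anneal the entropy coefficient to zero over the first half of training."""
    return lam0 * max(0.0, 1.0 - 2.0 * step / total_steps)
```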
3.7. Rao-Blackwellization over Independent Factors
When the selectors factorize across nodes, $\pi_\theta(S) = \prod_n \pi_{\theta_n}(s_n)$, replacing $F$ with the conditional expectation $\mathbb{E}[F \mid s_n, x]$ in the update for $\theta_n$ yields a Rao-Blackwellized estimator with provably lower variance: $\mathrm{Var}[\text{R-B}] \le \mathrm{Var}[\text{original}]$ by the law of total variance. In practice, we approximate $\mathbb{E}[F \mid s_n, x]$ with a small learned critic conditioned on node-local context (a one-layer regressor), which is the same control-variate idea used below.
3.8. Continuous Relaxations and Low-Variance Control Variates (REBAR/RELAX)
For categorical selectors, the Gumbel-Max trick and the Concrete/Gumbel-Softmax relaxation can be exploited to form a differentiable surrogate $\tilde{S}_\tau$ with temperature $\tau > 0$. REBAR constructs an unbiased gradient by combining a reparametrized path through the relaxation with a control-variate correction, where $\tilde{F} = \ell\big(f_w(h(x, \tilde{S}_\tau)), y\big)$ is the loss evaluated on the relaxed sample with common random numbers (the same Gumbel noise) to maximize its correlation with $F$. RELAX generalizes this by learning a free-form control variate $c_\phi$ to minimize the variance of the resulting estimator; the final gradient remains unbiased while often achieving order-of-magnitude variance reductions on discrete graphs.
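A sketch of a Concrete/Gumbel-Softmax relaxed sample for one categorical selector; the temperature value and function name are assumptions of this sketch, not values used in the paper.

```python
import numpy as np

def gumbel_softmax_sample(logits, tau, rng):
    """Draw a relaxed one-hot sample from a categorical with the given logits."""
    u = rng.uniform(low=1e-12, high=1.0 - 1e-12, size=logits.shape)
    g = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max()
    soft = np.exp(y) / np.exp(y).sum()      # relaxed (differentiable) sample
    hard = int(np.argmax(soft))             # corresponding hard sample (Gumbel-Max)
    return soft, hard
```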
3.9. Straight-Through Estimators
The straight-through estimator (ST) enables gradient-based optimization over non-differentiable binary or discrete functions by approximating gradients during backpropagation. While the forward pass uses hard thresholding (e.g., sign or step functions), the backward pass substitutes their gradients with continuous surrogates—commonly the identity or clipped derivatives of a smooth approximation. This heuristic stems from early work by Hinton and was formalized by [
14] as a low-variance, computationally efficient proxy. The core justification is that, under certain conditions, the direction of the ST update correlates positively with the true expected gradient, enabling convergence in stochastic optimization despite non-differentiability.
A well-known limitation of the ST is its biased gradient estimation, which arises from ignoring higher-order interactions between discrete variables and the loss surface. This bias can accumulate when binary units saturate (i.e., become nearly deterministic), undermining learning. Entropy regularization directly addresses this by encouraging intermediate activation probabilities, thus keeping the network in a regime where the surrogate gradients remain informative. This approach has proven effective in learning binary representations in variational autoencoders [
12] and binarized neural networks [
28], where entropic penalties or temperature annealing maintain trainability through the ST.
In comparison to alternative estimators—such as REINFORCE or score-function gradients—the ST provides substantially lower variance and faster convergence. REINFORCE, though unbiased, is highly noisy and often infeasible for high-dimensional or deep architectures. Other reparameterization tricks for discrete variables, like Gumbel-Softmax or Concrete distributions, offer more principled solutions, but often suffer from temperature tuning issues, mismatch at inference time, or incompatibility with hard logic operators. In logic-based networks like LPBNs, the deterministic nature of logical gates aligns better with the ST’s forward discreteness and backward differentiability than with continuous relaxations.
Nonetheless, the ST is not universally robust. Its performance degrades when networks collapse to extreme binary probabilities early in training, or when the gradient signal vanishes due to a poor surrogate choice. These issues can be alleviated through techniques such as entropic smoothing, stochastic dropout, or delayed hardening of outputs via temperature scheduling. Given these mitigations, and the increasing evidence of the ST's success in hybrid neuro-symbolic systems [
29], its adoption in PBN training offers a justified balance of tractability and functional fidelity. This makes it a pragmatic and theoretically sound choice for backpropagation in probabilistic logic networks.
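A sketch of the straight-through pattern applied to one binarized node: the forward pass uses a hard threshold, while the backward pass substitutes a smooth surrogate derivative. Since the paper's implementation is pure NumPy without autodiff, the explicit forward/backward pair below is an assumed illustration of the pattern, not the paper's code.

```python
import numpy as np

def st_forward(z):
    """Hard forward pass: threshold the pre-activation to a binary output."""
    p = 1.0 / (1.0 + np.exp(-z))        # soft probability, kept for the backward pass
    h = (p > 0.5).astype(np.float64)    # hard Boolean output used downstream
    return h, p

def st_backward(grad_h, p, clip=1.0):
    """Straight-through surrogate: pass the gradient through, scaled by the
    sigmoid derivative and clipped to keep it informative."""
    surrogate = p * (1.0 - p)
    return np.clip(grad_h * surrogate, -clip, clip)
```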
3.10. Putting Everything Together for a PBN
For supervised learning, use $F = -\ell\big(f_w(h(x, S)), y\big)$. Update $w$ by standard gradients on the task loss; update the selector logits $\theta$ with $(F - \bar{b})\, \nabla_\theta \log \pi_\theta(S)$, where $\bar{b}$ is the EMA baseline, plus the entropy bonus. For a difficult task, replace $F$ with the REBAR/RELAX control-variate forms. For reinforcement learning, let $F$ be the (discounted) return or a GAE-style advantage; identical selector updates apply. This matches the policy-gradient framework of [20], with $\pi_\theta$ the policy over discrete logical features extracted by the LPBN. Under standard Robbins–Monro conditions on step sizes and bounded second moments of $g$, the stochastic approximation on $\theta$ converges to a stationary point of $J_\lambda$. In practice, (i) EMA baselines and Rao–Blackwellization/critics reduce gradient variance; (ii) entropy annealing prevents early mode collapse; and (iii) REBAR/RELAX provide the largest additional gains when $F$ is sparse or highly non-smooth, exactly the regimes where discrete logical choices are appealing.
3.11. Computational Complexity
A forward pass is O(N) Boolean ops + O(N) dot-product for the head. Sampling from m-way and 4-way categoricals per unit is O(m) if naive; with alias tables or top-k constraints, it is near O(1) amortized.
4. Model and Objective
Let $q(x) \in \{0,1\}^{M}$ denote a binarized input obtained from raw $x \in \mathbb{R}^d$ by one-hot quantization or thresholding. An LPBN with $N$ internal nodes consists of selector triples $S = (s_1, \dots, s_N)$, $s_n = (a_n, b_n, o_n)$, drawn from a categorical policy $\pi_\theta(S) = \prod_n \pi_{\theta_n}(s_n)$ with parameters $\theta$. Given $q(x)$, node $n$ outputs a Boolean feature $h_n = o_n\big(q_{a_n}(x), q_{b_n}(x)\big)$, and we collect $h = (h_1, \dots, h_N)$. A linear readout produces a task score $f_w(h) = w^\top h + b$. For supervised learning with examples $(x, y) \sim \mathcal{D}$ and a loss $\ell$, our risk is
$$J(\theta, w) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\; \mathbb{E}_{S \sim \pi_\theta}\big[\ell\big(f_w(h(x, S)), y\big)\big],$$
where $\mathcal{D}$ is the dataset of input vectors and the corresponding targets used to compute the expected loss.
For reinforcement learning, $\pi_\theta$ is the policy over selectors (or selector-parametrized action logits), and $\ell$ is minus the expected/discounted return. Drawing $S$ at each time step yields a PBN, i.e., a finite-state Markov chain on $\{0,1\}^N$ whose kernel is determined by the selector distributions [2,3]. If the selectors are redrawn i.i.d. at each step, the chain is homogeneous; a fixed selector draw recovers a deterministic BN. Steady-state and attractor behavior follow standard Markov chain theory [4,5,6,7,8].
4.1. Expressivity
We will first record the representational capacity of the logical core.
Lemma 1.
(Functional completeness): Because $\{\mathrm{NAND}\}$ is functionally complete, any Boolean function $g: \{0,1\}^M \to \{0,1\}$ can be realized by a finite acyclic network of 2-input NAND gates. Since the operator set $\mathcal{O}$ contains NAND, an LPBN of sufficient width/depth can realize any $g$ on finitely many Boolean literals (including one-hot bins of real features).
Proof.
A standard result from switching logic: a multi-input AND/OR can be implemented by a binary tree of NANDs, and negation is $\lnot u = \mathrm{NAND}(u, u)$. Consequently, for any finite set $X \subset \{0,1\}^M$ and any mapping $g: X \to \mathbb{R}$, there exists an LPBN and a readout $(w, b)$ that matches $g$ on $X$: by Lemma 1, implement an indicator function for each $\xi \in X$ via a conjunction of literals, and let the readout form the weighted sum $\sum_{\xi \in X} g(\xi)\, \mathbb{1}[q(x) = \xi]$. □
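For completeness, the standard NAND constructions invoked in the proof are
$$\lnot u = \mathrm{NAND}(u, u), \qquad u \wedge v = \mathrm{NAND}\big(\mathrm{NAND}(u, v),\, \mathrm{NAND}(u, v)\big), \qquad u \vee v = \mathrm{NAND}\big(\mathrm{NAND}(u, u),\, \mathrm{NAND}(v, v)\big),$$
from which arbitrary fan-in conjunctions and disjunctions follow by binary trees of these gates.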
Proposition 1.
Let $K \subset \mathbb{R}^d$ be compact and $g: K \to \mathbb{R}$ continuous. For any $\varepsilon > 0$ there exist a quantizer $q: K \to \{0,1\}^M$, an LPBN over $q(x)$, and a readout $(w, b)$ such that $\sup_{x \in K} |f_w(q(x)) - g(x)| \le \varepsilon$.
Proof: Let $X = q(K)$ be the finite set of codes produced by a sufficiently fine quantizer. For each $\xi \in X$, define the indicator $K_\xi(x) = \mathbb{1}[q(x) = \xi]$, i.e., the conjunction of literals that match the bits of $\xi$. Through Lemma 1, $K_\xi$ has a realization by NAND gates; chaining conjunctions produces a binary tree whose leaves are literals. Let the LPBN nodes compute $K_\xi$ for all $\xi \in X$. Choose the readout weights $w_\xi = g(x_\xi)$, where $x_\xi$ is any representative point with $q(x_\xi) = \xi$, and $b = 0$. Then $f_w(q(x)) = g(x_{q(x)})$, which matches $g$ exactly on the representatives and, for a quantizer fine enough that the oscillation of $g$ within each preimage is at most $\varepsilon$, approximates $g$ to within $\varepsilon$ on $X$ and hence on $K$. □
Corollary 1.
(Uniform approximation on compact subsets of $\mathbb{R}^d$): Let $K \subset \mathbb{R}^d$ be compact and $g: K \to \mathbb{R}$ continuous. For any $\varepsilon > 0$ there exist a finite partition $\{C_m\}$ of $K$ into axis-aligned rectangles, a binarization $q: K \to \{0,1\}^M$ that encodes the cell index in one-hot form, an LPBN over $q(x)$, and a readout such that $\sup_{x \in K} |f_w(q(x)) - g(x)| \le \varepsilon$. Proof: Continuity on the compact set $K$ implies uniform continuity. Hence there is a grid partition such that the oscillation of $g$ on each cell $C_m$ is at most $\varepsilon$. Define $q$ to output the one-hot code of the active cell. For each $m$, the indicator $\mathbb{1}[x \in C_m]$ is a conjunction of the one-hot bits that correspond to $C_m$ and is therefore realizable by NAND gates (see Lemma 1). Set the readout weight of cell $m$ to $g(x_m)$, where $x_m$ is any representative value in the cell. Then $|f_w(q(x)) - g(x)| \le \varepsilon$ on every cell, establishing uniform approximation. Each LPBN node computes a named Boolean operator on explicit literals; the proofs above constructively identify which literals (or cell bits) are used. This yields circuit-level explanations unavailable in standard continuous neurons.
4.2. Unbiased Gradients for Discrete Selectors
Define the instantaneous feedback as $F = \ell\big(f_w(h(x, S)), y\big)$ (for supervised learning) or $F = -G$, where $G$ is the return (for RL). By the log-derivative trick (REINFORCE), the gradient with respect to the selector parameters is
$$\nabla_\theta J \;=\; \mathbb{E}\big[F\, \nabla_\theta \log \pi_\theta(S \mid x)\big].$$
A Monte Carlo estimate over sampled $S$ is therefore unbiased. The readout parameter $w$ admits ordinary gradients because $f_w$ is differentiable.
Lemma 2.
(Baseline and advantage): Let $b$ be any random variable independent of the sampled $S$ conditional on $x$ (e.g., a moving average). Then
$$\mathbb{E}\big[(F - b)\, \nabla_\theta \log \pi_\theta(S \mid x)\big] \;=\; \mathbb{E}\big[F\, \nabla_\theta \log \pi_\theta(S \mid x)\big],$$
meaning that subtracting $b$ preserves unbiasedness while reducing variance [9,20]. The variance-minimizing constant baseline is
$$b^{*} \;=\; \frac{\mathbb{E}\big[F\, \|\nabla_\theta \log \pi_\theta(S \mid x)\|^2\big]}{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(S \mid x)\|^2\big]}.$$
Proof.
The score has zero mean: $\mathbb{E}_{S \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(S \mid x)] = 0$, so any term of the form $b\, \nabla_\theta \log \pi_\theta(S \mid x)$ has zero expectation.
Unbiasedness: Since $b$ is independent of $S$ given $x$, $\mathbb{E}\big[b\, \nabla_\theta \log \pi_\theta(S \mid x)\big] = \mathbb{E}[b]\, \mathbb{E}\big[\nabla_\theta \log \pi_\theta(S \mid x)\big] = 0$, so subtracting $b$ does not change the expectation.
Optimal constant: Since the mean is independent of $b$, minimizing $\mathrm{Var}(g)$ is equivalent to minimizing the second moment $\phi(b) = \mathbb{E}\big[(F - b)^2\, \|\nabla_\theta \log \pi_\theta(S \mid x)\|^2\big]$. This is a strictly convex quadratic in $b$ (unless $\nabla_\theta \log \pi_\theta = 0$ a.s.). Differentiating and setting the derivative to zero yields the unique minimizer $b^{*}$ given above, which minimizes $\phi$ and also $\mathrm{Var}(g)$. □
4.3. Entropy Regularization
Adding an entropy term $\lambda \sum_n H(\pi_{\theta_n})$ with $\lambda > 0$ yields the regularized objective $J_\lambda = J - \lambda \sum_n H(\pi_{\theta_n})$, which prevents premature selector collapse and is standard in policy-gradient practice.
For categorical selectors, Concrete/Gumbel-Softmax relaxations [10,12] enable a low-variance pathwise term coupled with a score-function correction to keep the estimator unbiased (REBAR/RELAX). Denoting the relaxed sample by $\tilde{S}_\tau$ (temperature $\tau$) and the relaxed loss by $\tilde{F} = \ell\big(f_w(h(x, \tilde{S}_\tau)), y\big)$, one can optionally replace $\tilde{F}$ by a learned control variate $c_\phi(\tilde{S}_\tau)$ (RELAX) to further reduce variance without bias.
4.4. Variance Reduction via Rao-Blackwellization
If the selector distribution factors across nodes, $\pi_\theta(S \mid x) = \prod_n \pi_{\theta_n}(s_n \mid x)$, then replacing $F$ by the conditional expectation $\mathbb{E}[F \mid s_n, x]$ in the node-local updates is a Rao-Blackwellization that strictly reduces variance. In practice, $\mathbb{E}[F \mid s_n, x]$ can be approximated by a small critic (regressor) on node-local context, equivalently a control-variate architecture [12,13].
4.5. Convergence: Stochastic Approximation
Let $g_t$ represent any of the unbiased (or asymptotically unbiased) gradient estimators above, with $\mathbb{E}[g_t] = \nabla_\theta J(\theta_t)$ and bounded second moments. With step sizes $\eta_t$ satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (Robbins–Monro conditions), the iterates $\theta_{t+1} = \theta_t - \eta_t g_t$ converge to the set of stationary points of $J$ almost surely under standard regularity (e.g., boundedness via projection) [30,31]. Entropy annealing preserves these conditions. For RL, the same conditions apply to the negative-return objective with policy-gradient estimators [9,20].
4.6. Reinforcement Learning: Policy-Gradient Form for LPBNs
Consider an episodic MDP with trajectories $\tau = (s_0, a_0, r_1, s_1, \dots)$ and return $G(\tau) = \sum_t \gamma^t r_{t+1}$. Let $\pi_\theta(a \mid s)$ be the LPBN policy (e.g., the selectors parametrize the action logits). The policy-gradient theorem specialized to the LPBN yields
$$\nabla_\theta\, \mathbb{E}[G] \;=\; \mathbb{E}\Big[\sum_t A_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],$$
where $A_t$ is any advantage signal (for example, the return-to-go minus a baseline); the proof follows the REINFORCE identity plus the Markov property [9,20]. Thus, the same estimators and variance-reduction devices from Section 4.3 and Section 4.4 apply verbatim to LPBN policies in DRL.
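To make the policy-gradient form concrete, the following self-contained sketch runs REINFORCE with an EMA baseline on a LineWorld-style chain using a tabular softmax policy. In the paper's experiments the LPBN replaces these tabular logits with selector-parametrized action logits over discretized state features, but the advantage-weighted score-function update has the same form; all specific values here are illustrative.

```python
import numpy as np

def lineworld_episode(theta, rng, L=9, horizon=40):
    """Roll out one episode of the 1-D chain: +1 at the right end, -1 at the left."""
    s, traj, G = L // 2, [], 0.0
    for _ in range(horizon):
        logits = np.eye(L)[s] @ theta                 # per-state action logits (left/right)
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(2, p=p)
        traj.append((s, a, p.copy()))
        s += 1 if a == 1 else -1
        if s == 0:      G = -1.0; break
        if s == L - 1:  G = +1.0; break
    return traj, G

def reinforce_episode_update(theta, traj, G, baseline, lr=0.1):
    """Advantage-weighted score-function update (undiscounted return)."""
    adv = G - baseline
    for s, a, p in traj:
        grad = -p; grad[a] += 1.0                     # d log softmax / d logits
        theta[s] += lr * adv * grad
    return theta

rng = np.random.default_rng(0)
theta, baseline = np.zeros((9, 2)), 0.0
for episode in range(500):
    traj, G = lineworld_episode(theta, rng)
    theta = reinforce_episode_update(theta, traj, G, baseline)
    baseline = 0.95 * baseline + 0.05 * G             # EMA baseline
```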
4.7. Attractors and Stationary Behavior Under Learning
For a fixed selector policy $\pi_\theta$, the induced PBN on the discrete state space $\{0,1\}^N$ has a transition matrix $P_\theta$. When $P_\theta$ is irreducible and aperiodic, it admits a unique stationary distribution $\mu_\theta$ (Perron–Frobenius theorem); $\mu_\theta$ concentrates mass on attractor basins in proportion to expected residence times. Learning alters $P_\theta$. Under ergodicity and mild smoothness of $P_\theta$ in $\theta$, perturbations of $\theta$ yield Lipschitz-continuous changes in $P_\theta$ and consequently in $\mu_\theta$ (Markov chain perturbation theory), which justifies gradient-based structural interventions [4,6,7,8] and our selector-probability updates: by increasing expected task reward (or decreasing loss), they reweight the attractor landscape toward desirable steady-state behavior.
4.8. Practical Implications of Sample-Efficiency and Identifiability
Because node outputs are binary, gradients pass only through selector probabilities, not through real-valued non-linearities. This produces low-dimensional stochastic updates with strong regularization by discreteness, often improving small-data generalization. Entropy bonuses and control variates are crucial to avoid premature determination of selectors, which could stall exploration and harm identifiability of useful logic features [
9,
10,
11,
12,
13,
20].
5. Experimental Setup
5.1. Tasks and Data
We evaluate LPBNs against standard ANN baselines on six canonical tasks, spanning supervised, unsupervised, and control settings.
Binary Classification (synthetic): Inputs are drawn i.i.d. from a standard normal; labels are generated by thresholding a noisy linear score of the inputs. Train/validation/test split is 70/10/20.
Regression (synthetic): Same covariates as (1); targets are a noisy linear function of the covariates. Split as above.
Clustering (Gaussian blobs): We sample K = 3 isotropic Gaussian clusters with angularly separated means (pairwise angle 120°); the labels are withheld during training and used only to compute external clustering indices on the embeddings.
Reinforcement Learning (LineWorld): A 1D chain of length L = 9 with terminal states at the ends; reward +1 on the right terminal, −1 on the left, 0 otherwise. Episodes terminate upon reaching a terminal; horizon 40.
Deep RL (CartPole): Classical inverted pendulum control with continuous state (cart position, cart velocity, pole angle, pole angular velocity) and binary actions (left/right). Episodes terminate on pole-angle or cart-position violation; horizon 280.
Text (negation sensitivity): Synthetic sentiment sentences of the form “the product is [not] [intensifier] word,” with words drawn from positive/negative polarity lexicons and labels following the composition induced by negation.
All synthetic generators (1–3, 6) are fixed across seeds by publishing the PRNG seeds used in the experiments. For control tasks (4–5), we use deterministic physics and environment seeds, reported with results.
5.2. Preprocessing and Binarization
Before binning, continuous features are standardized to zero mean and unit variance on the training split only; the same affine map is applied to validation and test. We then perform equal-width binning per feature into
B bins (default
B = 5; the ablation in Section 7.1 also uses B ∈ {3, 7}), and one-hot encode each bin, yielding a binary vector $q(x) \in \{0,1\}^{dB}$ with exactly one active bit per original feature.
For CartPole we discretize each state component within a fixed physical range into
B = 8 bins per variable. LineWorld states are already one-hot.
5.3. Models
5.3.1. LPBN
An LPBN with N nodes samples selector triples $(a_n, b_n, o_n)$, with $a_n, b_n$ drawn from categorical distributions over the $dB$ input bits and $o_n$ drawn from a categorical distribution over {AND, OR, XOR, NAND}. Given $q(x)$, node $n$ computes $h_n = o_n\big(q_{a_n}(x), q_{b_n}(x)\big)$. A linear head on $h$ yields the logits for classification/regression. For RL, a policy head is used; for value baselines we use a linear critic on $h$. For a policy $\pi$ mapping states to actions in the given environment, $V^{\pi}$ denotes the state-value function under $\pi$, i.e., the expected cumulative reward obtained starting from the initial state when following $\pi$.
5.3.2. ANN Baselines
For supervised (1–2, 6): A 1-hidden-layer MLP with tanh non-linearity, width chosen to match LPBN parameter count within ±10% (details below). For clustering (3), the same MLP used as feature extractor; k-means is run on the hidden activations. For RL (4), a 1-hidden-layer softmax policy (same width matching rule). For DRL (5), a 2-hidden-layer MLP (40–40) policy, a common small baseline for CartPole.
Capacity matching: For each task, we select N (LPBN nodes) and MLP hidden width H so the total number of trainable parameters (selector logits + readout for LPBN; weights/biases for MLP) differs by at most 10%. Exact counts are reported with results.
5.4. Training Objectives
Classification/Text. Binary cross-entropy on f(h) (LPBN) or MLP logits.
Regression. Mean squared error.
Clustering. Unsupervised: Train the feature extractor to reconstruct inputs with a shallow linear decoder (identical for LPBN/MLP to avoid bias), then apply k-means on embeddings and evaluate externally (no label leakage).
RL/DRL. Episodic policy gradient with undiscounted return (LineWorld) and γ = 0.99 (CartPole). No entropy regularization is used at evaluation time; see below for training-time entropy.
5.5. Optimization and Variance Reduction
5.5.1. LPBN Selectors
Selector distributions are categorical with logits $\theta_n$. We optimize them with the score-function estimator (REINFORCE) using an exponential moving average (EMA) baseline $\bar{b}$, where $F$ is the negative loss (supervised) or the return (RL); the EMA decay is held fixed unless explicitly stated. An entropy bonus is applied to prevent premature selector collapse, with the coefficient $\lambda$ linearly annealed from its initial value over the first half of training (initial values differ between supervised/text and RL/DRL). Readout weights $w$ (and the policy/value heads in RL) are trained with standard gradients. For ablations we include Gumbel-Softmax relaxations (temperature annealed from 1.0) and report their effect.
5.5.2. ANN
For ANNs, all MLPs are trained with full-batch or mini-batch SGD (task-dependent batch sizes below) and fixed learning rates chosen by a small grid on the validation split; once chosen, the same rate is used for all seeds.
5.5.3. Learning Rates and Schedules
Learning rates are fixed per task and per model: supervised/text and clustering share one setting, and RL (LineWorld) and DRL (CartPole) use their own settings for the LPBN (structure and policy head) and the ANN; the exact values are logged in the run manifests (Section 5.9). We clip advantages to [−5, 5] and clip gradient norms on RL/DRL to avoid occasional large updates.
5.6. Budgets and Training Protocol
Classification/Regression/Text: 50 epochs; batch size 64; early stopping disabled (fixed budget for fairness).
Clustering: 50 epochs; batch size 64; embeddings are taken from the last epoch (no model selection).
LineWorld: 500 episodes per seed; horizon 40.
CartPole: 300 episodes per seed; horizon 280; reporting uses a moving average over the last 100 episodes.
All models are trained with three independent seeds per configuration in the minimal runs and ten seeds in the main paper results, using the same seed set for LPBN and ANN to enable paired tests.
5.7. Evaluation Metrics
Classification/Text: Accuracy (primary) and AUROC (secondary).
Regression: RMSE on the test split.
Clustering: Adjusted Rand Index (primary), Silhouette and Calinski–Harabasz (secondary).
RL/DRL: Average episodic return; for CartPole we also report the fraction of episodes reaching the 200-step benchmark.
All reported test metrics are computed once per seed at the end of training (no multiple restarts per seed unless explicitly stated). For RL/DRL we report the mean of the last-100-episode moving average to smooth episodic variance.
5.8. Statistical Testing and Uncertainty
For every task we compute the paired difference $\Delta_s$ for seed $s$. We report the across-seed mean $\bar{\Delta}$, a bootstrap 95% confidence interval for the mean (10,000 bootstrap resamples with replacement), and a two-sided exact sign-test p-value on $\{\Delta_s\}$. For regression we use −RMSE as the score so that larger is better uniformly across tasks; this keeps the sign of $\Delta_s$ consistent. When testing all six tasks jointly, we apply a Holm–Bonferroni correction to the sign-test p-values.
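A sketch of the paired statistics described above, assuming arrays of per-seed scores for both models; the exact resampling and test code used in the paper may differ.

```python
import numpy as np
from math import comb

def paired_stats(scores_ann, scores_lpbn, n_boot=10_000, seed=0):
    """Across-seed mean difference, bootstrap 95% CI, and exact two-sided sign test."""
    rng = np.random.default_rng(seed)
    delta = np.asarray(scores_ann) - np.asarray(scores_lpbn)    # Delta_s = ANN - LPBN
    n = len(delta)
    boots = np.array([
        rng.choice(delta, size=n, replace=True).mean() for _ in range(n_boot)
    ])
    ci = np.percentile(boots, [2.5, 97.5])
    wins = int((delta > 0).sum())
    m = int((delta != 0).sum())                                  # ignore exact ties
    k = min(wins, m - wins)
    p = 2.0 * sum(comb(m, i) for i in range(k + 1)) / 2 ** m     # exact binomial tail
    return delta.mean(), ci, min(p, 1.0)
```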
5.9. Implementation and Reproducibility
All models are implemented in pure Python 3.14.0/NumPy 2.3.4 with a fixed PRNG stack (Python ‘random’ and NumPy’s PCG64). No GPU or third-party autodiff is used for LPBN; ANNs use the same linear algebra backend (no vendor-specific kernels) to avoid hardware confounds. We fix all seeds at the start of each run, log every hyperparameter and metric, and export run manifests (JSON) sufficient for full reproduction (dataset seeds; bin edges; model sizes; learning rates; entropy schedules; number of episodes/epochs; early-stop flags = off; clip thresholds; and evaluation windows). Environment dynamics for CartPole follow the canonical equations of motion; thresholds and the integrator timestep match commonly used benchmarks.
5.10. Fairness Checks and Ablations
To isolate representation (logical vs. continuous) from capacity and budget effects, we enforce (i) parameter-count matching within ±10%, (ii) identical training budgets, and (iii) identical seed sets. We further run ablations on the number of bins B; LPBN node count N; entropy schedule ; and (for DRL) replacing REINFORCE by Concrete relaxations. These ablations quantify sensitivity and verify that conclusions are robust to reasonable hyperparameter variation.
6. Results
6.1. Reporting Conventions
Unless otherwise stated, results are means across independent seeds with 95% bias-corrected bootstrap CIs (10k resamples). For pairwise model comparisons we compute paired seed-wise differences $\Delta_s = \text{score}_{\text{ANN}}(s) - \text{score}_{\text{LPBN}}(s)$ and report the across-seed mean, its bootstrap CI, and a two-sided exact sign test.
For regression we use −RMSE as a score so that larger is always better. Multiple comparisons across tasks are controlled with Holm–Bonferroni. Where appropriate we perform TOST equivalence tests against pre-registered non-inferiority margins ($\varepsilon_{\text{acc}} = 0.02$ for classification; $\varepsilon = 0.05 \times |\text{mean ANN score}|$ for regression).
Table 1 reports the across-seed paired significance results for the six tasks.
6.2. Binary Classification (Synthetic)
Across datasets with axis-aligned or rule-like boundaries, LPBNs achieve test accuracy statistically indistinguishable from ANNs under TOST (equivalence accepted at the pre-registered margin). On oblique, low-noise boundaries, the ANN retains a small but significant edge (Holm-adjusted
p < 0.05); increasing bin count
B and node count
N narrows this gap. The LPBN matches the ANN when the target can be expressed with sparse logical interactions in the binned space; otherwise, the approximation bias from discretization dominates. As shown in
Figure 1, LPBN matches ANN on axis-aligned boundaries; small gaps appear on oblique cases.
6.3. Regression (Synthetic)
LPBN reduces test RMSE relative to a mean baseline and approaches ANN performance as
B and
N increase, but ANN remains superior on smooth targets. Paired Δ favors ANN with small-to-moderate effect sizes; non-inferiority is often accepted for
B ≥ 6. Piecewise-constant bias is the main limiter; widening the logical basis compensates but increases variance.
Figure 2 summarizes −RMSE; LPBN narrows the gap as bin granularity increases.
6.4. Clustering/Representation (Gaussian Blobs)
Running k-means on LPBN features yields ARI comparable to ANN hidden activations; CIs overlap and sign tests are non-significant after Holm correction. Silhouette and Calinski–Harabasz agree, indicating well-separated embeddings in both models. Discrete logical features provide usable geometry for downstream unsupervised tasks.
In
Figure 3, ARI confidence intervals overlap; sign tests are non-significant.
6.5. Reinforcement Learning (LineWorld)
Both models learn a right-leaning policy. LPBN’s return curve lags during exploration but reaches the ANN’s plateau as selector entropies anneal; last-100-episode averages show no significant difference (Holm-adjusted p > 0.05). Advantage-weighted selector updates are sufficient for simple sparse-reward control provided entropy warm-up is used.
Figure 4 shows convergent returns and matching plateaus; entropy warm-ups matter early.
6.6. Deep RL (CartPole)
With discretized observations (
B = 8 per state coordinate), LPBN attains stable control. Final moving-average return is within the pre-specified equivalence band relative to a two-layer MLP policy on most seeds; on runs with early selector collapse, performance degrades unless entropy is annealed slowly. For low-dimensional control, LPBN policies are competitive when exploration is maintained; collapse prevention is critical. As per
Figure 5, LPBN reaches the 200-step threshold on most seeds; slow entropy annealing prevents collapse.
6.7. Text Classification (Negation Sensitivity)
LPBN captures negation via XOR/NAND combinations of the “not” token with polarity words, achieving test accuracy on par with the ANN baseline; feature audits show high-weight nodes implementing interpretable negation rules. LPBN’s explicit logical operators are advantageous for compositional linguistic phenomena (negation, conjunction). We extract the most influential Boolean units by absolute head weight and decode their MAP structures. Examples below show how LPBN captures negation patterns directly (
Figure 6).
CONST1 and
CONST0 allow unary behaviors (identity/NOT via XOR/AND).
Table 2 reports paired, seed-matched comparisons between the ANN baseline and the LPBN on each task. For every task we show the mean test score for each model with bootstrap 95% CIs, and the paired difference Δ = ANN − LPBN with its 95% CI; because all metrics are on a “higher-is-better” scale (we report −RMSE for regression), positive
Δ favors the ANN and negative
Δ favors the LPBN. The sign test
p (with Holm adjustment across tasks) assesses whether one model wins more seeds than the other; small
p values indicate a consistent winner. To evaluate
practical parity, we use TOST with pre-specified equivalence margins ($\varepsilon_{\text{acc}}$ = 0.02 for accuracy; for regression, ε = 0.05 × ∣mean ANN score∣). The Equivalence column is “Yes” only when both one-sided tests are significant at α = 0.05, i.e., the observed Δ is statistically confined within the margin. In short, if the Δ CI includes 0 and Equivalence = “Yes,” the models are practically indistinguishable on that task; if the Δ CI excludes 0 and the sign-test
p is small, the model indicated by the sign of Δ is the consistent winner across seeds.
7. Ablations
7.1. Binning Granularity B
Performance improves from B = 4→6 and plateaus by B = 8. Coarse binning underfits (excess bias); very fine binning induces sparsity, increasing gradient variance and harming stability. A U-shaped validation curve is observed; B ∈ {5, 6} is a robust default.
7.2. Operator Set
Removing XOR reduces accuracy on parity-like interactions (text negation, synthetic parity), confirming its necessity. NAND-only remains functionally complete but is less sample-efficient due to deeper effective circuits.
7.3. Number of Nodes N
Larger N reduces bias but raises variance; with stronger baselines and entropy warm-up, returns/accuracies improve monotonically up to the parameter-matched budget. Variance manifests as wider seed dispersion, especially in RL.
7.4. Variance Reduction
Ablating the EMA baseline causes unstable training and early selector collapse; adding entropy warm-up reliably improves last-five averages. Gumbel–Softmax relaxations (annealed τ) further reduce variance at the cost of a small bias; in DRL this trade-off is beneficial.
8. Interpretability
We decode each node’s MAP structure from the selector posteriors and pair it with its readout weight $w_n$. Rules are reported as follows:
Rule $n$: $o_n^{\mathrm{MAP}}\big(q_{a_n^{\mathrm{MAP}}},\, q_{b_n^{\mathrm{MAP}}}\big)$ with weight $w_n$.
We rank rules by $|w_n|$ and compute coverage (the percentage of test points for which the antecedent fires) and precision (class-conditional accuracy when firing). On text, high-precision rules capture not AND positive and not AND negative patterns; on classification, rules align with axis-aligned thresholds discovered by the ANN’s first layer but are human-readable without surrogate extraction.
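A sketch of the rule-decoding step, assuming per-node selector logits, a linear readout weight vector, and a helper list of human-readable bin labels (all names here are illustrative):

```python
import numpy as np

OP_NAMES = ["AND", "OR", "XOR", "NAND"]

def decode_rules(logits_a, logits_b, logits_o, w, bin_labels, top_k=5):
    """Return the top-k MAP rules ranked by absolute readout weight.

    logits_a, logits_b : (N, dB) input-selector logits per node
    logits_o           : (N, 4) operator-selector logits per node
    w                  : (N,) readout weights
    bin_labels         : list of dB readable names, e.g. "x3 in bin 2"
    """
    order = np.argsort(-np.abs(w))[:top_k]
    rules = []
    for n in order:
        a = int(np.argmax(logits_a[n]))          # MAP first input
        b = int(np.argmax(logits_b[n]))          # MAP second input
        o = OP_NAMES[int(np.argmax(logits_o[n]))]  # MAP operator
        rules.append(f"Rule {n}: {bin_labels[a]} {o} {bin_labels[b]} (weight {w[n]:+.3f})")
    return rules
```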
9. Case Study: Modeling a Cell Cycle Network with LPBNs
9.1. Network Description and Setup
We consider the mammalian cell-cycle regulatory network—a 10-gene Boolean model governing progression through G1, S, G2, and M phases [
32]. Nodes include Cyclin D (CycD), Cyclin E (CycE), Retinoblastoma (Rb), E2F, Cyclin A (CycA), p27, Cdc20, UbcH10, Cdh1, and Cyclin B (CycB). The network’s logical rules encode known regulatory interactions (e.g.,
CycE activates when E2F is high and Rb is inactive;
Rb remains active in the absence of cyclins or in the presence of p27;
E2F is released when Rb is off), giving rise to characteristic attractors—notably a quiescent fixed point and an oscillatory 7-state cycle—commonly used as benchmarks in Boolean/PBN studies [
32,
33].
To cast this deterministic Boolean model as a probabilistic Boolean network (PBN), we introduce
random perturbations with a small per-bit flip probability
p at each time step, yielding a finite, ergodic Markov chain with a well-defined stationary distribution over the $2^{10}$ states [
3]. Intuitively, the stationary mass concentrates near the two attractor basins. We generated synthetic trajectories (e.g., $10^{5}$ time steps from random initial states, with
p = 0.001) and recorded state transitions as training data.
We configured an LPBN with the same 10 Boolean nodes to learn the state-transition dynamics. Each node stochastically selects two inputs and a Boolean operator (AND/OR/XOR/NAND), as per our method, and each gene’s next-state probability is produced by a logistic head. We optimized model parameters to minimize cross-entropy between LPBN one-step transition probabilities and the empirical transitions from the real network. Structural parameters (input indices and operator choice) were trained with policy-gradient updates, enabling the LPBN to mimic the original rule-based dynamics.
9.2. Performance Metrics
We evaluated the learned LPBN against the real network on (i) attractor structure (extracted by simulating with
p = 0 to identify recurrent states and cycles) and (ii) steady-state distribution (long-run frequencies under small perturbations,
p = 0.001). For the distributional comparison, we computed the Kullback–Leibler divergence between stationary distributions and reported attractor-level probabilities (mass on the fixed point; total mass across the 7 cycle states), a standard practice in PBN analyses [
2,
33].
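A sketch of the distributional comparison, assuming explicit $2^{10} \times 2^{10}$ row-stochastic transition matrices for the reference PBN and the learned LPBN (building those matrices is feasible at this state-space size; names are illustrative):

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power-iterate the row-stochastic transition matrix P to its stationary vector."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        nxt = mu @ P
        if np.abs(nxt - mu).sum() < tol:
            break
        mu = nxt
    return mu

def kl_divergence(mu_ref, mu_model, eps=1e-12):
    """KL(mu_ref || mu_model) between the two stationary distributions."""
    p = mu_ref + eps
    q = mu_model + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```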
9.3. Results and Findings
9.3.1. Attractor Recovery
After training, the LPBN reproduced the same two attractors observed in the biological model: a 7-state oscillatory cycle and a quiescent fixed point. The qualitative gene-activation patterns within each attractor matched the known logic: CycD persistently high across the cycle; CycE → CycA → CycB activation sequence; Cdh1/Cdc20 controlling Cyclin degradation; and a fixed point characterized by high Rb, p27, and Cdh1. Learned per-node rules were interpretable (e.g.,
CycE effectively captured ¬Rb ∧ E2F;
CycB activated when ¬(Cdh1 ∨ Cdc20)), reflecting the original rule set [
32].
9.3.2. Steady-State Distribution
With small perturbations, the learned LPBN’s stationary distribution closely matched that of the real PBN, allocating substantial probability to the quiescent state and comparatively small but non-negligible mass to the 7 cycle states; the KL divergence between full distributions was small, indicating a high-fidelity approximation of global dynamics. In short, the LPBN internalized the rule-based behavior and reproduced both long-run distributions and attractor structure observed in the benchmark network [
2,
33].
9.4. Conclusions
This case study shows that an LPBN can be trained to faithfully replicate a real PBN’s dynamics for a nontrivial biological network, preserving logic-level interpretability while achieving data-driven fidelity in both attractors and stationary behavior [
2,
32,
33].
10. Discussion
LPBNs validate the hypothesis that probabilistic structure can be trained as a policy: selector mass concentrates on useful inputs/operators while a linear/policy head adapts with standard gradients. The method does the following:
Approaches ANN performance on tasks with logical or thresholded structure;
Remains competitive in low-dimensional control when exploration is preserved;
Yields rule-level explanations by construction.
Where targets are smooth and highly continuous, ANNs retain an advantage unless B and N are increased, which in turn demands stronger variance control.
11. Limitations
Discretization bias: Piecewise-constant approximations limit regression on smooth functions.
Gradient variance: Discrete selectors require careful baselines/entropy; without them, collapse can occur.
Scale-up: The structural search space grows with the number of input bits (d × B) and with depth; multi-layer LPBNs magnify this.
Benchmark scope: Results emphasize tabular/small-state control; extension to high-dimensional vision is non-trivial.
Logical transparency aids auditability and post hoc analysis, but discretization may encode proxy attributes even when sensitive fields are removed. We recommend reporting group-wise metrics, running counterfactual tests on bin boundaries, and documenting data provenance and licenses. When models are used for decisions affecting people, fairness assessments and error analyses per subgroup are required.
While our experiments demonstrate the effectiveness of the proposed LPBN model on synthetic and machine learning benchmarks, we acknowledge the importance of validating the method on real-world probabilistic Boolean networks, such as those derived from gene regulatory data. Future work will focus on applying the framework to biologically validated PBNs like those reported in [
2], where attractor dynamics and perturbation analysis are critical. We anticipate that the LPBN’s differentiable structure optimization will allow better model fitting and more robust prediction in such domains.
12. Conclusions
This work introduced Learning Probabilistic Boolean Networks (LPBNs): stochastic, rule-based learners that optimize discrete structural choices (input selection and Boolean operator per node) with policy-gradient estimators, while training continuous heads (linear/policy/value) by standard gradients. The formulation unifies the classical PBN view, as a finite-state Markov chain with attractor structure, with modern variance-reduced learning for discrete decisions. In supervised learning, LPBNs minimize expected loss over selector policies; in DRL, the same policies serve as stochastic controllers over logical features of the state.
12.1. Methodological Contributions
A task-level objective that marginalizes over selector distributions, yielding unbiased score-function gradients with EMA baselines, entropy regularization, and optional REBAR/RELAX control variates;
A factorized selector parameterization enabling node-wise Rao–Blackwellization;
A simple, reproducible training recipe (binning, variance control, entropy annealing) that stabilizes discrete structure learning across tasks;
An interpretability protocol that decodes node-level rules (input literals and operator) with coverage/precision statistics.
12.2. Findings
Across six canonical settings—binary classification, regression, clustering (ARI), RL on LineWorld, DRL on CartPole, and text negation—LPBNs are competitive with ANNs when targets admit sparse logical structure in the binned space, and they remain viable controllers for low-dimensional DRL with proper exploration control. Where targets are smooth and highly continuous, ANNs retain an advantage unless bin granularity and node count are increased, which in turn elevates gradient variance—precisely the regime where control variates help. Clustering results show that discrete logical features can support geometry suitable for downstream unsupervised tasks. In text, LPBN rules recover interpretable negation patterns (e.g., NAND/XOR compositions of “not” with polarity lexemes) without surrogate models.
12.3. Interpretability and Auditability
Unlike continuous neurons, each LPBN node computes an explicit, named Boolean operator over identified input bins. This yields direct, local explanations (human-readable propositions with weights), facilitating audits and error analyses (e.g., rule coverage/precision on test data) and enabling targeted structural interventions in the spirit of classical PBN control.
12.4. Limitations and Threats to Validity
LPBNs inherit discretization bias: piecewise-constant approximations can underfit smooth functions unless binning is sufficiently fine. The search over discrete structures introduces gradient variance; without baselines and entropy warm-up, selectors can collapse prematurely. Our evaluations emphasize tabular/small-state control and synthetic text; extension to high-dimensional perception (vision/audio) is non-trivial. Finally, while parameter counts and training budgets were matched, architectural inductive biases differ; we therefore used paired seeds and nonparametric tests, but residual confounders cannot be ruled out. Future work should also continue the line of research started in [
34,
35,
36,
37,
38,
39] to refine the LPBN model and adjust it for different scenarios.