This paper provides the mathematical foundations and formal proofs underlying the Distributed Autonomous Neuron (DAN) Theory. We address in full the six open problems identified in the prior framework: (1) we formally define the binding function $R(n_j, \text{context})$ via a phase-coherence operator on the temporal spike graph, and prove it satisfies local autonomy without a central coordinator; (2) we prove the global termination condition emerges from local energy descent without requiring a central aggregator, via a Lyapunov stability argument; (3) we specify and prove properties of the node transformation function $\phi$ as a parameterized gated recurrent unit with local gradient dynamics; (4) we provide a formal developmental initialization model grounded in random graph theory and genetic encoding constraints; (5) we derive necessary and sufficient hardware substrate conditions from first principles; and (6) we propose a complete empirical validation protocol with falsifiable predictions. Together these results elevate DAN Theory from a conceptual framework to a mathematically complete and testable model of biological cognition.
We work in the framework of the dynamic directed weighted graph $G = (N, E, W, T)$ defined in the prior paper. We extend this with the following formal objects required for proofs.
For neuron $n_i \in N$, its spike train over time interval $[0, \mathcal{T}]$ is the measure: $$\sigma_i(t) = \sum_{k} \delta(t - t_i^{(k)})$$ where $t_i^{(k)}$ is the time of the $k$-th spike of $n_i$, and $\delta$ is the Dirac delta. The spike train lives in the space $\mathcal{S}$ of tempered distributions on $\mathbb{R}^+$.
The complete local state of node $n_i$ at time $t$ is the quadruple: $$s_i(t) = \bigl(x_i(t),\; w_i(t),\; d_i(t),\; P_i(t)\bigr) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_w} \times \mathcal{D} \times \Delta^{|A(n_i)|}$$ where $\Delta^k$ denotes the probability simplex over $k$ outcomes (here, distributions over the $|A(n_i)|$ neighbors), and $\mathcal{D}$ is the space of local data representations.
A thought-traversal $\mathcal{T}_q$ initiated at trigger node $n_0$ is a stochastic process on $G$: $$\mathcal{T}_q = \{(n_{\tau(0)}, t_0), (n_{\tau(1)}, t_1), \ldots\}$$ where $\tau$ is the (random) sequence of activated node indices and $t_k$ are activation times. The traversal is a non-homogeneous Markov chain on $N$ with transition kernel $K_i(n_j \mid s_i, \text{context})$.
The system satisfies local autonomy if and only if: for every node $n_i$, its state update $s_i(t + \Delta t)$ is a function only of $s_i(t)$ and signals received from $A(n_i) \cup A^{-1}(n_i)$ (immediate neighbors and predecessors). Formally: $$s_i(t+\Delta t) = F_i\bigl(s_i(t),\; \{m_{ji}(t) : n_j \in A(n_i) \cup A^{-1}(n_i)\}\bigr)$$ where $m_{ji}$ is the message from $n_j$ to $n_i$ (a forward spike from a predecessor or a feedback message from a successor), and $F_i$ depends on no global state.
We will use $\|\cdot\|$ for the Euclidean norm, $\langle \cdot, \cdot \rangle$ for inner product, $\mathbb{E}[\cdot]$ for expectation, and $\mathbf{1}[\cdot]$ for indicator functions throughout.
The first and primary open problem of DAN Theory was the formal definition of the contextual relevance function $R(n_j, \text{context})$: the mechanism by which a node's local probability distribution is modulated by non-local traversal context, without violating local autonomy. We resolve this completely by defining binding as a phase-coherence operator on the temporal spike graph.
For neuron $n_i$ with spike train $\sigma_i(t)$, define the instantaneous phase $\theta_i(t) \in [0, 2\pi)$ via the Hilbert transform of the band-pass filtered spike train in the $\gamma$-band $[30, 80]$ Hz: $$\theta_i(t) = \arg\bigl(\mathcal{H}[\sigma_i^{(\gamma)}](t)\bigr)$$ where $\sigma_i^{(\gamma)}$ is the $\gamma$-band component and $\mathcal{H}$ denotes the Hilbert transform.
The phase-coherence between neurons $n_i$ and $n_j$ over window $[t - \Delta, t]$ is: $$\rho_{ij}(t) = \left| \frac{1}{\Delta} \int_{t-\Delta}^{t} e^{i(\theta_i(s) - \theta_j(s))}\, ds \right| \in [0,1]$$ $\rho_{ij} = 1$ indicates perfect phase-locking; $\rho_{ij} = 0$ indicates incoherence.
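The computation in Definitions 2.1 and 2.2 is mechanical. The following minimal Python sketch makes it concrete, assuming spike trains binned at 1 kHz and using `scipy` for the band-pass filter and Hilbert transform; the window length and firing rates are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

FS = 1000.0  # sampling rate in Hz (1 ms bins)

def gamma_phase(spikes, lo=30.0, hi=80.0):
    """Instantaneous gamma-band phase theta(t) of a binned spike train."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=FS, output="sos")
    gamma = sosfiltfilt(sos, spikes)           # sigma^(gamma): gamma-band component
    return np.angle(hilbert(gamma))            # arg(H[sigma^(gamma)](t))

def phase_coherence(theta_i, theta_j, window):
    """rho_ij over the trailing `window` samples (Definition 2.2)."""
    dphi = theta_i[-window:] - theta_j[-window:]
    return np.abs(np.mean(np.exp(1j * dphi)))  # in [0, 1]

# Two trains driven by a common 40 Hz rhythm should be strongly coherent.
t = np.arange(0, 1.0, 1.0 / FS)
drive = 0.5 * (1.0 + np.sin(2 * np.pi * 40.0 * t))
rng = np.random.default_rng(0)
s_i = (rng.random(t.size) < 0.10 * drive).astype(float)
s_j = (rng.random(t.size) < 0.10 * drive).astype(float)
rho = phase_coherence(gamma_phase(s_i), gamma_phase(s_j), window=200)
print(f"rho_ij = {rho:.2f}")  # markedly higher than for independent trains
```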
For threshold $\rho^* \in (0,1)$, the coherence neighborhood of $n_i$ at time $t$ is: $$\mathcal{C}_i(t) = \{n_j \in N : \rho_{ij}(t) \geq \rho^*\}$$ This is the set of neurons currently phase-locked with $n_i$. Note: $n_j \in \mathcal{C}_i(t)$ if and only if $n_i \in \mathcal{C}_j(t)$ — the relation is symmetric.
We can now define the binding function formally. The key insight: a node computes its coherence neighborhood $\mathcal{C}_i(t)$ using only its own spike train and the spike trains received from its neighbors (which arrive as synaptic input). This requires no global state.
The contextual relevance function $R : N \times \mathcal{C} \to \mathbb{R}$ is defined as: $$R(n_j, \mathcal{C}_i(t)) = \rho_{ij}(t) \cdot \mathbf{1}[n_j \in \mathcal{C}_i(t)] \cdot \exp\!\left(-\frac{\|w_{ij}\|}{\lambda}\right)$$ where $\lambda > 0$ is a decay constant governing the synaptic distance penalty.
The full probability distribution over next-node selection becomes: $$P_i(n_j \mid \mathcal{C}_i(t)) = \frac{\exp\bigl(W(n_i \to n_j) + \beta \cdot R(n_j, \mathcal{C}_i(t))\bigr)}{\sum_{n_k \in A(n_i)} \exp\bigl(W(n_i \to n_k) + \beta \cdot R(n_k, \mathcal{C}_i(t))\bigr)}$$ where $\beta > 0$ controls the influence of coherence on selection probability.
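A minimal numerical sketch of Definitions 2.4 and 2.5 follows; the neighbor values and the settings for $\lambda$, $\beta$, and $\rho^*$ are illustrative placeholders, since the theory treats them as free parameters.

```python
import numpy as np

def relevance(rho, in_neighborhood, w_norm, lambda_=1.0):
    """R(n_j, C_i(t)) = rho_ij * 1[n_j in C_i] * exp(-||w_ij|| / lambda)."""
    return rho * float(in_neighborhood) * np.exp(-w_norm / lambda_)

def next_node_probs(W_edge, rho, w_norm, rho_star=0.5, beta=2.0):
    """P_i(n_j | C_i): softmax over neighbors of edge weight + beta * R."""
    R = np.array([relevance(r, r >= rho_star, w) for r, w in zip(rho, w_norm)])
    logits = W_edge + beta * R
    e = np.exp(logits - logits.max())          # numerically stabilized softmax
    return e / e.sum()

# Three neighbors with equal edge weights: the phase-locked one gains mass.
print(next_node_probs(W_edge=np.array([1.0, 1.0, 1.0]),
                      rho=np.array([0.9, 0.3, 0.1]),
                      w_norm=np.array([0.5, 0.5, 0.5])))
```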
The binding function $R(n_j, \mathcal{C}_i(t))$ as defined in Definition 2.4 satisfies the local autonomy condition of Definition 1.4. That is, node $n_i$ can compute $R(n_j, \cdot)$ for all $n_j \in A(n_i)$ using only locally available information.
We must show that $n_i$ can compute $\rho_{ij}(t)$ for each $n_j \in A(n_i)$ without global state.
Step 1. Node $n_i$ has direct access to its own spike train $\sigma_i(t)$ and hence can compute $\theta_i(t)$ via a local Hilbert transform (a causal filter applied to its own firing history).
Step 2. For each neighbor $n_j \in A(n_i)$, the post-synaptic potential at $n_i$ from $n_j$ is: $$u_{ji}(t) = \int_0^\infty h(\tau)\, \sigma_j(t - \tau)\, d\tau$$ where $h(\tau)$ is the synaptic kernel. Since $n_j \to n_i$ is a direct synaptic connection, $\sigma_j(t)$ arrives at $n_i$ as post-synaptic input — this is precisely the message $m_{ji}(t)$ in Definition 1.4.
Step 3. From $u_{ji}(t)$, node $n_i$ can reconstruct $\sigma_j^{(\gamma)}(t)$ by applying the same $\gamma$-band filter, then compute $\theta_j(t) = \arg(\mathcal{H}[u_{ji}^{(\gamma)}](t))$ locally. The synaptic kernel $h$ introduces a fixed phase offset, but this leaves $\rho_{ij}$ unchanged: a constant offset factors out of the modulus in Definition 2.2.
Step 4. With $\theta_i(t)$ and $\theta_j(t)$ both computable from local information, $\rho_{ij}(t)$ is computed by the integral in Definition 2.2 over a sliding window — a purely local operation on $n_i$'s state.
Step 5. The indicator $\mathbf{1}[n_j \in \mathcal{C}_i(t)]$ and the weight $\|w_{ij}\|$ are both elements of $n_i$'s local state $s_i(t)$.
Therefore $R(n_j, \mathcal{C}_i(t))$ is a function only of $\{s_i(t), m_{ji}(t)\}$, satisfying Definition 1.4. $\square$
Let $\mathcal{T}_q$ be a thought-traversal with active subgraph $S_q(t) \subseteq N$. Then the set of nodes simultaneously phase-locked at threshold $\rho^*$ forms a coherent assembly:
$$\mathcal{A}_q(t) = \{n_i \in S_q(t) : \forall n_j \in S_q(t),\; \rho_{ij}(t) \geq \rho^*\}$$
and the following hold:
(a) $\mathcal{A}_q(t)$ is non-empty whenever $S_q(t)$ is active.
(b) Distinct traversals $\mathcal{T}_q$ and $\mathcal{T}_r$ produce distinct coherent assemblies: $\mathcal{A}_q(t) \cap \mathcal{A}_r(t) \subsetneq \mathcal{A}_q(t)$ in general, even when $S_q \cap S_r \neq \emptyset$.
(c) The coherent assembly $\mathcal{A}_q$ uniquely identifies the traversal context for shared nodes — resolving superposition.
Part (a). For a traversal $\mathcal{T}_q$ to be active, there must exist at least one trigger node $n_0 \in S_q$. By definition, $\rho_{00}(t) = 1 \geq \rho^*$ trivially. For connected paths in $S_q$, the synaptic propagation of spikes induces correlated firing: if $n_i \to n_j$ is an active edge, the post-synaptic response at $n_j$ is a delayed version of $n_i$'s signal, and the phase difference $\theta_i(t) - \theta_j(t)$ converges to a constant phase lag $\Delta\phi_{ij}$ determined by the synaptic delay. By the theory of coupled oscillators (Kuramoto, 1984), this phase-locked state is stable when synaptic coupling strength exceeds the natural frequency dispersion: $K_{ij} > |\omega_i - \omega_j|$. We assume this condition holds for active edges by the weight maintenance property. Hence $\mathcal{A}_q(t) \supseteq \{n_0\}$, so it is non-empty.
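The stability condition invoked above can be checked numerically. The sketch below integrates the phase-gap equation for a symmetrically coupled pair, $\dot{\psi} = \Delta\omega - 2K\sin\psi$, whose locking threshold is $|\Delta\omega|/2$; the unidirectional condition $K_{ij} > |\omega_i - \omega_j|$ quoted above is the corresponding one-way bound.

```python
import numpy as np

def phase_gap_trace(K, dw=2.0 * np.pi, dt=1e-4, T=3.0):
    """Integrate dpsi/dt = dw - 2K sin(psi) and return the wrapped gap."""
    psi, trace = np.pi / 2, []
    for _ in range(int(T / dt)):
        psi += dt * (dw - 2.0 * K * np.sin(psi))
        trace.append(np.angle(np.exp(1j * psi)))   # wrap to (-pi, pi]
    return np.array(trace)

# dw = 2*pi rad/s (a 1 Hz detuning); locking threshold here is dw/2 ~ 3.14.
for K in (1.0, 10.0):
    tail = phase_gap_trace(K)[-5000:]              # last 0.5 s
    state = "locked" if tail.std() < 1e-3 else "drifting"
    print(f"K = {K:5.1f} rad/s: phase-gap std = {tail.std():.4f} ({state})")
```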
Part (b). Two distinct traversals $\mathcal{T}_q$ and $\mathcal{T}_r$ are initiated at different times $t_q \neq t_r$ or different triggers. The oscillatory phase $\theta_i(t)$ of a shared node $n_s \in S_q \cap S_r$ advances continuously. Since $t_q \neq t_r$, the phases accumulated under $\mathcal{T}_q$ and $\mathcal{T}_r$ differ by $\Delta\phi = \omega_s(t_r - t_q)$. Therefore $\rho$ between $n_s$ under context $q$ and any node $n_j$ specific to $r$ is: $$\rho_{sj}^{(q)} = |\mathbb{E}[e^{i(\theta_s^{(q)} - \theta_j^{(r)})}]| = |\mathbb{E}[e^{i(\theta_s^{(r)} + \Delta\phi - \theta_j^{(r)})}]|$$ The phase offset $\Delta\phi$ desynchronizes $n_s$ from $\mathcal{A}_r$ if $\Delta\phi \notin 2\pi\mathbb{Z}$, which holds generically. Hence $\mathcal{A}_q \neq \mathcal{A}_r$.
Part (c). For shared node $n_s$, its membership in assembly $\mathcal{A}_q$ vs $\mathcal{A}_r$ is determined by its current phase $\theta_s(t)$, which is set by which traversal most recently drove its firing. The binding function $R(n_s, \mathcal{C}_s(t))$ thus evaluates differently in each traversal context — assigning high relevance to neighbors co-active in the same assembly and low relevance to neighbors in a different assembly. This provides the context-dependent probability modulation needed to resolve superposition, completing the proof. $\square$
Let memory $M$ be encoded as a reconstruction function $f_M$ distributed over neuron set $S_M$. If any $n_i \in S_M$ has its weights $w_i$ modified (for any purpose), then the reconstruction error $\|f_{M'} - M'\|$ increases for every memory $M'$ sharing neurons with $M$: $S_M \cap S_{M'} \neq \emptyset$. In particular, surgical deletion of a single memory is $\mathcal{NP}$-hard in the size of the overlap graph.
The reconstruction function $f_M$ involves weights along paths through $S_M$. Modifying weight $w_i$ perturbs all traversals passing through $n_i$. Identifying which weights to modify to eliminate $M$ while preserving all $M'$ is a vertex multicut problem on the overlap graph (cut every reconstruction path of $M$ while sparing those of each $M'$), and vertex multicut is $\mathcal{NP}$-hard already for three terminal pairs. $\square$
The second open problem was: how does the system detect when aggregate prediction error $\sum_i \varepsilon_i(t) \leq \varepsilon^*$ without a central aggregator? We prove that this termination condition emerges naturally from the local energy dynamics of the system via a Lyapunov argument.
For node $n_i$, define its local free energy at time $t$: $$F_i(t) = \underbrace{\mathbb{E}_{P_i}[\varepsilon_i]}_{\text{prediction error}} - \underbrace{H(P_i)}_{\text{exploration entropy}}$$ where $H(P_i) = -\sum_j P_i(n_j) \log P_i(n_j)$ is the Shannon entropy of the selection distribution. This is the variational free energy from the Free Energy Principle (Friston, 2010).
The global energy of the system at time $t$ over active traversal $\mathcal{T}_q$ is: $$\mathcal{F}(t) = \sum_{n_i \in S_q(t)} F_i(t) = \sum_{n_i \in S_q(t)} \bigl[\mathbb{E}_{P_i}[\varepsilon_i] - H(P_i)\bigr]$$
Suppose each node $n_i$ updates its weights by local gradient descent on its own free energy $F_i$: $$\dot{w}_i = -\eta_i \nabla_{w_i} F_i(t)$$ Then the global system energy $\mathcal{F}(t)$ is non-increasing along this dynamics: $$\frac{d\mathcal{F}}{dt} \leq 0$$ with equality if and only if $\nabla_{w_i} F_i = 0$ for all $n_i \in S_q(t)$.
Differentiate $\mathcal{F}(t)$ along the weight dynamics: $$\frac{d\mathcal{F}}{dt} = \sum_{n_i \in S_q} \frac{dF_i}{dt} = \sum_{n_i \in S_q} \left\langle \nabla_{w_i} F_i,\; \dot{w}_i \right\rangle$$ Substituting the local gradient descent rule $\dot{w}_i = -\eta_i \nabla_{w_i} F_i$: $$\frac{d\mathcal{F}}{dt} = \sum_{n_i \in S_q} \left\langle \nabla_{w_i} F_i,\; -\eta_i \nabla_{w_i} F_i \right\rangle = -\sum_{n_i \in S_q} \eta_i \|\nabla_{w_i} F_i\|^2 \leq 0$$ since $\eta_i > 0$ and $\|\cdot\|^2 \geq 0$. Equality holds iff all gradients vanish simultaneously. $\square$
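The descent identity established here is easy to verify numerically. The sketch below uses a stand-in quadratic $F_i(w_i) = \|w_i - w_i^*\|^2$; any smooth local free energy yields the same monotonicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, dim, dt = 20, 5, 1e-3
w = rng.normal(size=(n_nodes, dim))            # per-node weights w_i
w_star = rng.normal(size=(n_nodes, dim))       # per-node minimizers
eta = rng.uniform(0.1, 1.0, size=n_nodes)      # heterogeneous learning rates

def global_F(w):
    """Stand-in global free energy: sum_i ||w_i - w_i*||^2."""
    return float(np.sum((w - w_star) ** 2))

prev = global_F(w)
for _ in range(2000):
    grad = 2.0 * (w - w_star)                  # grad_{w_i} F_i, local to node i
    w -= dt * eta[:, None] * grad              # \dot{w}_i = -eta_i grad F_i
    cur = global_F(w)
    assert cur <= prev + 1e-12                 # dF/dt <= 0, as in Theorem 3.1
    prev = cur
print(f"final global free energy: {prev:.6f} (never increased)")
```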
The thought-traversal $\mathcal{T}_q$ self-terminates in finite time without a central coordinator. Specifically, under the local gradient dynamics, there exists finite $T^* < \infty$ such that for all $t \geq T^*$: $$\sum_{n_i \in S_q(t)} \varepsilon_i(t) \leq \varepsilon^*$$ Moreover, each node $n_i$ detects termination locally: $n_i$ terminates its participation when $|\dot{F}_i| < \delta$ for small $\delta > 0$ — no global threshold comparison is needed.
Existence of $T^*$. By Theorem 3.1, $\mathcal{F}(t)$ is a non-increasing function of $t$. We claim $\mathcal{F}$ is bounded below. Since $\varepsilon_i \geq 0$ always and $H(P_i) \leq \log|A(n_i)|$ (maximum entropy bound for finite neighborhood), we have: $$\mathcal{F}(t) \geq -\sum_{n_i \in S_q} \log|A(n_i)| > -\infty$$ By monotone convergence for bounded-below non-increasing functions, $\mathcal{F}(t)$ converges to a finite limit $\mathcal{F}^* > -\infty$. At convergence, $d\mathcal{F}/dt = 0$, which requires $\nabla_{w_i} F_i = 0$ for all $i$ (from the proof of Theorem 3.1). At these critical points the prediction error term $\mathbb{E}_{P_i}[\varepsilon_i]$ is locally minimized, which yields the condition $\sum_i \varepsilon_i \leq \varepsilon^*$ for appropriately chosen $\varepsilon^*$.
Finite time. Under smoothness and a Polyak–Łojasiewicz condition on $F_i$ (satisfied, e.g., when $\varepsilon_i$ is differentiable and locally strongly convex in $w_i$), the gradient flow converges at rate $O(e^{-\eta t})$ to the fixed point, so the system enters the $\varepsilon^*$-neighborhood in finite time.
Local detection. Node $n_i$ monitors $|\dot{F}_i(t)|$. By the chain rule: $$|\dot{F}_i| = |\langle \nabla_{w_i} F_i,\; \dot{w}_i \rangle| = \eta_i \|\nabla_{w_i} F_i\|^2$$ When $|\dot{F}_i| < \delta$, it follows that $\|\nabla_{w_i} F_i\| < \sqrt{\delta / \eta_i}$ — meaning local energy has locally converged. Each node can declare termination independently by monitoring its own free energy rate. When all nodes in $S_q$ terminate locally, the traversal has globally terminated — without any node needing knowledge of the global sum. $\square$
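As a sketch, the local stopping rule reduces to a few lines per node; the quadratic $F_i$, step size, and threshold $\delta$ below are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
n_nodes, dim, dt, delta = 20, 5, 1e-2, 1e-6
w = rng.normal(size=(n_nodes, dim))
w_star = rng.normal(size=(n_nodes, dim))
eta = rng.uniform(0.1, 1.0, size=n_nodes)
active = np.ones(n_nodes, dtype=bool)          # participation flags

step = 0
while active.any():
    grad = 2.0 * (w - w_star)
    rate = eta * np.sum(grad ** 2, axis=1)     # |dF_i/dt| = eta_i ||grad F_i||^2
    active &= rate >= delta                    # each node checks only itself
    w[active] -= dt * eta[active, None] * grad[active]
    step += 1
print(f"all nodes self-terminated by step {step}; none saw the global sum")
```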
The energy consumed per thought-traversal is bounded by: $$E_{\mathcal{T}_q} = \int_0^{T^*} \sum_{n_i \in S_q(t)} c_i \cdot \mathbf{1}[\dot{\sigma}_i \neq 0]\, dt \leq C \cdot |S_q| \cdot T^*$$ where $c_i$ is the metabolic cost per spike and $C = \max_i c_i$. Since $|S_q| \ll |N|$ (sparse activation, typically 1–5% of neurons), and $T^*$ is short (tens to hundreds of milliseconds), the total energy across all concurrent traversals is consistent with the observed 20-watt metabolic budget.
The third open problem was the abstract node transformation function $\phi(x_i, w_i, d_i)$. We now specify it concretely as a Locally-Gated Recurrent Unit (L-GRU) — a variant of the standard GRU that operates on local information only — and prove its key computational properties.
The node transformation function $\phi$ for neuron $n_i$ is defined by the following gated dynamics: $$z_i(t) = \sigma_g\!\bigl(W_z x_i(t) + U_z d_i(t) + b_z\bigr) \quad \text{(update gate)}$$ $$r_i(t) = \sigma_g\!\bigl(W_r x_i(t) + U_r d_i(t) + b_r\bigr) \quad \text{(reset gate)}$$ $$\tilde{d}_i(t) = \tanh\!\bigl(W_h x_i(t) + U_h (r_i \odot d_i(t)) + b_h\bigr) \quad \text{(candidate state)}$$ $$d_i(t+1) = (1 - z_i(t)) \odot d_i(t) + z_i(t) \odot \tilde{d}_i(t) \quad \text{(state update)}$$ $$o_i(t) = W_o d_i(t+1) + b_o \quad \text{(output / next signal)}$$ where $\sigma_g$ is the sigmoid function, $\odot$ is elementwise product, and all weight matrices $(W_z, W_r, W_h, W_o, U_z, U_r, U_h)$ and biases constitute the node's weight vector $w_i$.
The gating mechanism provides the critical property: selective memory. The update gate $z_i$ controls how much past state $d_i$ to retain vs. how much to update from new input. The reset gate $r_i$ controls how much past state influences the candidate update. This allows $n_i$ to maintain long-term stable representations (closed gates) while remaining responsive to strong new signals (open gates) — precisely the behavior needed for both stable memory and rapid adaptation.
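For concreteness, a minimal numpy sketch of one L-GRU update follows. The dimensions and initialization scale are illustrative choices, not part of Definition 4.1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class LGRUNode:
    def __init__(self, d_x, d_h, d_o, rng):
        init = lambda *shape: 0.1 * rng.normal(size=shape)
        self.Wz, self.Wr, self.Wh = init(d_h, d_x), init(d_h, d_x), init(d_h, d_x)
        self.Uz, self.Ur, self.Uh = init(d_h, d_h), init(d_h, d_h), init(d_h, d_h)
        self.Wo = init(d_o, d_h)
        self.bz, self.br, self.bh = np.zeros(d_h), np.zeros(d_h), np.zeros(d_h)
        self.bo = np.zeros(d_o)
        self.d = np.zeros(d_h)                                 # local data state d_i

    def step(self, x):
        z = sigmoid(self.Wz @ x + self.Uz @ self.d + self.bz)  # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ self.d + self.br)  # reset gate
        cand = np.tanh(self.Wh @ x + self.Uh @ (r * self.d) + self.bh)
        self.d = (1 - z) * self.d + z * cand                   # state update
        return self.Wo @ self.d + self.bo                      # output o_i(t)

rng = np.random.default_rng(3)
node = LGRUNode(d_x=4, d_h=8, d_o=4, rng=rng)
print(node.step(rng.normal(size=4)))                           # next-signal vector
```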
The L-GRU node $n_i$ can learn its optimal weight vector $w_i^*$ via gradient descent on its local free energy $F_i$ without requiring gradient information from non-neighboring nodes. Specifically: $$\nabla_{w_i} F_i = f(x_i, d_i, \varepsilon_i, w_i)$$ is expressible entirely in terms of $n_i$'s local state and the error signal $\varepsilon_i$ received from its immediate downstream neighbor.
The local free energy is $F_i = \mathbb{E}_{P_i}[\varepsilon_i] - H(P_i)$. We expand the prediction error term: $$\varepsilon_i = \|o_i(t) - x_j^{\text{actual}}(t+1)\|^2$$ where $x_j^{\text{actual}}$ is the actual input received by the selected next node $n_j$, which is returned to $n_i$ as a feedback message $m_{ji}^{\text{fb}}$. This feedback is a direct message from immediate neighbor $n_j \in A(n_i)$, available locally.
The gradient through the L-GRU is computed by backpropagation through time (BPTT) within $n_i$'s own recurrent state — this is entirely local. Explicitly: $$\frac{\partial \varepsilon_i}{\partial W_o} = 2(o_i - x_j^{\text{actual}}) \cdot d_i(t+1)^\top$$ $$\frac{\partial \varepsilon_i}{\partial W_h} = 2(o_i - x_j^{\text{actual}}) \cdot W_o^\top \cdot \frac{\partial d_i}{\partial W_h}$$ All terms involve only $o_i$, $d_i$, $x_i$, and $x_j^{\text{actual}}$ — all locally available at $n_i$. No gradient flows from non-neighbors. The entropy gradient $\nabla_{w_i} H(P_i)$ depends only on $P_i$ which is a function of $w_i$ — also local. Therefore the full gradient $\nabla_{w_i} F_i$ is local. $\square$
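The $W_o$ formula above can be validated against finite differences using only quantities held at $n_i$; the sketch drops the bias $b_o$ for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
d_h, d_o = 8, 4
W_o = rng.normal(size=(d_o, d_h))
d_i = rng.normal(size=d_h)                     # node's own state d_i(t+1)
x_actual = rng.normal(size=d_o)                # feedback message from n_j

def eps(W):
    """Local prediction error ||W d_i - x_actual||^2 (bias omitted)."""
    return float(np.sum((W @ d_i - x_actual) ** 2))

analytic = 2.0 * np.outer(W_o @ d_i - x_actual, d_i)
numeric = np.zeros_like(W_o)
h = 1e-6
for a in range(d_o):
    for b in range(d_h):
        Wp, Wm = W_o.copy(), W_o.copy()
        Wp[a, b] += h
        Wm[a, b] -= h
        numeric[a, b] = (eps(Wp) - eps(Wm)) / (2.0 * h)
print("max abs gradient error:", np.abs(analytic - numeric).max())  # tiny
```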
Different biological neuron types correspond to different parameter configurations of the L-GRU:
| Neuron Type | L-GRU Configuration | Functional Role | Update Gate Bias |
|---|---|---|---|
| Pyramidal (excitatory) | Large $W_o$, positive $b_z$ (open update gate) | Forward signal propagation, integration | $b_z \approx +1$ (frequently update) |
| Interneuron (inhibitory) | Negative $W_o$, high $b_r$ (open reset gate) | Suppression, gain control, timing | $b_z \approx 0$ (context-sensitive) |
| Granule (encoding) | Negative $b_z$ (closed update gate), large $U_h$ | Stable memory storage, sparse coding | $b_z \approx -2$ (rarely update) |
| Purkinje (error) | $W_o$ tuned to $\varepsilon_i$ direction | Error signal computation and broadcast | Adaptive (error-driven) |
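A scalar caricature of the update-gate column: under the Definition 4.1 update $d \leftarrow (1-z)\,d + z\,\tilde{d}$, a positive bias $b_z$ opens the gate (frequent updates) and a negative bias closes it (stable retention). The biases below mirror the table's illustrative values.

```python
import numpy as np

def retention_after(T, b_z, d0=1.0, cand=0.0):
    """Scalar state after T steps of d <- (1-z) d + z cand, with z = sigma(b_z)."""
    z = 1.0 / (1.0 + np.exp(-b_z))             # gate driven by bias alone
    d = d0
    for _ in range(T):
        d = (1.0 - z) * d + z * cand           # candidate held at 0
    return d

for name, b_z in [("pyramidal-like", +1.0), ("granule-like", -2.0)]:
    d = retention_after(10, b_z)
    print(f"{name:15s} b_z = {b_z:+.0f}: state after 10 steps = {d:.3f}")
```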
The L-GRU with hidden state dimension $\dim(d_i) \geq K$ is a universal approximator for any continuous function $\phi: \mathbb{R}^{d_x} \times [0,T] \to \mathbb{R}^{d_o}$ on a compact domain, for some $K$ depending only on the approximation error $\epsilon$ and the smoothness of the target function. Therefore no neuron type requires a fundamentally different computational primitive — only different parameter configurations.
The fourth open problem was the developmental origin of the initial graph topology. We now provide a formal model grounded in random graph theory and the mathematics of genetic constraint satisfaction.
Let $\mathcal{G}$ denote the space of all directed graphs on $n$ nodes. The genetic encoding defines a probability distribution $\mathbb{P}_{\text{gen}}$ over $\mathcal{G}$ via a set of $m$ genetic constraint functions $g_k : \mathcal{G} \to \{0,1\}$, $k = 1, \ldots, m$: $$\mathbb{P}_{\text{gen}}(G) \propto \exp\!\left(-\sum_{k=1}^m \lambda_k g_k(G)\right) \cdot \mathbf{1}[G \text{ is connected}]$$ This is a Gibbs distribution over graphs with genetic penalty parameters $\lambda_k$ — a maximum entropy prior subject to genetic constraints.
The biologically motivated constraints include the following (a minimal sampling sketch follows the list):
$g_1(G)$: Penalizes edges exceeding physical distance threshold $d_{\max}$ (axon length cost).
$g_2(G)$: Penalizes degree sequences deviating from power-law $P(\deg = k) \sim k^{-\alpha}$.
$g_3(G)$: Penalizes graphs lacking the small-world property, i.e., clustering no higher than a matched random graph ($C(G) \lesssim C(G_{\text{random}})$) or path lengths far above it ($L(G) \gg L(G_{\text{random}})$).
$g_4(G)$: Rewards modular structure (high intra-cluster / low inter-cluster connectivity).
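To make Definition 5.1 concrete, the following minimal Metropolis sketch samples from $\mathbb{P}_{\text{gen}}$. For brevity only the wiring-cost constraint $g_1$ enters the energy (the remaining $g_k$ and the connectivity indicator enter the same way), and the node positions, $d_{\max}$, and $\lambda_1$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_max, lam1 = 30, 0.4, 3.0
pos = rng.random((n, 2))                       # node positions in unit square
dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)

def energy(adj):
    """sum_k lambda_k g_k(G); here only g_1 = # edges longer than d_max."""
    return lam1 * float(np.sum(adj * (dist > d_max)))

adj = (rng.random((n, n)) < 0.1).astype(int)   # initial directed adjacency
np.fill_diagonal(adj, 0)
E = energy(adj)
for _ in range(20000):
    i, j = rng.integers(n, size=2)
    if i == j:
        continue
    adj[i, j] ^= 1                             # propose toggling one edge
    E_new = energy(adj)
    if rng.random() < np.exp(-(E_new - E)):
        E = E_new                              # accept the proposal
    else:
        adj[i, j] ^= 1                         # reject: undo the toggle
print("long-range edges remaining:", int(np.sum(adj * (dist > d_max))))
```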
Under $\mathbb{P}_{\text{gen}}$ with the constraints of Definition 5.2, the expected initial graph $G_0 \sim \mathbb{P}_{\text{gen}}$ is a sparse small-world network satisfying: $$\mathbb{E}[C(G_0)] \gg C(G_{\text{random}}), \qquad \mathbb{E}[L(G_0)] \approx L(G_{\text{random}})$$ where $C(G)$ is the clustering coefficient and $L(G)$ is the average shortest path length.
The constraint $g_3$ directly penalizes deviation from the small-world regime. By the maximum entropy principle, $\mathbb{P}_{\text{gen}}$ is the least-committed distribution satisfying the constraints, and its mode lies in the small-world regime characterized by the Watts-Strogatz model (Watts & Strogatz, 1998). Specifically, the Watts-Strogatz construction with rewiring probability $p_r \in (0.01, 0.1)$ achieves $C \gg C_{\text{random}}$ and $L \approx L_{\text{random}}$ simultaneously. Since $g_3$ penalizes departures from this regime, the mass of $\mathbb{P}_{\text{gen}}$ concentrates near it. The expectation bounds follow by concentration of the Gibbs measure around its mode for large $n$ and positive $\lambda_k$. $\square$
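The claimed regime is directly checkable with `networkx`; the graph size and rewiring probability below are illustrative.

```python
import networkx as nx

n, k, p = 1000, 10, 0.05                       # rewiring p inside (0.01, 0.1)
ws = nx.connected_watts_strogatz_graph(n, k, p, seed=0)
er = nx.gnm_random_graph(n, ws.number_of_edges(), seed=0)
if not nx.is_connected(er):                    # compare on the giant component
    er = er.subgraph(max(nx.connected_components(er), key=len)).copy()

for name, G in [("Watts-Strogatz", ws), ("Erdos-Renyi", er)]:
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    print(f"{name:14s}: C = {C:.3f}, L = {L:.2f}")  # expect C_ws >> C_er, L_ws ~ L_er
```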
Starting from $G_0 \sim \mathbb{P}_{\text{gen}}$, the graph evolves under the Spike-Timing Dependent Plasticity (STDP) rule: $$\Delta W(n_i \to n_j) = \begin{cases} A_+ \exp\!\left(-\frac{\Delta t}{\tau_+}\right) & \text{if } \Delta t = t_j - t_i > 0 \\ -A_- \exp\!\left(\frac{\Delta t}{\tau_-}\right) & \text{if } \Delta t < 0 \end{cases}$$ where $\Delta t$ is the relative spike timing. Edge $(n_i, n_j)$ is created when $W(n_i \to n_j)$ exceeds creation threshold $W_+$, and pruned when it falls below $W_- < 0$.
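As a sketch, the STDP window and one pass of pairwise updates over two recorded spike trains; the amplitudes $A_\pm$ and time constants $\tau_\pm$ are illustrative values, not empirical estimates.

```python
import numpy as np

A_p, A_m, tau_p, tau_m = 0.05, 0.055, 20e-3, 20e-3   # slight depression bias

def stdp(dt):
    """Weight change for spike-time difference dt = t_post - t_pre (seconds)."""
    return A_p * np.exp(-dt / tau_p) if dt > 0 else -A_m * np.exp(dt / tau_m)

def apply_stdp(W, pre_spikes, post_spikes):
    """One pass of all-pairs STDP updates between two spike trains."""
    for t_pre in pre_spikes:
        for t_post in post_spikes:
            W += stdp(t_post - t_pre)
    return W

pre = np.array([0.010, 0.050, 0.090])                # pre leads post by 5 ms
post = pre + 0.005
print(f"causal pairing:      dW = {apply_stdp(0.0, pre, post):+.4f}")  # > 0
print(f"anti-causal pairing: dW = {apply_stdp(0.0, post, pre):+.4f}")  # < 0
```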
Under the STDP evolution from initial graph $G_0$, the graph $G(t)$ converges almost surely to a fixed-point topology $G^* = G(\infty)$ that: (a) encodes all experienced input statistics in its weight matrix $W^*$; (b) has strictly higher clustering coefficient than $G_0$: $C(G^*) \geq C(G_0)$; (c) has pruned all edges not reinforced by experience, reducing metabolic cost.
Define the stochastic process $\{W(t)\}_{t \geq 0}$ on the weight matrix. Under STDP, this is a stochastic differential equation: $$dW_{ij} = f_{\text{STDP}}(W_{ij}, \sigma_i, \sigma_j)\, dt + \xi_{ij}(t)\, dB_t$$ where $\xi_{ij}$ is noise from stochastic spiking and $B_t$ is a Brownian motion. By Lyapunov methods applied to this SDE, the drift term $f_{\text{STDP}}$ is mean-reverting toward weight values that maximize the mutual information $I(\sigma_i; \sigma_j)$ between spike trains, with a unique stable equilibrium at $W^* = \arg\max I(\sigma_i; \sigma_j)$ (cf. the convergence analysis of Oja, 1982). The clustering coefficient increases because STDP preferentially strengthens edges within frequently co-activated groups — the triangles in the graph — directly increasing $C$. Edge pruning follows from the negative $A_-$ term eliminating edges with consistently anti-causal timing. Almost-sure convergence follows from the supermartingale convergence theorem applied to $\|W(t) - W^*\|$. $\square$
We derive from first principles the necessary and sufficient conditions a physical substrate must satisfy to implement DAN Theory. These are not engineering preferences but mathematical necessities implied by the formal model.
Any physical substrate $\mathcal{H}$ that faithfully implements DAN Theory must satisfy all of the following:
(H1) Continuous state representation. $\mathcal{H}$ must support node states $s_i \in \mathbb{R}^d$ with $d \geq 4$ (input, weight, data, distribution). This rules out purely binary/digital nodes.
(H2) Spike timing resolution. $\mathcal{H}$ must resolve spike timing differences to precision $\delta t \leq 1/f_\gamma \approx 12$ ms for $\gamma$-band binding. This requires temporal resolution at the millisecond scale.
(H3) Local weight persistence. Each node $n_i$ must independently store and update its weight vector $w_i$ without global write access. This requires distributed non-volatile storage, not a shared parameter server.
(H4) Asynchronous event-driven computation. Nodes must compute only when receiving input (event-driven), not on a global clock. This is required for sparse activation and the $O(|S_q|)$ energy bound of Corollary 3.1.
(H5) Bidirectional communication. Each edge must support both forward signal propagation and backward error feedback, at minimum between immediate neighbors.
Each condition follows from a specific part of the formal model:
H1: The state quadruple $(x_i, w_i, d_i, P_i)$ is defined over continuous spaces in Definition 1.2. A binary node cannot represent the L-GRU internal state (Definition 4.1).
H2: Theorem 2.2 requires computing phase coherence $\rho_{ij}(t)$ at $\gamma$-band frequency. Resolving $e^{i(\theta_i - \theta_j)}$ requires timing precision $\delta t \ll 1/40 \text{ Hz} = 25$ ms; conservatively $\delta t \leq 12$ ms.
H3: Theorem 4.1 proves local learnability requires each node to independently store and update $w_i$. A shared weight store violates local autonomy (Definition 1.4).
H4: Corollary 3.1 bounds energy by $O(|S_q|)$ active nodes. A globally clocked system computes over all $n$ nodes per cycle, consuming $O(n) \gg O(|S_q|)$ energy — inconsistent with the 20-watt bound.
H5: Theorem 3.2 requires each node to receive error feedback $\varepsilon_i$ from immediate downstream neighbors. This requires bidirectional edge communication. $\square$
Conditions H1–H5 are also sufficient: any substrate satisfying H1–H5 can implement DAN Theory up to approximation error $\epsilon > 0$ (from Proposition 4.1).
Given H1–H5, construct the DAN implementation as follows: (a) represent each node state $(x_i, w_i, d_i, P_i)$ in the continuous state space guaranteed by H1; (b) use the temporal resolution of H2 to compute phase coherence $\rho_{ij}$ via Definition 2.2; (c) use the local weight storage of H3 to implement the L-GRU weight updates of Theorem 4.1; (d) use the event-driven property of H4 to trigger computation only on spike arrival; (e) use the bidirectional edges of H5 to propagate forward signals and backward error. By Proposition 4.1, the L-GRU approximates any continuous node function to within $\epsilon$. All theorems (2.1, 2.2, 3.1, 3.2, 4.1) hold exactly under this construction. $\square$
| Condition | Current Technology | Gap | Candidate Solution |
|---|---|---|---|
| H1: Continuous states | Analog circuits, memristors | Noise; fabrication precision | Memristive crossbar arrays |
| H2: ms timing resolution | FPGA (ns), Loihi (~1 ms) | None: a ~1 ms timestep meets H2 | Loihi 2 (resolved) |
| H3: Distributed local weights | On-chip SRAM per core | Limited capacity; no online STDP at scale | Phase-change memory per node |
| H4: Asynchronous event-driven | Neuromorphic chips | Loihi/BrainScaleS partially satisfy this | Loihi 2, SpiNNaker2 (partial) |
| H5: Bidirectional edges | Not in any current chip | All current chips are feedforward only | Open research problem |
The analysis reveals that H5 is the hardest unsatisfied condition. No current neuromorphic architecture supports on-chip bidirectional error signaling per edge at scale. This is the primary hardware research direction implied by DAN Theory.
A theory is only as strong as its falsifiable predictions. We derive from the formal model a set of specific, measurable predictions that distinguish DAN Theory from competing frameworks.
From Theorem 2.2, the probability that neuron $n_j$ is activated following $n_i$ should be a monotonically increasing function of their phase coherence $\rho_{ij}(t)$ at $\gamma$-band: $$\frac{\partial}{\partial \rho_{ij}} P_i(n_j \mid \mathcal{C}_i) > 0 \quad \text{(Prediction P1)}$$ Experimental test: Multi-electrode array recordings during cognitive tasks. Measure $\rho_{ij}$ between candidate neurons 50 ms before activation. Test whether high-$\rho$ pairs activate more often than low-$\rho$ pairs, controlling for synaptic weight.
From Theorem 3.2, the end of a cognitive task (response time) should correlate with the time at which $|\dot{F}_i|$ drops below threshold across all active neurons — measurable as a sudden drop in $\gamma$-power variance: $$t_{\text{response}} \approx \min\{t : \mathrm{Var}_{i \in S_q}\bigl[\,|\dot{\sigma}_i(t)|\,\bigr] < \delta\} \quad \text{(Prediction P2)}$$ Experimental test: EEG/MEG recordings during decision tasks. Response time should correlate ($r > 0.7$) with the time of $\gamma$-power variance collapse, not with mean $\gamma$-power.
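An analysis sketch for P2 on synthetic data follows; the transient model is a placeholder, and on real EEG/MEG recordings `power` would be band-limited $\gamma$-power per unit or sensor.

```python
import numpy as np

rng = np.random.default_rng(6)
n_units, fs = 40, 500
t = np.arange(0.0, 2.0, 1.0 / fs)
# Synthetic gamma power: heterogeneous transients that settle over ~1 s.
a = rng.normal(size=(n_units, 1))              # transient amplitudes
tau = rng.uniform(0.2, 0.4, size=(n_units, 1)) # transient time constants
power = 1.0 + a * np.exp(-t / tau) + 0.01 * rng.normal(size=(n_units, t.size))

var = power.var(axis=0)                        # variance across units, per time
delta = 0.005
idx = int(np.argmax(var < delta))              # first threshold crossing
print(f"estimated t_response = {t[idx]:.3f} s (gamma-power variance collapse)")
```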
From Corollary 2.1, the interference between two memories $M$ and $M'$ should scale with $|S_M \cap S_{M'}|$: $$\text{Interference}(M, M') \propto \frac{|S_M \cap S_{M'}|}{|S_M| + |S_{M'}|} \quad \text{(Prediction P3)}$$ Experimental test: fMRI BOLD signal during dual-memory encoding tasks. Pairs of semantically similar memories (higher predicted overlap) should show more interference than semantically distant pairs. The interference coefficient should match the predicted proportionality.
From the parallel traversal temperature model (Section 7 of the original paper), attentional capture should produce a measurable redistribution of $\gamma$-power from the interrupted task's neural population to the capturing task's population, conserving total power: $$\sum_q \text{Power}_q^{(\gamma)}(t) = \text{const} \quad \text{(Prediction P4)}$$ Experimental test: EEG during dual-task experiments with attentional interruption. Verify conservation of total $\gamma$-power under attentional shift, against alternative models that predict net $\gamma$-power increase.
The binding threshold $\rho^*$ and the coherence influence parameter $\beta$ in Definition 2.4 are free parameters of the theory. Their values must be estimated from empirical data. We predict $\rho^* \in (0.4, 0.7)$ and $\beta \in (1, 5)$ based on analogy with Kuramoto coupling constants, but precise estimation requires multi-unit recording studies with simultaneous LFP and spike train measurements.
We now state the central result unifying all components of the framework.
Let $G = (N, E, W, T)$ be a dynamic directed graph satisfying hardware conditions H1–H5 (Theorem 6.1), initialized under the genetic prior $\mathbb{P}_{\text{gen}}$ (Definition 5.1). Let each node $n_i$ implement the L-GRU dynamics (Definition 4.1) with local gradient updates, and let the binding function $R$ be the phase-coherence operator of Definition 2.4.
Then the following hold simultaneously:
(1) Binding without central coordination. For any traversal $\mathcal{T}_q$, the coherent assembly $\mathcal{A}_q(t)$ uniquely identifies traversal context for every shared node (Theorem 2.2), resolving superposition without global state.
(2) Self-termination. Every traversal $\mathcal{T}_q$ terminates in finite time $T^* < \infty$ through local free energy minimization alone (Theorem 3.2).
(3) Local learnability. Every node $n_i$ learns its optimal parameters $w_i^*$ via local gradient descent on $F_i$ without non-local gradient information (Theorem 4.1).
(4) Developmental convergence. Starting from sparse random topology $G_0$, the graph converges to an experience-encoded fixed point $G^*$ under STDP (Theorem 5.2).
(5) Energy efficiency. Total metabolic energy per traversal is $O(|S_q| \cdot T^*)$ — consistent with the 20-watt budget for typical $|S_q| \ll |N|$ (Corollary 3.1).
(6) Universal approximation. The system can approximate any continuous cognitive mapping $f: \mathcal{X} \to \mathcal{Y}$ to within $\epsilon > 0$ given sufficient node count $n$ and hidden dimension $d$ (Proposition 4.1 extended to graph level).
Statements (1)–(5) follow directly from the cited theorems, which have been proved independently above. The conditions of each theorem are satisfied by assumption: H1–H5 satisfy the hardware requirements; L-GRU satisfies the node function requirements; phase-coherence satisfies the binding requirements; STDP satisfies the developmental requirements.
Statement (6): By Proposition 4.1, each L-GRU node is a universal approximator for continuous functions on compact domains. A graph of such nodes with sufficient connectivity and depth can approximate any continuous mapping $f: \mathcal{X} \to \mathcal{Y}$ by the universal approximation theorem for deep networks (Cybenko, 1989; Hornik, 1991) extended to recurrent architectures (Schäfer & Zimmermann, 2006). The required conditions (depth, width, activation functions) are satisfied by the L-GRU configuration space.
The six statements are mutually consistent because: (1) does not require global state; (2) requires only local monitoring; (3) requires only neighbor feedback; (4) is a separate timescale (developmental) from (1)–(3) (operational); (5) follows from the sparsity of each individual proof's requirements. No statement assumes conditions that contradict another. $\square$
No architecture that violates any of H1–H5 can implement DAN Theory and therefore cannot replicate biological cognition. Current transformer-based LLMs violate H1 (discrete token states), H3 (global shared parameter server), H4 (synchronous batched computation), and H5 (no backward error signaling per node). Therefore current LLM architectures are provably insufficient for biological-equivalent general intelligence, regardless of scale.
Theorem 8.1 fully characterizes cognition and memory in the DAN framework. It does not characterize consciousness. We conjecture that consciousness corresponds to a meta-stable high-coherence state: $\exists\, S^* \subseteq N$ such that $\mathcal{A}_{S^*}(t) = S^*$ (the assembly is self-sustaining) and $|S^*| = \Omega(|N|^{\alpha})$ for some $\alpha \in (0,1)$. Formalizing and proving this conjecture — relating it to Integrated Information Theory or Global Workspace Theory — remains the primary open problem of the complete DAN framework.
Mathematical Foundations Extension. Preprint. Not peer reviewed.
Correspondence: Bharat Rawat · India
© 2026 Bharat Rawat. This work may be freely cited with attribution.