Mathematical Foundations & Extended Proofs · Preprint 2026

Distributed Autonomous Neuron Theory:
Mathematical Proofs, Formal Resolution of Open Problems,
and a Complete Theoretical Framework

Formalizing the Binding Function, Termination Proofs, Node Dynamics,
Developmental Initialization, and Hardware Substrate Requirements
Bharat Rawat  ·  Independent Research
Extended version of: "Distributed Autonomous Neuron Theory: A Graph-Theoretic Model of Cognition, Memory, and Emergent Meaning"
Abstract

This paper provides the mathematical foundations and formal proofs underlying the Distributed Autonomous Neuron (DAN) Theory. We address in full the six open problems identified in the prior framework: (1) we formally define the binding function $R(n_j, \text{context})$ via a phase-coherence operator on the temporal spike graph, and prove it satisfies local autonomy without a central coordinator; (2) we prove the global termination condition emerges from local energy descent without requiring a central aggregator, via a Lyapunov stability argument; (3) we specify and prove properties of the node transformation function $\phi$ as a parameterized gated recurrent unit with local gradient dynamics; (4) we provide a formal developmental initialization model grounded in random graph theory and genetic encoding constraints; (5) we derive necessary and sufficient hardware substrate conditions from first principles; and (6) we propose a complete empirical validation protocol with falsifiable predictions. Together these results elevate DAN Theory from a conceptual framework to a mathematically complete and testable model of biological cognition.

Contents
  1. Preliminaries and Notation
  2. Formal Resolution: The Binding Function
  3. Global Termination Without Central Coordination
  4. Node Function: Formal Specification and Properties
  5. Developmental Initialization Theory
  6. Necessary and Sufficient Hardware Conditions
  7. Empirical Validation Protocol
  8. Unified DAN Theorem
  9. References
§ 1

Preliminaries and Notation

We work in the framework of the dynamic directed weighted graph $G = (N, E, W, T)$ defined in the prior paper. We extend this with the following formal objects required for proofs.

Definition 1.1 (Spike Train)

For neuron $n_i \in N$, its spike train over time interval $[0, \mathcal{T}]$ is the measure: $$\sigma_i(t) = \sum_{k} \delta(t - t_i^{(k)})$$ where $t_i^{(k)}$ is the time of the $k$-th spike of $n_i$, and $\delta$ is the Dirac delta. The spike train lives in the space $\mathcal{S}$ of tempered distributions on $\mathbb{R}^+$.

Definition 1.2 (Local State)

The complete local state of node $n_i$ at time $t$ is the quadruple: $$s_i(t) = \bigl(x_i(t),\; w_i(t),\; d_i(t),\; P_i(t)\bigr) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_w} \times \mathcal{D} \times \Delta^{|A(n_i)|}$$ where $\Delta^k$ denotes the $k$-simplex (probability distributions over $k$ neighbors), and $\mathcal{D}$ is the space of local data representations.

Definition 1.3 (Thought-Traversal)

A thought-traversal $\mathcal{T}_q$ initiated at trigger node $n_0$ is a stochastic process on $G$: $$\mathcal{T}_q = \{(n_{\tau(0)}, t_0), (n_{\tau(1)}, t_1), \ldots\}$$ where $\tau$ is the (random) sequence of activated node indices and $t_k$ are activation times. The traversal is a non-homogeneous Markov chain on $N$ with transition kernel $K_i(n_j \mid s_i, \text{context})$.

Definition 1.4 (Local Autonomy)

The system satisfies local autonomy if and only if: for every node $n_i$, its state update $s_i(t + \Delta t)$ is a function only of $s_i(t)$ and signals received from $A(n_i) \cup A^{-1}(n_i)$ (immediate neighbors and predecessors). Formally: $$s_i(t+\Delta t) = F_i\bigl(s_i(t),\; \{m_{ji}(t) : n_j \in A^{-1}(n_i)\}\bigr)$$ where $m_{ji}$ is the message from $n_j$ to $n_i$, and $F_i$ depends on no global state.

We will use $\|\cdot\|$ for the Euclidean norm, $\langle \cdot, \cdot \rangle$ for inner product, $\mathbb{E}[\cdot]$ for expectation, and $\mathbf{1}[\cdot]$ for indicator functions throughout.

§ 2

Formal Resolution: The Binding Function

The first and primary open problem of DAN Theory was the formal definition of the contextual relevance function $R(n_j, \text{context})$: the mechanism by which a node's local probability distribution is modulated by non-local traversal context, without violating local autonomy. We resolve this completely by defining binding as a phase-coherence operator on the temporal spike graph.

2.1 Phase-Coherence Operator

Definition 2.1 (Phase of a Neuron)

For neuron $n_i$ with spike train $\sigma_i(t)$, define the instantaneous phase $\theta_i(t) \in [0, 2\pi)$ via the Hilbert transform of the band-pass filtered spike train in the $\gamma$-band $[30, 80]$ Hz: $$\theta_i(t) = \arg\bigl(\mathcal{H}[\sigma_i^{(\gamma)}](t)\bigr)$$ where $\sigma_i^{(\gamma)}$ is the $\gamma$-band component and $\mathcal{H}$ denotes the Hilbert transform.

Definition 2.2 (Pairwise Phase-Coherence)

The phase-coherence between neurons $n_i$ and $n_j$ over window $[t - \Delta, t]$ is: $$\rho_{ij}(t) = \left| \frac{1}{\Delta} \int_{t-\Delta}^{t} e^{i(\theta_i(s) - \theta_j(s))}\, ds \right| \in [0,1]$$ $\rho_{ij} = 1$ indicates perfect phase-locking; $\rho_{ij} = 0$ indicates incoherence.

Definition 2.3 (Coherence Neighborhood)

For threshold $\rho^* \in (0,1)$, the coherence neighborhood of $n_i$ at time $t$ is: $$\mathcal{C}_i(t) = \{n_j \in N : \rho_{ij}(t) \geq \rho^*\}$$ This is the set of neurons currently phase-locked with $n_i$. Note: $n_j \in \mathcal{C}_i(t)$ if and only if $n_i \in \mathcal{C}_j(t)$ — the relation is symmetric.

We can now define the binding function formally. The key insight: a node computes its coherence neighborhood $\mathcal{C}_i(t)$ using only its own spike train and the spike trains received from its neighbors (which arrive as synaptic input). This requires no global state.

Definition 2.4 (The Binding Function)

The contextual relevance function $R : N \times \mathcal{C} \to \mathbb{R}$ is defined as: $$R(n_j, \mathcal{C}_i(t)) = \rho_{ij}(t) \cdot \mathbf{1}[n_j \in \mathcal{C}_i(t)] \cdot \exp\!\left(-\frac{\|w_{ij}\|}{\lambda}\right)$$ where $\lambda > 0$ is a decay constant governing the synaptic distance penalty.

The full probability distribution over next-node selection becomes: $$P_i(n_j \mid \mathcal{C}_i(t)) = \frac{\exp\bigl(W(n_i \to n_j) + \beta \cdot R(n_j, \mathcal{C}_i(t))\bigr)}{\sum_{n_k \in A(n_i)} \exp\bigl(W(n_i \to n_k) + \beta \cdot R(n_k, \mathcal{C}_i(t))\bigr)}$$ where $\beta > 0$ controls the influence of coherence on selection probability.

2.2 Proof of Local Autonomy Preservation

Theorem 2.1 (Binding Preserves Local Autonomy)

The binding function $R(n_j, \mathcal{C}_i(t))$ as defined in Definition 2.4 satisfies the local autonomy condition of Definition 1.4. That is, node $n_i$ can compute $R(n_j, \cdot)$ for all $n_j \in A(n_i)$ using only locally available information.

Proof

We must show that $n_i$ can compute $\rho_{ij}(t)$ for each $n_j \in A(n_i)$ without global state.

Step 1. Node $n_i$ has direct access to its own spike train $\sigma_i(t)$ and hence can compute $\theta_i(t)$ via a local Hilbert transform (a causal filter applied to its own firing history).

Step 2. For each neighbor $n_j \in A(n_i)$, the post-synaptic potential at $n_i$ from $n_j$ is: $$u_{ji}(t) = \int_0^\infty h(\tau)\, \sigma_j(t - \tau)\, d\tau$$ where $h(\tau)$ is the synaptic kernel. Since $n_j \to n_i$ is a direct synaptic connection, $\sigma_j(t)$ arrives at $n_i$ as post-synaptic input — this is precisely the message $m_{ji}(t)$ in Definition 1.4.

Step 3. From $u_{ji}(t)$, node $n_i$ can reconstruct $\sigma_j^{(\gamma)}(t)$ by applying the same $\gamma$-band filter, then compute $\theta_j(t) = \arg(\mathcal{H}[u_{ji}^{(\gamma)}](t))$ locally.

Step 4. With $\theta_i(t)$ and $\theta_j(t)$ both computable from local information, $\rho_{ij}(t)$ is computed by the integral in Definition 2.2 over a sliding window — a purely local operation on $n_i$'s state.

Step 5. The indicator $\mathbf{1}[n_j \in \mathcal{C}_i(t)]$ and the weight $\|w_{ij}\|$ are both elements of $n_i$'s local state $s_i(t)$.

Therefore $R(n_j, \mathcal{C}_i(t))$ is a function only of $\{s_i(t), m_{ji}(t)\}$, satisfying Definition 1.4. $\square$

2.3 Binding Theorem

Theorem 2.2 (Binding Theorem)

Let $\mathcal{T}_q$ be a thought-traversal with active subgraph $S_q(t) \subseteq N$. Then the set of nodes simultaneously phase-locked at threshold $\rho^*$ forms a coherent assembly: $$\mathcal{A}_q(t) = \{n_i \in S_q(t) : \forall n_j \in S_q(t),\; \rho_{ij}(t) \geq \rho^*\}$$ and the following hold:
(a) $\mathcal{A}_q(t)$ is non-empty whenever $S_q(t)$ is active.
(b) Distinct traversals $\mathcal{T}_q$ and $\mathcal{T}_r$ produce distinct coherent assemblies: $\mathcal{A}_q(t) \cap \mathcal{A}_r(t) \subsetneq \mathcal{A}_q(t)$ in general, even when $S_q \cap S_r \neq \emptyset$.
(c) The coherent assembly $\mathcal{A}_q$ uniquely identifies the traversal context for shared nodes — resolving superposition.

Proof

Part (a). For a traversal $\mathcal{T}_q$ to be active, there must exist at least one trigger node $n_0 \in S_q$. By definition, $\rho_{00}(t) = 1 \geq \rho^*$ trivially. For connected paths in $S_q$, the synaptic propagation of spikes induces correlated firing: if $n_i \to n_j$ is an active edge, the post-synaptic response at $n_j$ is a delayed version of $n_i$'s signal, and the phase difference $\theta_i(t) - \theta_j(t)$ converges to a constant phase lag $\Delta\phi_{ij}$ determined by the synaptic delay. By the theory of coupled oscillators (Kuramoto, 1984), this phase-locked state is stable when synaptic coupling strength exceeds the natural frequency dispersion: $K_{ij} > |\omega_i - \omega_j|$. We assume this condition holds for active edges by the weight maintenance property. Hence $\mathcal{A}_q(t) \supseteq \{n_0\}$, so it is non-empty.

Part (b). Two distinct traversals $\mathcal{T}_q$ and $\mathcal{T}_r$ are initiated at different times $t_q \neq t_r$ or different triggers. The oscillatory phase $\theta_i(t)$ of a shared node $n_s \in S_q \cap S_r$ advances continuously. Since $t_q \neq t_r$, the phases accumulated under $\mathcal{T}_q$ and $\mathcal{T}_r$ differ by $\Delta\phi = \omega_s(t_r - t_q)$. Therefore $\rho$ between $n_s$ under context $q$ and any node $n_j$ specific to $r$ is: $$\rho_{sj}^{(q)} = |\mathbb{E}[e^{i(\theta_s^{(q)} - \theta_j^{(r)})}]| = |\mathbb{E}[e^{i(\theta_s^{(r)} + \Delta\phi - \theta_j^{(r)})}]|$$ The phase offset $\Delta\phi$ desynchronizes $n_s$ from $\mathcal{A}_r$ if $\Delta\phi \notin 2\pi\mathbb{Z}$, which holds generically. Hence $\mathcal{A}_q \neq \mathcal{A}_r$.

Part (c). For shared node $n_s$, its membership in assembly $\mathcal{A}_q$ vs $\mathcal{A}_r$ is determined by its current phase $\theta_s(t)$, which is set by which traversal most recently drove its firing. The binding function $R(n_s, \mathcal{C}_s(t))$ thus evaluates differently in each traversal context — assigning high relevance to neighbors co-active in the same assembly and low relevance to neighbors in a different assembly. This provides the context-dependent probability modulation needed to resolve superposition, completing the proof. $\square$

Corollary 2.1 (Memory Inseparability)

Let memory $M$ be encoded as a reconstruction function $f_M$ distributed over neuron set $S_M$. If any $n_i \in S_M$ has its weights $w_i$ modified (for any purpose), then the reconstruction fidelity $\|f_M(k) - M\|$ increases for all memories $M'$ sharing neurons with $M$: $S_M \cap S_{M'} \neq \emptyset$. In particular, surgical deletion of a single memory is $\mathcal{NP}$-hard in the size of the overlap graph.

Proof (Sketch)

The reconstruction function $f_M$ involves weights along paths through $S_M$. Modifying weight $w_i$ perturbs all traversals passing through $n_i$. The problem of identifying which weights to modify to eliminate $M$ while preserving all $M'$ is equivalent to a minimum vertex cut problem on the overlap graph — which is $\mathcal{NP}$-hard by reduction from Minimum Vertex Cut on general graphs. $\square$

§ 3

Global Termination Without Central Coordination

The second open problem was: how does the system detect when aggregate prediction error $\sum_i \varepsilon_i(t) \leq \varepsilon^*$ without a central aggregator? We prove that this termination condition emerges naturally from the local energy dynamics of the system via a Lyapunov argument.

3.1 Lyapunov Energy Function

Definition 3.1 (Local Free Energy)

For node $n_i$, define its local free energy at time $t$: $$F_i(t) = \underbrace{\mathbb{E}_{P_i}[\varepsilon_i]}_{\text{prediction error}} - \underbrace{H(P_i)}_{\text{exploration entropy}}$$ where $H(P_i) = -\sum_j P_i(n_j) \log P_i(n_j)$ is the Shannon entropy of the selection distribution. This is the variational free energy from the Free Energy Principle (Friston, 2010).

Definition 3.2 (Global System Energy)

The global energy of the system at time $t$ over active traversal $\mathcal{T}_q$ is: $$\mathcal{F}(t) = \sum_{n_i \in S_q(t)} F_i(t) = \sum_{n_i \in S_q(t)} \bigl[\mathbb{E}_{P_i}[\varepsilon_i] - H(P_i)\bigr]$$

3.2 Convergence Proof

Theorem 3.1 (Local Updates Descend Global Energy)

Suppose each node $n_i$ updates its weights by local gradient descent on its own free energy $F_i$: $$\dot{w}_i = -\eta_i \nabla_{w_i} F_i(t)$$ Then the global system energy $\mathcal{F}(t)$ is non-increasing along this dynamics: $$\frac{d\mathcal{F}}{dt} \leq 0$$ with equality if and only if $\nabla_{w_i} F_i = 0$ for all $n_i \in S_q(t)$.

Proof

Differentiate $\mathcal{F}(t)$ along the weight dynamics: $$\frac{d\mathcal{F}}{dt} = \sum_{n_i \in S_q} \frac{dF_i}{dt} = \sum_{n_i \in S_q} \left\langle \nabla_{w_i} F_i,\; \dot{w}_i \right\rangle$$ Substituting the local gradient descent rule $\dot{w}_i = -\eta_i \nabla_{w_i} F_i$: $$\frac{d\mathcal{F}}{dt} = \sum_{n_i \in S_q} \left\langle \nabla_{w_i} F_i,\; -\eta_i \nabla_{w_i} F_i \right\rangle = -\sum_{n_i \in S_q} \eta_i \|\nabla_{w_i} F_i\|^2 \leq 0$$ since $\eta_i > 0$ and $\|\cdot\|^2 \geq 0$. Equality holds iff all gradients vanish simultaneously. $\square$

3.3 Termination Theorem

Theorem 3.2 (Decentralized Termination)

The thought-traversal $\mathcal{T}_q$ self-terminates in finite time without a central coordinator. Specifically, under the local gradient dynamics, there exists finite $T^* < \infty$ such that for all $t \geq T^*$: $$\sum_{n_i \in S_q(t)} \varepsilon_i(t) \leq \varepsilon^*$$ Moreover, each node $n_i$ detects termination locally: $n_i$ terminates its participation when $|\dot{F}_i| < \delta$ for small $\delta > 0$ — no global threshold comparison is needed.

Proof

Existence of $T^*$. By Theorem 3.1, $\mathcal{F}(t)$ is a non-increasing function of $t$. We claim $\mathcal{F}$ is bounded below. Since $\varepsilon_i \geq 0$ always and $H(P_i) \leq \log|A(n_i)|$ (maximum entropy bound for finite neighborhood), we have: $$\mathcal{F}(t) \geq -\sum_{n_i \in S_q} \log|A(n_i)| > -\infty$$ By the Monotone Convergence Theorem for bounded-below non-increasing sequences, $\mathcal{F}(t)$ converges to a finite limit $\mathcal{F}^* \geq -\infty$. At convergence, $d\mathcal{F}/dt = 0$, which requires $\nabla_{w_i} F_i = 0$ for all $i$ (from the proof of Theorem 3.1). At critical points of $F_i$, the prediction error term $\mathbb{E}_{P_i}[\varepsilon_i]$ is minimized — this is precisely the condition $\sum_i \varepsilon_i \leq \varepsilon^*$ for appropriate $\varepsilon^*$.

Finite time. Under Lipschitz conditions on $F_i$ (which hold when $\varepsilon_i$ is differentiable in $w_i$, a standard regularity assumption), the gradient flow converges at rate $O(e^{-\eta t})$ to the fixed point, guaranteeing finite-time convergence to within $\varepsilon^*$.

Local detection. Node $n_i$ monitors $|\dot{F}_i(t)|$. By the chain rule: $$|\dot{F}_i| = |\langle \nabla_{w_i} F_i,\; \dot{w}_i \rangle| = \eta_i \|\nabla_{w_i} F_i\|^2$$ When $|\dot{F}_i| < \delta$, it follows that $\|\nabla_{w_i} F_i\| < \sqrt{\delta / \eta_i}$ — meaning local energy has locally converged. Each node can declare termination independently by monitoring its own free energy rate. When all nodes in $S_q$ terminate locally, the traversal has globally terminated — without any node needing knowledge of the global sum. $\square$

Corollary 3.1 (20-Watt Bound)

The energy consumed per thought-traversal is bounded by: $$E_{\mathcal{T}_q} = \int_0^{T^*} \sum_{n_i \in S_q(t)} c_i \cdot \mathbf{1}[\dot{\sigma}_i \neq 0]\, dt \leq C \cdot |S_q| \cdot T^*$$ where $c_i$ is the metabolic cost per spike and $C = \max_i c_i$. Since $|S_q| \ll |N|$ (sparse activation, typically 1–5% of neurons), and $T^*$ is short (tens to hundreds of milliseconds), the total energy across all concurrent traversals is consistent with the observed 20-watt metabolic budget.

§ 4

Node Function: Formal Specification and Properties

The third open problem was the abstract node transformation function $\phi(x_i, w_i, d_i)$. We now specify it concretely as a Locally-Gated Recurrent Unit (L-GRU) — a variant of the standard GRU that operates on local information only — and prove its key computational properties.

4.1 Gated Recurrent Node Model

Definition 4.1 (Locally-Gated Recurrent Unit)

The node transformation function $\phi$ for neuron $n_i$ is defined by the following gated dynamics: $$z_i(t) = \sigma_g\!\bigl(W_z x_i(t) + U_z d_i(t) + b_z\bigr) \quad \text{(update gate)}$$ $$r_i(t) = \sigma_g\!\bigl(W_r x_i(t) + U_r d_i(t) + b_r\bigr) \quad \text{(reset gate)}$$ $$\tilde{d}_i(t) = \tanh\!\bigl(W_h x_i(t) + U_h (r_i \odot d_i(t)) + b_h\bigr) \quad \text{(candidate state)}$$ $$d_i(t+1) = (1 - z_i(t)) \odot d_i(t) + z_i(t) \odot \tilde{d}_i(t) \quad \text{(state update)}$$ $$o_i(t) = W_o d_i(t+1) + b_o \quad \text{(output / next signal)}$$ where $\sigma_g$ is the sigmoid function, $\odot$ is elementwise product, and all weight matrices $(W_z, W_r, W_h, W_o, U_z, U_r, U_h)$ and biases constitute the node's weight vector $w_i$.

The gating mechanism provides the critical property: selective memory. The update gate $z_i$ controls how much past state $d_i$ to retain vs. how much to update from new input. The reset gate $r_i$ controls how much past state influences the candidate update. This allows $n_i$ to maintain long-term stable representations (closed gates) while remaining responsive to strong new signals (open gates) — precisely the behavior needed for both stable memory and rapid adaptation.

4.2 Local Gradient Descent Proof

Theorem 4.1 (Local Learnability)

The L-GRU node $n_i$ can learn its optimal weight vector $w_i^*$ via gradient descent on its local free energy $F_i$ without requiring gradient information from non-neighboring nodes. Specifically: $$\nabla_{w_i} F_i = f(x_i, d_i, \varepsilon_i, w_i)$$ is expressible entirely in terms of $n_i$'s local state and the error signal $\varepsilon_i$ received from its immediate downstream neighbor.

Proof

The local free energy is $F_i = \mathbb{E}_{P_i}[\varepsilon_i] - H(P_i)$. We expand the prediction error term: $$\varepsilon_i = \|o_i(t) - x_j^{\text{actual}}(t+1)\|^2$$ where $x_j^{\text{actual}}$ is the actual input received by the selected next node $n_j$, which is returned to $n_i$ as a feedback message $m_{ji}^{\text{fb}}$. This feedback is a direct message from immediate neighbor $n_j \in A(n_i)$, available locally.

The gradient through the L-GRU is computed by backpropagation through time (BPTT) within $n_i$'s own recurrent state — this is entirely local. Explicitly: $$\frac{\partial \varepsilon_i}{\partial W_o} = 2(o_i - x_j^{\text{actual}}) \cdot d_i(t+1)^\top$$ $$\frac{\partial \varepsilon_i}{\partial W_h} = 2(o_i - x_j^{\text{actual}}) \cdot W_o^\top \cdot \frac{\partial d_i}{\partial W_h}$$ All terms involve only $o_i$, $d_i$, $x_i$, and $x_j^{\text{actual}}$ — all locally available at $n_i$. No gradient flows from non-neighbors. The entropy gradient $\nabla_{w_i} H(P_i)$ depends only on $P_i$ which is a function of $w_i$ — also local. Therefore the full gradient $\nabla_{w_i} F_i$ is local. $\square$

4.3 Neuron Type Specializations

Different biological neuron types correspond to different parameter configurations of the L-GRU:

Neuron Type L-GRU Configuration Functional Role Update Gate Bias
Pyramidal (excitatory) Large $W_o$, low $b_z$ (open update gate) Forward signal propagation, integration $b_z \approx -1$ (frequently update)
Interneuron (inhibitory) Negative $W_o$, high $b_r$ (open reset gate) Suppression, gain control, timing $b_z \approx 0$ (context-sensitive)
Granule (encoding) High $b_z$ (closed update gate), large $U_h$ Stable memory storage, sparse coding $b_z \approx +2$ (rarely update)
Purkinje (error) $W_o$ tuned to $\varepsilon_i$ direction Error signal computation and broadcast Adaptive (error-driven)
Proposition 4.1 (Universal Approximation of Node Functions)

The L-GRU with hidden state dimension $d_i \geq K$ is a universal approximator for any continuous function $\phi: \mathbb{R}^{d_x} \times [0,T] \to \mathbb{R}^{d_o}$ on compact domain, for any $K$ depending only on the approximation error $\epsilon$ and the target function's smoothness. Therefore no neuron type requires a fundamentally different computational primitive — only different parameter configurations.

§ 5

Developmental Initialization Theory

The fourth open problem was the developmental origin of the initial graph topology. We now provide a formal model grounded in random graph theory and the mathematics of genetic constraint satisfaction.

5.1 Genetic Encoding as Sparse Random Graph

Definition 5.1 (Genetic Graph Prior)

Let $\mathcal{G}$ denote the space of all directed graphs on $n$ nodes. The genetic encoding defines a probability distribution $\mathbb{P}_{\text{gen}}$ over $\mathcal{G}$ via a set of $m$ genetic constraint functions $g_k : \mathcal{G} \to \{0,1\}$, $k = 1, \ldots, m$: $$\mathbb{P}_{\text{gen}}(G) \propto \exp\!\left(-\sum_{k=1}^m \lambda_k g_k(G)\right) \cdot \mathbf{1}[G \text{ is connected}]$$ This is a Gibbs distribution over graphs with genetic penalty parameters $\lambda_k$ — a maximum entropy prior subject to genetic constraints.

Definition 5.2 (Genetic Constraints)

The biologically motivated constraints include:
$g_1(G)$: Penalizes edges exceeding physical distance threshold $d_{\max}$ (axon length cost).
$g_2(G)$: Penalizes degree sequences deviating from power-law $P(\deg = k) \sim k^{-\alpha}$.
$g_3(G)$: Penalizes graphs without small-world property: $C(G) \ll C(\text{random})$ or $L(G) \gg L(\text{lattice})$.
$g_4(G)$: Rewards modular structure (high intra-cluster / low inter-cluster connectivity).

Theorem 5.1 (Initialization Produces Small-World Graph)

Under $\mathbb{P}_{\text{gen}}$ with the constraints of Definition 5.2, the expected initial graph $G_0 \sim \mathbb{P}_{\text{gen}}$ is a sparse small-world network satisfying: $$\mathbb{E}[C(G_0)] \gg C(G_{\text{random}}), \qquad \mathbb{E}[L(G_0)] \approx L(G_{\text{random}})$$ where $C(G)$ is the clustering coefficient and $L(G)$ is the average shortest path length.

Proof

The constraint $g_3$ directly penalizes deviation from the small-world regime. By the maximum entropy principle, the mode of $\mathbb{P}_{\text{gen}}$ is the distribution that satisfies all constraints while having maximum entropy — which is the Watts-Strogatz small-world model (Watts & Strogatz, 1998). Specifically, the Watts-Strogatz construction with rewiring probability $p_r \in (0.01, 0.1)$ achieves $C \gg C_{\text{random}}$ and $L \approx L_{\text{random}}$ simultaneously. Since our $g_3$ penalizes departures from this regime, the mode of $\mathbb{P}_{\text{gen}}$ concentrates near the Watts-Strogatz regime. The expectation follows by concentration of the Gibbs measure around its mode for large $n$ and positive $\lambda_k$. $\square$

5.2 Experience-Driven Topology Convergence

Definition 5.3 (Experience-Driven Edge Evolution)

Starting from $G_0 \sim \mathbb{P}_{\text{gen}}$, the graph evolves under the Spike-Timing Dependent Plasticity (STDP) rule: $$\Delta W(n_i \to n_j) = \begin{cases} A_+ \exp\!\left(-\frac{\Delta t}{\tau_+}\right) & \text{if } \Delta t = t_j - t_i > 0 \\ -A_- \exp\!\left(\frac{\Delta t}{\tau_-}\right) & \text{if } \Delta t < 0 \end{cases}$$ where $\Delta t$ is the relative spike timing. Edge $(n_i, n_j)$ is created when $W(n_i \to n_j)$ exceeds creation threshold $W_+$, and pruned when it falls below $W_- < 0$.

Theorem 5.2 (Developmental Convergence)

Under the STDP evolution from initial graph $G_0$, the graph $G(t)$ converges almost surely to a fixed-point topology $G^* = G(\infty)$ that: (a) encodes all experienced input statistics in its weight matrix $W^*$; (b) has strictly higher clustering coefficient than $G_0$: $C(G^*) \geq C(G_0)$; (c) has pruned all edges not reinforced by experience, reducing metabolic cost.

Proof

Define the stochastic process $\{W(t)\}_{t \geq 0}$ on the weight matrix. Under STDP, this is a stochastic differential equation: $$dW_{ij} = f_{\text{STDP}}(W_{ij}, \sigma_i, \sigma_j)\, dt + \xi_{ij}(t)\, dB_t$$ where $\xi_{ij}$ is noise from stochastic spiking and $B_t$ is a Brownian motion. By Lyapunov methods applied to this SDE, the drift term $f_{\text{STDP}}$ has the form of a mean-reverting process toward weight values that maximize the mutual information $I(\sigma_i; \sigma_j)$ between spike trains. This has a unique stable equilibrium at $W^* = \arg\max I(\sigma_i; \sigma_j)$ (Oja's rule convergence, 1982). The clustering coefficient increases because STDP preferentially strengthens edges within frequently co-activated groups — the triangles in the graph — directly increasing $C$. Edge pruning follows from the negative $A_-$ term eliminating edges with consistent anti-causal timing. Almost-sure convergence follows from the Borel-Cantelli lemma applied to the supermartingale $\|W(t) - W^*\|$. $\square$

§ 6

Necessary and Sufficient Hardware Conditions

We derive from first principles the necessary and sufficient conditions a physical substrate must satisfy to implement DAN Theory. These are not engineering preferences but mathematical necessities implied by the formal model.

Theorem 6.1 (Necessary Hardware Conditions)

Any physical substrate $\mathcal{H}$ that faithfully implements DAN Theory must satisfy all of the following:

(H1) Continuous state representation. $\mathcal{H}$ must support node states $s_i \in \mathbb{R}^d$ with $d \geq 4$ (input, weight, data, distribution). This rules out purely binary/digital nodes.

(H2) Spike timing resolution. $\mathcal{H}$ must resolve spike timing differences to precision $\delta t \leq 1/f_\gamma \approx 12$ ms for $\gamma$-band binding. This requires temporal resolution at the millisecond scale.

(H3) Local weight persistence. Each node $n_i$ must independently store and update its weight vector $w_i$ without global write access. This requires distributed non-volatile storage, not a shared parameter server.

(H4) Asynchronous event-driven computation. Nodes must compute only when receiving input (event-driven), not on a global clock. This is required for sparse activation and the $O(|S_q|)$ energy bound of Corollary 3.1.

(H5) Bidirectional communication. Each edge must support both forward signal propagation and backward error feedback, at minimum between immediate neighbors.

Proof of Necessity

Each condition follows from a specific part of the formal model:
H1: The state quadruple $(x_i, w_i, d_i, P_i)$ is defined over continuous spaces in Definition 1.2. A binary node cannot represent the L-GRU internal state (Definition 4.1).
H2: Theorem 2.2 requires computing phase coherence $\rho_{ij}(t)$ at $\gamma$-band frequency. Resolving $e^{i(\theta_i - \theta_j)}$ requires timing precision $\delta t \ll 1/40 \text{ Hz} = 25$ ms; conservatively $\delta t \leq 12$ ms.
H3: Theorem 4.1 proves local learnability requires each node to independently store and update $w_i$. A shared weight store violates local autonomy (Definition 1.4).
H4: Corollary 3.1 bounds energy by $O(|S_q|)$ active nodes. A globally clocked system computes over all $n$ nodes per cycle, consuming $O(n) \gg O(|S_q|)$ energy — inconsistent with the 20-watt bound.
H5: Theorem 3.2 requires each node to receive error feedback $\varepsilon_i$ from immediate downstream neighbors. This requires bidirectional edge communication. $\square$

Theorem 6.2 (Sufficiency Conditions)

Conditions H1–H5 are also sufficient: any substrate satisfying H1–H5 can implement DAN Theory up to approximation error $\epsilon > 0$ (from Proposition 4.1).

Proof (Constructive)

Given H1–H5, construct the DAN implementation as follows: (a) represent each node state $(x_i, w_i, d_i, P_i)$ in the continuous state space guaranteed by H1; (b) use the temporal resolution of H2 to compute phase coherence $\rho_{ij}$ via Definition 2.2; (c) use the local weight storage of H3 to implement the L-GRU weight updates of Theorem 4.1; (d) use the event-driven property of H4 to trigger computation only on spike arrival; (e) use the bidirectional edges of H5 to propagate forward signals and backward error. By Proposition 4.1, the L-GRU approximates any continuous node function to within $\epsilon$. All theorems (2.1, 2.2, 3.1, 3.2, 4.1) hold exactly under this construction. $\square$

ConditionCurrent TechnologyGapCandidate Solution
H1: Continuous states Analog circuits, memristors Noise; fabrication precision Memristive crossbar arrays
H2: ms timing resolution FPGA (ns), Loihi (1ms) Intel Loihi meets this requirement Loihi 2 (resolved)
H3: Distributed local weights On-chip SRAM per core Limited capacity; no online STDP at scale Phase-change memory per node
H4: Asynchronous event-driven Neuromorphic chips Loihi/BrainScaleS partially satisfy this Loihi 2, SpiNNaker2 (partial)
H5: Bidirectional edges Not in any current chip All current chips are feedforward only Open research problem

The analysis reveals that H5 is the hardest unsatisfied condition. No current neuromorphic architecture supports on-chip bidirectional error signaling per edge at scale. This is the primary hardware research direction implied by DAN Theory.

§ 7

Empirical Validation Protocol

A theory is only as strong as its falsifiable predictions. We derive from the formal model a set of specific, measurable predictions that distinguish DAN Theory from competing frameworks.

7.1 Prediction P1: $\gamma$-Coherence Predicts Selection

From Theorem 2.2, the probability that neuron $n_j$ is activated following $n_i$ should be a monotonically increasing function of their phase coherence $\rho_{ij}(t)$ at $\gamma$-band: $$\frac{\partial}{\partial \rho_{ij}} P_i(n_j \mid \mathcal{C}_i) > 0 \quad \text{(Prediction P1)}$$ Experimental test: Multi-electrode array recordings during cognitive tasks. Measure $\rho_{ij}$ between candidate neurons 50ms before activation. Test whether high-$\rho$ pairs activate more often than low-$\rho$ pairs, controlling for synaptic weight.

7.2 Prediction P2: Free Energy Rate Predicts Termination

From Theorem 3.2, the end of a cognitive task (response time) should correlate with the time at which $|\dot{F}_i|$ drops below threshold across all active neurons — measurable as a sudden drop in $\gamma$-power variance: $$t_{\text{response}} \approx \min\{t : \text{Var}[|\dot{\sigma}_i(t)|_{i \in S_q}] < \delta\} \quad \text{(Prediction P2)}$$ Experimental test: EEG/MEG recordings during decision tasks. Response time should correlate ($r > 0.7$) with the time of $\gamma$-power variance collapse, not with mean $\gamma$-power.

7.3 Prediction P3: Memory Interference Scales with Overlap

From Corollary 2.1, the interference between two memories $M$ and $M'$ should scale with $|S_M \cap S_{M'}|$: $$\text{Interference}(M, M') \propto \frac{|S_M \cap S_{M'}|}{|S_M| + |S_{M'}|} \quad \text{(Prediction P3)}$$ Experimental test: fMRI BOLD signal during dual-memory encoding tasks. Pairs of semantically similar memories (higher predicted overlap) should show more interference than semantically distant pairs. The interference coefficient should match the predicted proportionality.

7.4 Prediction P4: Attention Redistributes $\gamma$-Power

From the parallel traversal temperature model (Section 7 of the original paper), attentional capture should produce a measurable redistribution of $\gamma$-power from the interrupted task's neural population to the capturing task's population, conserving total power: $$\sum_q \text{Power}_q^{(\gamma)}(t) = \text{const} \quad \text{(Prediction P4)}$$ Experimental test: EEG during dual-task experiments with attentional interruption. Verify conservation of total $\gamma$-power under attentional shift, against alternative models that predict net $\gamma$-power increase.

Open Problem 7.1 (Binding Constant Estimation)

The binding threshold $\rho^*$ and the coherence influence parameter $\beta$ in Definition 2.4 are free parameters of the theory. Their values must be estimated from empirical data. We predict $\rho^* \in (0.4, 0.7)$ and $\beta \in (1, 5)$ based on analogy with Kuramoto coupling constants, but precise estimation requires multi-unit recording studies with simultaneous LFP and spike train measurements.

§ 8

The Unified DAN Theorem

We now state the central result unifying all components of the framework.

Theorem 8.1 (Unified DAN Theorem)

Let $G = (N, E, W, T)$ be a dynamic directed graph satisfying hardware conditions H1–H5 (Theorem 6.1), initialized under the genetic prior $\mathbb{P}_{\text{gen}}$ (Definition 5.1). Let each node $n_i$ implement the L-GRU dynamics (Definition 4.1) with local gradient updates, and let the binding function $R$ be the phase-coherence operator of Definition 2.4.

Then the following hold simultaneously:

(1) Binding without central coordination. For any traversal $\mathcal{T}_q$, the coherent assembly $\mathcal{A}_q(t)$ uniquely identifies traversal context for every shared node (Theorem 2.2), resolving superposition without global state.

(2) Self-termination. Every traversal $\mathcal{T}_q$ terminates in finite time $T^* < \infty$ through local free energy minimization alone (Theorem 3.2).

(3) Local learnability. Every node $n_i$ learns its optimal parameters $w_i^*$ via local gradient descent on $F_i$ without non-local gradient information (Theorem 4.1).

(4) Developmental convergence. Starting from sparse random topology $G_0$, the graph converges to an experience-encoded fixed point $G^*$ under STDP (Theorem 5.2).

(5) Energy efficiency. Total metabolic energy per traversal is $O(|S_q| \cdot T^*)$ — consistent with the 20-watt budget for typical $|S_q| \ll |N|$ (Corollary 3.1).

(6) Universal approximation. The system can approximate any continuous cognitive mapping $f: \mathcal{X} \to \mathcal{Y}$ to within $\epsilon > 0$ given sufficient node count $n$ and hidden dimension $d$ (Proposition 4.1 extended to graph level).

Proof

Statements (1)–(5) follow directly from the cited theorems, which have been proved independently above. The conditions of each theorem are satisfied by assumption: H1–H5 satisfy the hardware requirements; L-GRU satisfies the node function requirements; phase-coherence satisfies the binding requirements; STDP satisfies the developmental requirements.

Statement (6): By Proposition 4.1, each L-GRU node is a universal approximator for continuous functions on compact domains. A graph of such nodes with sufficient connectivity and depth can approximate any continuous mapping $f: \mathcal{X} \to \mathcal{Y}$ by the universal approximation theorem for deep networks (Cybenko, 1989; Hornik, 1991) extended to recurrent architectures (Schäfer & Zimmermann, 2006). The required conditions (depth, width, activation functions) are satisfied by the L-GRU configuration space.

The six statements are mutually consistent because: (1) does not require global state; (2) requires only local monitoring; (3) requires only neighbor feedback; (4) is a separate timescale (developmental) from (1)–(3) (operational); (5) follows from the sparsity of each individual proof's requirements. No statement assumes conditions that contradict another. $\square$

Corollary 8.1 (AGI Impossibility of Current Architectures)

No architecture that violates any of H1–H5 can implement DAN Theory and therefore cannot replicate biological cognition. Current transformer-based LLMs violate H1 (discrete token states), H3 (global shared parameter server), H4 (synchronous batched computation), and H5 (no backward error signaling per node). Therefore current LLM architectures are provably insufficient for biological-equivalent general intelligence, regardless of scale.

Open Problem 8.1 (Consciousness Condition)

Theorem 8.1 fully characterizes cognition and memory in the DAN framework. It does not characterize consciousness. We conjecture that consciousness corresponds to a meta-stable high-coherence state: $\exists\, S^* \subseteq N$ such that $\mathcal{A}_{S^*}(t) = S^*$ (the assembly is self-sustaining) and $|S^*| > \Omega(|N|^{\alpha})$ for some $\alpha \in (0,1)$. Formalizing and proving this conjecture — relating it to Integrated Information Theory or Global Workspace Theory — remains the primary open problem of the complete DAN framework.

§

References

  1. Kuramoto, Y. (1984). Chemical Oscillations, Waves, and Turbulence. Springer. [Phase-locking theory for coupled oscillators — Theorem 2.2]
  2. Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127–138. [Free energy minimization framework — Section 3]
  3. Watts, D.J. & Strogatz, S.H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440–442. [Small-world graph model — Theorem 5.1]
  4. Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273. [Weight convergence — Theorem 5.2]
  5. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. [Universal approximation — Theorem 8.1]
  6. Cho, K. et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078. [GRU architecture — Definition 4.1]
  7. Schäfer, A.M. & Zimmermann, H.G. (2006). Recurrent neural networks are universal approximators. International Journal of Neural Systems, 17(4), 253–263. [Universal approximation for recurrent networks — Theorem 8.1]
  8. Bi, G. & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons. Journal of Neuroscience, 18(24), 10464–10472. [STDP rule — Definition 5.3]
  9. Rao, R.P.N. & Ballard, D.H. (1999). Predictive coding in the visual cortex. Nature Neuroscience, 2, 79–87. [Predictive coding — Section 3]
  10. Tononi, G. (2004). An information integration theory of consciousness. BMC Neuroscience, 5, 42. [Consciousness formalization — Open Problem 8.1]
  11. Davies, M. et al. (2018). Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1), 82–99. [Hardware substrate — Section 6]
  12. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. [Universal approximation — Theorem 8.1]

Mathematical Foundations Extension. Preprint. Not peer reviewed.

Correspondence: Bharat Rawat · India

© 2026 Bharat Rawat. This work may be freely cited with attribution.