Reward-Modulated Spike-Timing-Dependent Plasticity
R-STDP is the primary learning rule in MH-FLOCKE. It combines two biological principles: spike-timing-dependent plasticity (STDP), where the relative timing of pre- and post-synaptic spikes determines whether a synapse strengthens or weakens, and reward modulation, where a global neuromodulatory signal (dopamine) gates whether the timing-based changes are actually applied.
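The two principles can be sketched in a few lines. This is an illustrative toy, not the MH-FLOCKE implementation; the names `stdp_window` and `rstdp_update` and all constants are assumptions:

```python
import math

def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau=20.0):
    """Classic STDP kernel: pre-before-post (dt > 0) potentiates,
    post-before-pre (dt < 0) depresses, with exponential falloff."""
    if dt > 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)

def rstdp_update(w, dt, dopamine, lr=0.01):
    """Reward modulation: the timing-based change is gated by a
    global dopamine signal before it is applied to the weight."""
    return w + lr * dopamine * stdp_window(dt)

# Causal spike pair (+5 ms) with positive dopamine strengthens the synapse
w = rstdp_update(0.5, dt=5.0, dopamine=1.0)
```

With `dopamine = 0` the timing information accumulates but no weight change is applied, which is the gating behavior described above.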
The Free Energy Principle
MH-FLOCKE follows the Free Energy Principle (Friston 2010) rather than pure reward maximization. The key insight: organisms minimize surprise (prediction error), not maximize reward. This is supported by Cortical Labs’ DishBrain experiments (Kagan et al. 2022), where biological neurons in a dish learned to play Pong purely through prediction error minimization — with no reward signal at all.
In practice, the learning signal blends both:
combined_signal = 0.1 × reward + 0.9 × (−prediction_error)
Prediction error dominates at 90%. This means the SNN primarily learns to make the world predictable, with reward as a secondary guide for which predictions matter.
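In code the blend is a one-liner (sketch; variable names are assumptions):

```python
def combined_signal(reward, prediction_error):
    """90/10 blend: minimizing prediction error dominates;
    reward only weights which predictions matter."""
    return 0.1 * reward + 0.9 * (-prediction_error)

# A large positive prediction error (surprise) yields a negative
# learning signal even when the reward is positive.
sig = combined_signal(reward=1.0, prediction_error=0.5)
```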
Mathematical Formulation
The weight update rule per synapse:
Δw = lr × combined_signal × eligibility
where:
lr = base_lr × (1 + ACh) — ACh boosts learning rate
eligibility = clamp(trace, -1, 1) — accumulated spike coincidences
combined_signal = 0.1×R + 0.9×(-PE) — when task PE active
= R — when no task PE (flat meadow)
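Putting the three factors together, one single-synapse update step might look like this (sketch for the task-PE-active case; `clamp`, `base_lr`, and `ach` are assumed names):

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def rstdp_step(w, trace, reward, pe, base_lr=0.01, ach=0.0):
    """One R-STDP weight update for a single synapse."""
    lr = base_lr * (1.0 + ach)             # ACh boosts the learning rate
    signal = 0.1 * reward + 0.9 * (-pe)    # combined learning signal
    eligibility = clamp(trace, -1.0, 1.0)  # bounded eligibility
    return w + lr * signal * eligibility

# Saturated trace (2.0 clamps to 1.0); ACh = 1 doubles the rate
w_new = rstdp_step(w=0.5, trace=2.0, reward=1.0, pe=-0.5, ach=1.0)
```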
Task Prediction Error (Issue #79)
The global World Model prediction error (~0.004) is too small to drive navigation learning. MH-FLOCKE introduces a task-specific PE inspired by DishBrain’s local stimulation principle:
dist_PE = (ball_dist - ref_dist) / ref_dist
Asymmetric penalty (loss aversion, Kahneman 1979):
if walking_away: dist_PE *= 2.0
Proximity brake (Issue #79c):
if ball_dist < 0.5m and departing:
dist_PE = departure × 10.0
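The three pieces above combine roughly as follows (sketch; `departure` is taken here to be the per-step rate of increase in ball distance, which is an assumption):

```python
def task_pe(ball_dist, departure, ref_dist=3.0):
    """Distance PE with loss aversion and proximity brake.
    departure > 0 means the agent is moving away from the ball."""
    dist_pe = (ball_dist - ref_dist) / ref_dist
    if departure > 0:                  # walking away: asymmetric penalty
        dist_pe *= 2.0
    if ball_dist < 0.5 and departure > 0:
        dist_pe = departure * 10.0     # proximity brake overrides
    return dist_pe

# Close to the ball but departing: the brake dominates
pe = task_pe(ball_dist=0.4, departure=0.2)
```

Note that the brake replaces (rather than scales) the distance term, so near the ball a departing agent always receives a large positive PE.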
Additionally, when task PE is positive (failing), vision input neurons receive a current boost proportional to PE — forcing the SNN to attend to the ball signal. This is the DishBrain principle: chaos when wrong, calm when right.
Eligibility Traces
Eligibility traces accumulate spike coincidences over time, implementing a biologically plausible credit-assignment mechanism. When a pre-synaptic neuron fires shortly before a post-synaptic neuron (causal timing), the eligibility trace increases. The trace decays slowly every simulation step (×0.95) and is multiplied by 0.3 after each learning step, preventing stale correlations from affecting later updates.
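A toy version of the trace dynamics described above (sketch; the constants match the decay values quoted in this document):

```python
def update_eligibility(elig, pre_spike, post_spike, pre_trace, post_trace):
    """Causal pre->post coincidences raise the trace;
    anti-causal post->pre coincidences lower it."""
    d_elig = pre_spike * post_trace - post_spike * pre_trace
    return elig * 0.95 + d_elig        # slow per-step decay

elig = 0.0
elig = update_eligibility(elig, pre_spike=1, post_spike=0,
                          pre_trace=0.0, post_trace=0.8)  # causal pairing
after_learning = elig * 0.3            # decay applied after each learning step
```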
E/I Balance (Dale's Law)
Excitatory and inhibitory neurons maintain their sign throughout learning. Excitatory weights are clamped to [0, 1], inhibitory weights to [-1, 0]. This ensures the network maintains biological plausibility and prevents pathological dynamics.
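Sign preservation can be enforced by clamping after each update (sketch in plain Python; the `is_excitatory` flag is an assumed per-synapse attribute, not the actual MH-FLOCKE API):

```python
def enforce_dale(w, is_excitatory):
    """Clamp excitatory weights to [0, 1] and inhibitory weights to
    [-1, 0] so no synapse can flip sign during learning."""
    if is_excitatory:
        return max(0.0, min(1.0, w))
    return max(-1.0, min(0.0, w))

# An excitatory weight driven negative by an update snaps back to 0
w = enforce_dale(-0.02, is_excitatory=True)
```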
Protected Populations
Cerebellar populations (PkC, DCN) are protected from R-STDP updates. Their weights are modified only by the cerebellar learning rule (climbing fiber-driven LTD). This prevents the reward signal from interfering with the forward model.
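One way to implement such protection is a boolean mask applied to the R-STDP delta (sketch; the function and mask names are assumptions, not the MH-FLOCKE API):

```python
def apply_masked_update(weights, deltas, protected):
    """Skip R-STDP updates for synapses onto protected populations
    (e.g. cerebellar PkC/DCN), leaving them to the cerebellar rule."""
    return [w if p else w + d
            for w, d, p in zip(weights, deltas, protected)]

weights = [0.5, 0.5, 0.5]
deltas = [0.05, 0.05, 0.05]
protected = [False, True, False]   # middle synapse targets PkC/DCN
new_w = apply_masked_update(weights, deltas, protected)
```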
Results
The Free Energy approach with task PE produces a 457-frame proximity streak (dog stays near ball), compared to approach-overshoot-walkaway behavior with reward-only learning. The asymmetric PE (loss aversion) creates a ratchet effect where the SNN strongly avoids increasing ball distance.
References
- Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience
- Kagan, B.J. et al. (2022). In vitro neurons learn and exhibit sentience when embodied in a simulated game-world. Neuron
- Frémaux, N. & Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity. Frontiers in Computational Neuroscience
- Kahneman, D. & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica
API Reference
SNNController.apply_rstdp(reward_signal, prediction_error)
The R-STDP learning rule lives in snn_controller.py. See SNN Controller docs for full API.
Learning Signal Computation
if abs(prediction_error) > 0.05:  # Task PE active (ball scene)
    signal = 0.1 * reward + 0.9 * (-prediction_error)
else:                             # No task PE (flat meadow)
    signal = reward
dw = lr * (1 + ACh) * signal * clamp(eligibility, -1, 1)
dw = clamp(dw, -0.05, 0.05)       # Max update per step
Eligibility Trace Update (per synapse, per step)
d_elig = pre_spike * post_trace - post_spike * pre_trace
eligibility = eligibility * 0.95 + d_elig
# After apply_rstdp: eligibility *= 0.3 (decay after use)
Task PE (cognitive_brain.py, lines 294–370)
State-based distance PE
dist_PE = (ball_dist - 3.0) / 3.0
if walking_away: dist_PE *= 2.0 # Loss aversion
if ball_dist < 0.5 and departing:
dist_PE = departure * 10.0 # Proximity brake
heading_PE = abs(ball_heading) * 0.3  # Only when ball_dist > 1.5m
task_PE = clamp(dist_PE + heading_PE, -2, 2)
DishBrain Vision Boost
if task_PE > 0.05:
snn.V[vision_neurons] += task_PE * 0.5 # Last 16 input neurons