
R-STDP Learning

Reward-Modulated Spike-Timing-Dependent Plasticity

R-STDP is the primary learning rule in MH-FLOCKE. It combines two biological principles: spike-timing-dependent plasticity (STDP), where the relative timing of pre- and post-synaptic spikes determines whether a synapse strengthens or weakens, and reward modulation, where a global neuromodulatory signal (dopamine) gates whether the timing-based changes are actually applied.

The Free Energy Principle

MH-FLOCKE follows the Free Energy Principle (Friston 2010) rather than pure reward maximization. The key insight: organisms minimize surprise (prediction error), not maximize reward. This is supported by Cortical Labs’ DishBrain experiments (Kagan et al. 2022), where biological neurons in a dish learned to play Pong purely through prediction error minimization — with no reward signal at all.

In practice, the learning signal blends both:

combined_signal = 0.1 × reward + 0.9 × (−prediction_error)

Prediction error dominates at 90%. This means the SNN primarily learns to make the world predictable, with reward as a secondary guide for which predictions matter.
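A minimal sketch of this blend (the function name `combined_signal` is illustrative, not the actual MH-FLOCKE API):

```python
def combined_signal(reward: float, prediction_error: float) -> float:
    """Blend reward with negated prediction error, 10/90 weighting."""
    return 0.1 * reward + 0.9 * (-prediction_error)

# A perfect prediction (PE = 0) lets reward through at 10% strength:
print(combined_signal(reward=1.0, prediction_error=0.0))  # 0.1
# A large prediction error dominates the signal regardless of reward:
print(combined_signal(reward=1.0, prediction_error=1.0))  # negative
```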

Mathematical Formulation

The weight update rule per synapse:

Δw = lr × combined_signal × eligibility

where:
  lr = base_lr × (1 + ACh)           — ACh boosts learning rate
  eligibility = clamp(trace, -1, 1)  — accumulated spike coincidences
  combined_signal = 0.1×R + 0.9×(-PE) — when task PE active
                  = R                  — when no task PE (flat meadow)
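Putting the pieces together, here is a hedged single-synapse sketch of one update step (scalar form; names like `ach`, `base_lr`, and `task_pe_active` are placeholders, and the per-step cap of 0.05 is taken from the API section below — the real rule lives in snn_controller.py):

```python
def clamp(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def rstdp_step(w, eligibility, reward, prediction_error,
               ach=0.0, base_lr=0.01, task_pe_active=True):
    """One R-STDP weight update for a single synapse."""
    if task_pe_active:
        signal = 0.1 * reward + 0.9 * (-prediction_error)
    else:
        signal = reward                  # flat meadow: reward only
    lr = base_lr * (1.0 + ach)           # ACh boosts the learning rate
    dw = lr * signal * clamp(eligibility, -1.0, 1.0)
    dw = clamp(dw, -0.05, 0.05)          # cap the per-step update
    return w + dw

# eligibility 2.0 is clamped to 1.0; reward 0, PE -1.0 gives signal 0.9
w_new = rstdp_step(w=0.5, eligibility=2.0, reward=0.0, prediction_error=-1.0)
print(w_new)  # 0.509
```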

Task Prediction Error (Issue #79)

The global World Model prediction error (~0.004) is too small to drive navigation learning. MH-FLOCKE introduces a task-specific PE inspired by DishBrain’s local stimulation principle:

dist_PE = (ball_dist - ref_dist) / ref_dist

Asymmetric penalty (loss aversion, Kahneman 1979):
  if walking_away: dist_PE *= 2.0

Proximity brake (Issue #79c):
  if ball_dist < 0.5m and departing:
    dist_PE = departure × 10.0

Additionally, when task PE is positive (failing), vision input neurons receive a current boost proportional to PE — forcing the SNN to attend to the ball signal. This is the DishBrain principle: chaos when wrong, calm when right.

Eligibility Traces

Eligibility traces accumulate spike coincidences over time, implementing a biologically plausible credit assignment mechanism. When a pre-synaptic neuron fires shortly before a post-synaptic neuron (causal timing), the eligibility trace increases. After each learning step the trace is multiplied by 0.3, preventing stale correlations from affecting future updates.
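A toy pure-Python sketch of the causal-timing convention described above, using pair-based traces with multiplicative decay (the decay constant 0.95 is taken from the API section below; the spike pattern is illustrative):

```python
# Pre-synaptic neuron fires at t=0, post-synaptic at t=2 (causal order).
decay = 0.95
pre_trace = post_trace = eligibility = 0.0

for pre, post in [(1, 0), (0, 0), (0, 1)]:
    # A post spike while the pre trace is still high (causal pairing)
    # raises the eligibility; the reverse pairing lowers it.
    d_elig = post * pre_trace - pre * post_trace
    eligibility = eligibility * decay + d_elig
    pre_trace = pre_trace * decay + pre
    post_trace = post_trace * decay + post

print(eligibility)  # positive: causal timing built up the trace
```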

E/I Balance (Dale's Law)

Excitatory and inhibitory neurons maintain their sign throughout learning. Excitatory weights are clamped to [0, 1], inhibitory weights to [-1, 0]. This ensures the network maintains biological plausibility and prevents pathological dynamics.
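A sketch of the sign-preserving clamp, assuming one boolean excitatory flag per pre-synaptic neuron (illustrative layout, not the actual data structures):

```python
def clamp_dale(w, is_excitatory):
    """Sign-preserving clamp: excitatory -> [0, 1], inhibitory -> [-1, 0]."""
    if is_excitatory:
        return max(0.0, min(1.0, w))
    return max(-1.0, min(0.0, w))

weights = [1.3, -0.2, 0.4, -1.5]
excitatory = [True, True, False, False]
clamped = [clamp_dale(w, e) for w, e in zip(weights, excitatory)]
# clamped == [1.0, 0.0, 0.0, -1.0]: an excitatory weight that drifted
# negative is pinned at 0, never allowed to flip sign (and vice versa).
```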

Protected Populations

Cerebellar populations (PkC, DCN) are protected from R-STDP updates. Their weights are modified only by the cerebellar learning rule (climbing fiber-driven LTD). This prevents the reward signal from interfering with the forward model.
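One way to express the protection is a boolean mask that skips R-STDP deltas on cerebellar synapses (the names `apply_updates` and `protected` are hypothetical; the real masking lives inside the controller):

```python
def apply_updates(weights, deltas, protected):
    """Apply R-STDP deltas everywhere except protected synapses."""
    return [w if p else w + d
            for w, d, p in zip(weights, deltas, protected)]

w = [0.2, 0.5, 0.7]
dw = [0.01, 0.01, 0.01]
protected = [False, True, False]   # middle synapse: PkC/DCN, frozen
updated = apply_updates(w, dw, protected)
# the protected middle weight stays exactly at 0.5
```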

Results

The Free Energy approach with task PE produces a 457-frame proximity streak (dog stays near ball), compared to approach-overshoot-walkaway behavior with reward-only learning. The asymmetric PE (loss aversion) creates a ratchet effect where the SNN strongly avoids increasing ball distance.

References

  • Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience
  • Kagan, B.J. et al. (2022). In vitro neurons learn and exhibit sentience when embodied in a simulated game-world. Neuron
  • Frémaux, N. & Gerstner, W. (2016). Neuromodulated spike-timing-dependent plasticity. Frontiers in Computational Neuroscience
  • Kahneman, D. & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica

API Reference

SNNController.apply_rstdp(reward_signal, prediction_error)

The R-STDP learning rule lives in snn_controller.py. See SNN Controller docs for full API.

Learning Signal Computation

if abs(prediction_error) > 0.05:  # Task PE active (ball scene)
    signal = 0.1 * reward + 0.9 * (-prediction_error)
else:                             # No task PE (flat meadow)
    signal = reward

dw = lr * (1 + ACh) * signal * clamp(eligibility, -1, 1)
dw = clamp(dw, -0.05, 0.05)       # Max update per step

Eligibility Trace Update (per synapse, per step)

d_elig = post_spike * pre_trace - pre_spike * post_trace  # causal pairing raises the trace
eligibility = eligibility * 0.95 + d_elig
# After apply_rstdp: eligibility *= 0.3 (decay after use)

Task PE (cognitive_brain.py, lines 294–370)

State-based distance PE

dist_PE = (ball_dist - 3.0) / 3.0             # ref_dist = 3.0 m
if walking_away:
    dist_PE *= 2.0                            # Loss aversion
if ball_dist < 0.5 and departing:
    dist_PE = departure * 10.0                # Proximity brake
heading_PE = abs(ball_heading) * 0.3 if ball_dist > 1.5 else 0.0
task_PE = clamp(dist_PE + heading_PE, -2, 2)

DishBrain Vision Boost

if task_PE > 0.05:
    snn.V[vision_neurons] += task_PE * 0.5  # Last 16 input neurons