My Robot Dog Couldn’t Walk Straight — 8 Bugs and a New Steering System
The Freenove robot dog has a problem. It drifts to the right. Every unit is different — servo tolerances, weight distribution, surface friction — and no amount of calibration fixes it permanently. The drift changes with battery level, temperature, and floor surface.
For weeks, I tried to fix this with Z-offset steering — shifting all four feet sideways to create a turning moment. It seemed logical. It was useless. A 45-second hardware test proved it: ±5mm of Z-offset produces less than 5 degrees of effect against 70 degrees of mechanical drift. One measurement killed weeks of assumptions.
Tank Steering
The replacement is asymmetric stride: differential hip amplitude between left and right legs. When the left legs take a longer step, the dog curves right, like a tank. Biology does this through the reticulospinal tract, which modulates stride length independently per side.
Hardware test: drift reduced from 70 degrees to 8.5 degrees. Three times more effective than Z-offset, and it works on any surface because the IMU provides closed-loop feedback.
The PID controller reads the actual heading from the MPU6050 IMU, compares it to the target heading from the camera (where the light is), and drives the stride asymmetry. No calibration, no per-robot tuning. The I-term accumulates over time to eliminate steady-state offset — exactly what the cerebellum does biologically through long-term depression.
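The control loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual controller: the class name `HeadingPID`, the helper `stride_amplitudes`, the output limit, the anti-windup rule, and the sign convention (positive output means "turn right") are all assumptions. Only the gain values are taken from the hardware tuning reported later in this post.

```python
from dataclasses import dataclass
from typing import Optional


def wrap_deg(angle: float) -> float:
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


@dataclass
class HeadingPID:
    kp: float = 0.05            # gains as reported for the hardware tuning
    ki: float = 0.01
    kd: float = 0.015
    limit: float = 1.0          # hypothetical saturation of the asymmetry command
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, target_deg: float, heading_deg: float, dt: float) -> float:
        """Return a stride asymmetry in [-limit, limit] (assumed: > 0 turns right)."""
        error = wrap_deg(target_deg - heading_deg)
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        candidate = self.integral + error * dt
        raw = self.kp * error + self.ki * candidate + self.kd * derivative
        out = max(-self.limit, min(self.limit, raw))
        if out == raw:
            # Simple anti-windup (assumption): integrate only while unsaturated.
            self.integral = candidate
        return out


def stride_amplitudes(base: float, asymmetry: float) -> tuple:
    """Left legs step longer for a right turn, as in the tank-steering idea."""
    return base * (1.0 + asymmetry), base * (1.0 - asymmetry)


pid = HeadingPID()
left, right = stride_amplitudes(1.0, pid.update(target_deg=10.0, heading_deg=0.0, dt=0.02))
print(left, right)
```

With these gains, a heading error of about 20 degrees already saturates the command, which would be consistent with the saturation seen near the end of the light approach.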
Eight Bugs in One Session
The steering replacement exposed eight bugs that had been hiding in the system:
- The steering signal was computed correctly but silently dropped in compute_tendon(): wrong code path, one-line fix.
- The log was showing a proxy value instead of the actual steering. The controller had been working the whole time, but the display said zero.
- The competence gate required walking speed above 0.03 m/s, but with drift, all locomotion energy went to correction instead of forward progress. The baby never grew up.
- MuJoCo yaw convention is inverted versus hardware: one minus sign.
- A threshold prevented target updates when the dog was roughly aimed at the light.
- The PD controller initialized its target to zero instead of the current heading.
Each one of these individually prevented the system from working. Finding them required switching between simulation and real hardware, comparing signs and values, and measuring instead of guessing.
The Dog Approaches the Light
After fixing all eight bugs and tuning the PID on hardware (Kp=0.05, Ki=0.01, Kd=0.015), the Freenove robot closes the distance to a light source from 0.52 meters to 0.17 meters in 60 seconds. The tracking is not perfect, since the steering saturates near the end, but it is genuine, IMU-corrected, drift-compensated navigation on a 100-euro robot kit.
In simulation with measured hardware drift injected: 50,000 steps, zero falls, three light targets found, actor competence 1.0.
Meta-Learning Loop
This release also introduces the complete autonomous meta-learning loop — four modules that form a closed self-improvement cycle:
EpisodeAnalyzer compares successful versus unsuccessful navigation events and identifies what makes the dog successful. Which context variables (gait quality, heading error, velocity, steering offset) correlate with finding the light?
StrategyAdapter converts those insights into parameter adjustments — modifying run/tumble duration, PID gains, and exploration bias.
CuriosityExplorer uses world model prediction error to drive exploration. High prediction error means unfamiliar territory — explore more. Low prediction error means familiar ground — exploit what works.
HypothesisGenerator creates testable motor hypotheses from insights that can be tested autonomously through the existing Directed Learning module.
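To make the CuriosityExplorer idea concrete, here is one possible way to turn world-model prediction error into an explore/exploit decision. This is only a sketch of the principle: the function names, the saturating mapping, and the `pe_scale` parameter are hypothetical and are not taken from the MH-FLOCKE code.

```python
import random


def exploration_weight(prediction_error: float, pe_scale: float = 1.0) -> float:
    """Map a prediction error to a weight in [0, 1): higher error, more exploration."""
    pe = max(prediction_error, 0.0)
    return pe / (pe + pe_scale)


def choose_mode(prediction_error: float, rng: random.Random, pe_scale: float = 1.0) -> str:
    """Pick 'explore' with probability equal to the exploration weight."""
    if rng.random() < exploration_weight(prediction_error, pe_scale):
        return "explore"
    return "exploit"


rng = random.Random(0)
print(exploration_weight(0.1), exploration_weight(3.0))
print(choose_mode(3.0, rng))
```

The saturating form keeps the weight bounded no matter how large the prediction error gets; any monotonic mapping would express the same idea.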
The loop runs but has not generated insights yet — the dog is too successful in the current scenario (100% success rate, no failures to learn from). Harder scenarios and longer runs will activate it.
Hardware Drift Simulation
The drift profile injected into the simulator has been updated. The previous measurement (-0.4 deg/s) was taken during a stationary test. Under walking load, the actual drift is 1.5 to 2.0 deg/s — servo asymmetry amplifies under dynamic conditions. The updated profile makes simulation training more realistic.
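Injecting a measured drift rate into a simulated heading could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the drift is modeled as a constant yaw rate drawn once per simulated unit from the 1.5 to 2.0 deg/s range quoted above, and the sign convention is arbitrary. The real profile may be more detailed.

```python
import random


def sample_drift_rate(rng: random.Random, low: float = 1.5, high: float = 2.0) -> float:
    """Draw one per-unit drift rate in deg/s from the range measured under walking load."""
    return rng.uniform(low, high)


def apply_drift(yaw_deg: float, drift_deg_per_s: float, dt: float) -> float:
    """Advance the heading by the drift accumulated during one time step."""
    return yaw_deg + drift_deg_per_s * dt


rng = random.Random(42)
rate = sample_drift_rate(rng)
yaw = 0.0
for _ in range(1000):          # 10 simulated seconds at dt = 0.01 s
    yaw = apply_drift(yaw, rate, dt=0.01)
print(rate, yaw)
```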
Every Freenove unit has different drift characteristics. MH-FLOCKE handles this automatically through the PID controller. No manual calibration needed.
What Comes Next
Hardware video with the new PID steering. Longer simulation runs to activate the meta-learning loop. A potential third robot platform (Petoi Bittle X V2) to prove the architecture is body-agnostic.
The code is on GitHub (Apache 2.0). Updated documentation at mhflocke.com/docs.
v0.5.1 — Full changelog

The Dog Finds Its Target — No Reward Required
The Freenove robot dog can now navigate to targets on its own. No external reward, no reward shaping, no supervision. It sniffs, turns, runs, and finds what it’s looking for — using a biological navigation pattern that bacteria figured out billions of years ago.
What happened
The Freenove robot dog — a 100-euro kit with a Raspberry Pi and 12 servos — is controlled by a network of 560 spiking Izhikevich neurons. Not conventional neural networks, but biologically realistic neurons that fire like real brain cells.
In simulation, we place scent sources on the ground. The dog can sense the scent intensity and direction. It has to figure out how to get there by itself.
The result after 33,000 steps: 5.43 meters walked, 4 scent sources found, zero falls. That’s 61% further than the baseline without scent, which just walked straight ahead blindly.
Why this is harder than it sounds
The obvious approach — continuously steer toward the smell — doesn’t work. We tried it. The dog spiraled in circles. Every step, it corrected its heading. Every correction shifted the scent angle. The feedback loop turned into a death spiral.
This is a classic engineering mistake. Real animals don’t navigate like PID controllers.
How real animals navigate
Bacteria solved this problem billions of years ago. The mechanism was described by Berg and Brown in 1972: Run-and-Tumble.
The principle is simple. Sniff — measure the gradient. Tumble — a brief turning impulse toward the source. Run — walk straight, no corrections. Then sniff again. If the scent got stronger during the run, extend the next straight phase. If not, correct more often.
It’s not continuous steering. It’s a rhythm of orientation and movement. Sniff, turn, run. Sniff, turn, run. Like a dog following a trail.
What we built
We implemented this biological pattern as a state machine in the training loop. Three states: SNIFF (1 step — measure the gradient), TUMBLE (12 steps — steering impulse), RUN (40 steps — straight ahead, no corrections).
The system includes an improvement check: if scent strength increased after a run, the next straight phase gets extended. The dog found the right direction — keep going. If not, the phase stays short and it corrects more frequently.
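The three-state rhythm and the improvement check can be sketched as a small state machine. The step counts (1, 12, 40) come from the description above. Everything else is an assumption made for illustration: the class name, the size of the run extension, the cap on run length, and the steering sign convention.

```python
from dataclasses import dataclass
from typing import Optional

SNIFF_STEPS = 1
TUMBLE_STEPS = 12
BASE_RUN_STEPS = 40


@dataclass
class RunAndTumble:
    run_extension: int = 20        # hypothetical: extra run steps after an improvement
    max_run_steps: int = 120       # hypothetical cap on the straight phase
    state: str = "SNIFF"
    steps_left: int = SNIFF_STEPS
    run_steps: int = BASE_RUN_STEPS
    last_scent: Optional[float] = None
    turn_sign: float = 0.0

    def step(self, scent_strength: float, scent_angle: float) -> float:
        """Advance one control step and return a steering command in [-1, 1]."""
        steer = 0.0                              # SNIFF and RUN do not steer
        if self.state == "SNIFF":
            improved = self.last_scent is not None and scent_strength > self.last_scent
            if improved:
                self.run_steps = min(self.run_steps + self.run_extension, self.max_run_steps)
            else:
                self.run_steps = BASE_RUN_STEPS  # stay short, correct more often
            self.last_scent = scent_strength
            self.turn_sign = (scent_angle > 0) - (scent_angle < 0)
        elif self.state == "TUMBLE":
            steer = float(self.turn_sign)        # brief turning impulse toward the source
        self.steps_left -= 1
        if self.steps_left == 0:
            self._advance()
        return steer

    def _advance(self) -> None:
        if self.state == "SNIFF":
            self.state, self.steps_left = "TUMBLE", TUMBLE_STEPS
        elif self.state == "TUMBLE":
            self.state, self.steps_left = "RUN", self.run_steps
        else:
            self.state, self.steps_left = "SNIFF", SNIFF_STEPS


nav = RunAndTumble()
commands = [nav.step(scent_strength=0.5, scent_angle=0.3) for _ in range(53)]
print(sum(commands), nav.state)
```

The key property is that steering happens only during TUMBLE. During RUN the heading is left alone, which is what breaks the feedback spiral described earlier.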
Three bugs, one breakthrough
It didn’t work immediately. Three bugs were hiding in the system.
First, the heading computation was wrong. A quaternion has four components (W, X, Y, Z), and we used the W component as the yaw angle. W is the cosine of half the rotation angle, not a heading, so it stays close to 1.0 whenever the body is near its reference orientation. The dog had been sniffing in the wrong direction for weeks without us noticing.
Second, the scent radius was too small. The Freenove is a small robot with short steps. At a target radius of 0.5 meters, it walked past targets more often than into them.
Third, new scent sources spawned behind the dog instead of ahead of it, because the respawn logic didn’t account for the walking direction.
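For the first bug, the standard way to recover yaw from a unit quaternion in (w, x, y, z) order is shown here as a standalone sketch. The function name is illustrative, and the formula assumes the common ZYX (yaw-pitch-roll) convention with z as the vertical axis. The actual fix in the codebase may be written differently.

```python
import math


def yaw_from_quaternion(w: float, x: float, y: float, z: float) -> float:
    """Yaw in radians (rotation about the vertical z-axis) of a unit quaternion."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


# A pure 10-degree yaw: w is about 0.996, which says almost nothing about heading.
half = math.radians(10.0) / 2.0
w, x, y, z = math.cos(half), 0.0, 0.0, math.sin(half)
print(w, math.degrees(yaw_from_quaternion(w, x, y, z)))
```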
After fixing all three: 4 targets found, 5.43 meters, zero falls. And the Directed Learning module autonomously tested and confirmed a hypothesis about gait frequency — without us programming that behavior.
What this means
The system now consists of: a spiking neural network that learns motor control, a cerebellum that corrects movements in real time, a central pattern generator that provides the base gait, an emotion and motivation system, episodic memory, and biologically grounded navigation.
None of these modules use external reward. The dog doesn’t get points for walking or arriving. It learns from body signals: losing balance feels bad, moving feels good, curiosity drives it forward.
And it runs on a 100-euro robot. Same code in simulation and on the Raspberry Pi.
What’s next
The next step is a closed learning loop: after each episode, the dog asks itself what worked and what didn’t. The building blocks exist — episodic memory, concept graph, world model, directed learning. They just need to be connected.
On real hardware, scent becomes light: the Freenove has a camera, and brightness in the image is a gradient just like scent in the air. Same algorithm, different sensor. Point a flashlight at the floor, the dog walks toward it.
The video is on YouTube: https://www.youtube.com/watch?v=phYPEFLMlJI
Code on GitHub: github.com/MarcHesse/mhflocke

v0.4.3: The Wall — When a Spiking Neural Network Learns to Stop
A dog doesn’t need to crash into a wall twice. The first bump tells the whiskers something is wrong, the brainstem slams the brakes, and the cerebellum remembers: next time, slow down earlier.
Today’s update teaches the Freenove robot dog the same lesson — using the same biological architecture.
The Experiment
The setup is simple: a small quadruped robot, 232 spiking neurons, a wall 80cm ahead. No pre-programming, no path planner, no reward shaping beyond “hitting the wall is bad.” The question: can a biologically grounded spiking neural network learn active behavior change from a clear binary signal?
The answer, after 9 bug fixes and 8 training runs: yes.
What We Found (and What Broke Along the Way)
The first seven runs produced corrections of exactly 0.0000. The cerebellum was learning — PF→PkC weights were growing — but the corrections never reached the motors. Three fundamental architecture bugs were hiding in the pipeline:
Bug 1: The DCN was deaf. Deep Cerebellar Nuclei compute motor corrections as the difference between “push” and “pull” populations. But the DCN was reading Purkinje cell spike activity — an exponential moving average that was essentially zero because PkC rarely fire discrete spikes. The fix: read the graded compartment state (apical voltage + dendritic calcium), not just spikes. This is more biologically accurate — PkC→DCN synapses show graded GABAergic release proportional to membrane potential.
Bug 2: Symmetric climbing fibers. When the robot hit the wall, the Inferior Olive sent identical error signals to both push and pull Purkinje cells. Identical calcium → identical DCN inhibition → push minus pull = zero → corrections = zero. Always. The fix: asymmetric CF — push PkC get strong CF (0.9), pull PkC get weak CF (0.1). Biology does the same thing: when an animal hits an obstacle, the correction is to reduce forward drive and increase braking.
Bug 3: Weighted blending killed corrections. The CPG-SNN blend was cpg × weight + correction × (1 - weight). With CPG at 90%, corrections were multiplied by 0.1 — a 10× attenuation. The fix: additive blending — cpg × weight + correction. The cerebellum modulates the CPG via the reticulospinal tract; it doesn’t compete with it.
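The difference between the two blending formulas in Bug 3 is easy to check numerically. The snippet below is an illustration only: the function names and the sample CPG and correction values are made up, while the two formulas and the 90% weight are the ones described above.

```python
def blend_weighted(cpg: float, correction: float, weight: float) -> float:
    """Old formula: the correction competes with the CPG for a share of the output."""
    return cpg * weight + correction * (1.0 - weight)


def blend_additive(cpg: float, correction: float, weight: float) -> float:
    """New formula: the correction is added on top of the weighted CPG output."""
    return cpg * weight + correction


cpg, correction, weight = 0.5, 0.01, 0.9     # illustrative values
old = blend_weighted(cpg, correction, weight)
new = blend_additive(cpg, correction, weight)
# Contribution of the correction alone: about 0.001 before, 0.01 after.
print(old - cpg * weight, new - cpg * weight)
```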
The Breakthrough: Run 8
After fixing all three bugs, Run 8 showed something we’d never seen before:
- Corrections alive: 0.001–0.012 (was 0.0000 in all previous runs)
- 6 wall collisions in 20,000 steps (episodic learning working)
- Actor competence: 0.000 → 0.299 (first non-zero ever)
- CPG weight: 90% → 55% (first handoff ever)
The robot walks toward the wall. The ultrasonic sensor (Channel 18) fires. DA drops from 0.22 to 0.05. The obstacle climbing fiber activates asymmetrically. The cerebellum learns. After each collision, the robot is reset to the start — like a puppy’s owner picking it up after it bumps into furniture.
The Architecture
The obstacle avoidance system adds three new components to MH-FLOCKE’s biological stack:
Ultrasonic Sensor (Channel 18): Simulated HC-SR04 rangefinder in MuJoCo, real HC-SR04 on the Raspberry Pi. Same encoding in both: nonlinear proximity mapping (√ function for urgency). The sensor channel is identical between simulation and hardware, enabling direct brain transfer.
Obstacle Climbing Fiber: Three zones with asymmetric error signals — COLLISION (<10cm, strong CF on push PkC), DANGER (<30cm, graded asymmetric CF), WARNING (<80cm, hip yaw CF for turning).
Trigeminal Brake: A hardwired reflex that reduces CPG amplitude near obstacles. Without this, the CPG at 90% overpowers any cerebellar correction. The brake creates space for learning.
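A rough sketch of the sensor encoding and zone classification follows. The zone thresholds (10 cm, 30 cm, 80 cm) are taken from the description above. The exact shape of the square-root mapping, the 0.80 m normalization range, the function names, and the "CLEAR" label for anything beyond the warning zone are assumptions.

```python
import math

MAX_RANGE_M = 0.80   # assumption: normalize against the outer edge of the WARNING zone


def proximity(distance_m: float, max_range_m: float = MAX_RANGE_M) -> float:
    """Encode distance as urgency in [0, 1] using a square-root mapping."""
    clipped = min(max(distance_m, 0.0), max_range_m)
    closeness = 1.0 - clipped / max_range_m
    return math.sqrt(closeness)


def obstacle_zone(distance_m: float) -> str:
    """Classify a distance reading into the three climbing-fiber zones."""
    if distance_m < 0.10:
        return "COLLISION"
    if distance_m < 0.30:
        return "DANGER"
    if distance_m < 0.80:
        return "WARNING"
    return "CLEAR"       # hypothetical label for "no obstacle signal"


for d in (1.0, 0.6, 0.2, 0.05):
    print(d, round(proximity(d), 3), obstacle_zone(d))
```

Under this particular mapping the urgency rises steeply as soon as an obstacle enters range, so a reading at 60 cm already yields a value of about 0.5.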
What This Means
This is the first time in MH-FLOCKE’s history that the SNN has produced non-zero motor corrections that actually changed the robot’s behavior. The CPG handoff — from 90% to 55% — means the spiking neural network is taking over motor control from the innate rhythm generator.
The graded DCN fix alone improves every training run, not just obstacle scenes. Any user cloning the repo now gets a cerebellum that actually works.
Try It
git clone https://github.com/MarcHesse/mhflocke
cd mhflocke
# Obstacle avoidance (Freenove, 20k steps)
python scripts/train_v032.py --creature-name freenove \
--scene "walk toward wall" --steps 20000 \
--no-terrain --no-sensory --no-vision --hardware-sensors \
--auto-reset 500 --fresh
# Normal walking (Go2, flat)
python scripts/train_v032.py --creature-name go2 \
--scene "walk on flat meadow" --steps 50000 --no-terrain
What’s Next
The 50k run with episodic wall training, the same experiment on the Go2 (4,500 neurons), and eventually on the real Freenove hardware with the HC-SR04 sensor. The wall is just the beginning — the architecture now supports any binary sensory signal → cerebellar correction loop.
Named after Flocke — my dog who never needed 9 bug fixes to avoid a wall.

Why We Replaced Reward Shaping with Free Energy
Architecture Deep Dive · March 22, 2026 · MH-FLOCKE Level 15 v0.4.x
Every reinforcement learning tutorial starts the same way: define a reward function. Want the robot to walk? Reward forward velocity. Want it to reach a target? Reward proximity. Want it to stay upright? Penalize falling.
It works. PPO, SAC, and TD3 can solve locomotion tasks in hours. But there’s a problem that becomes obvious the moment you try to build something that actually behaves like an animal: reward functions are lies we tell the optimizer.
MH-FLOCKE doesn’t use reward shaping. It uses Free Energy — a framework from computational neuroscience that turns prediction errors into action. Here’s why that matters, and what it took to make it work.
The Problem with Rewards
A reward function encodes what the designer wants, not what the agent understands. When you write reward = forward_velocity * 0.5 - torque_penalty * 0.01 + alive_bonus * 1.0, you’re injecting your knowledge of physics, biomechanics, and task structure into a scalar signal. The agent never learns why moving forward is good. It learns that a particular number goes up when certain joint angles coincide with certain body velocities.
This creates three specific problems:
Reward hacking. The agent finds ways to maximize the number that have nothing to do with the intended behavior: a walking robot discovers it can collect alive_bonus by vibrating in place, or a ball-chasing agent orbits the ball at exactly the distance where reward is maximized without ever touching it.
Brittle transfer. Change the terrain, the body, or the task even slightly, and the carefully tuned reward weights collapse. A reward function tuned for flat ground produces bizarre gaits on slopes because the relative importance of balance vs. speed shifts — but the weights don’t.
No intrinsic motivation. Turn off the reward, and the agent stops. It has no reason to explore, no curiosity, no drive. In biological systems, animals explore even without external reward because the nervous system is fundamentally organized around reducing prediction error — not maximizing an external signal.
Free Energy: Prediction Error as the Universal Currency
The Free Energy Principle, formulated by Karl Friston, proposes that biological systems minimize the difference between what they predict and what they observe. This isn’t a reward — it’s an error signal. The organism builds a generative model of its world and acts to make that model’s predictions come true.
In MH-FLOCKE, this translates to a concrete mechanism. The system maintains predictions about its sensory states — joint angles, body orientation, distance to objects. When reality deviates from prediction, that deviation becomes the prediction error (PE). The system then has two options: update its model (perception) or act to change the world (action).
The key insight: you don’t need to tell the system what’s good. You need to tell it what to expect. If the system expects to be near the ball, being far from the ball creates prediction error. The system will act to reduce that error — not because it’s been rewarded for approaching, but because the discrepancy between expectation and reality is aversive at a fundamental computational level.
Implementation: Task-Specific Prediction Error
The abstract principle needed concrete engineering. Here’s how Free Energy works in MH-FLOCKE’s code.
The brain computes a Task-Specific Prediction Error (TPE) every simulation tick:
TPE = (ball_distance - expected_distance) / normalization_factor
This TPE feeds into three systems simultaneously:
1. The SNN learning rule. R-STDP modulates synaptic plasticity based on a combination of reward and prediction error: modulation = 0.1 × reward + 0.9 × (−PE). When the dog approaches the ball, PE falls, so −PE rises and the modulation becomes more positive; synapses that contributed to the approach get strengthened. The 90/10 split means prediction error dominates: the system learns primarily from its own internal error signal, not from external reward.
2. The Vision Boost. When TPE exceeds a threshold, the last 16 input neurons — carrying environmental sensory information — get amplified proportional to the error magnitude. This is biological attention: unexpected stimuli become more salient. The dog literally pays more attention to the ball when its predictions about ball distance are wrong.
3. Neuromodulation. TPE drives dopamine release in the simulated neuromodulatory system. High positive PE (far from expected position) triggers exploration via norepinephrine. Decreasing PE triggers dopamine, reinforcing the current behavioral strategy. This creates a natural explore-exploit balance without epsilon-greedy or entropy regularization.
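The TPE formula and the R-STDP modulation term can be written out in a few lines. This is a sketch for illustration: the function names are hypothetical, and the default expected distance and normalization of 3.0 are the values quoted in the ball-contact dev log rather than something confirmed for every scenario.

```python
def task_prediction_error(ball_distance: float,
                          expected_distance: float = 3.0,
                          normalization: float = 3.0) -> float:
    """TPE = (ball_distance - expected_distance) / normalization_factor."""
    return (ball_distance - expected_distance) / normalization


def rstdp_modulation(reward: float, pe: float,
                     reward_weight: float = 0.1, pe_weight: float = 0.9) -> float:
    """modulation = 0.1 * reward + 0.9 * (-PE)."""
    return reward_weight * reward + pe_weight * (-pe)


for dist in (4.5, 3.0, 1.5):
    tpe = task_prediction_error(dist)
    print(dist, tpe, rstdp_modulation(reward=0.0, pe=tpe))
```

With zero external reward, the modulation is negative when the ball is farther than expected and positive when it is closer, which is the gradient the learning rule follows.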
What We Lost (and What We Gained)
Free Energy is not free. Compared to PPO with a well-tuned reward function, here’s what changed:
Lost: Speed of convergence. PPO can solve ball-approach in 50k steps with a dense reward. MH-FLOCKE needs 100k steps with the curriculum. The prediction error gradient is weaker than a hand-designed reward — the signal-to-noise ratio is lower because the system has to discover the relevance of its own error signals.
Lost: Simplicity. A reward function is 5 lines of code. The Free Energy implementation spans the SNN controller, the vision boost module, the neuromodulatory system, and the R-STDP learning rule. It’s distributed across the architecture, not centralized in one function.
Gained: Robustness. The 10-seed ablation study showed that MH-FLOCKE’s variance across seeds is dramatically lower than PPO. When it works, it works consistently — because the learning signal comes from internal prediction dynamics, not from the accident of which random seed produces a favorable initial exploration trajectory.
Gained: Emergent behavior. The dog developed behavioral sequences — sniff → walk → trot → chase → alert — that were never programmed and never rewarded. They emerged because the prediction error landscape naturally creates behavioral attractors. When the ball is far, prediction error is high, driving fast locomotion. When close, PE drops, and the gait naturally slows. The transitions aren’t state-machine logic — they’re the dynamics of a system minimizing its own surprise.
Gained: Transfer potential. The same Free Energy architecture that drives ball approach also drives obstacle avoidance, terrain adaptation, and righting after falls. Change the prediction (expect flat ground → encounter a slope), and the system adapts — not because we wrote a slope-reward, but because the prediction error automatically captures the relevant discrepancy.
The Honest Result
Our ablation study produced one genuinely negative finding: motivational drives (hunger, curiosity, social) don’t significantly improve locomotion quality. Configuration B (SNN + Cerebellum, no drives) performs identically to Configuration C (SNN + Cerebellum + drives). The drives affect navigation — which direction the dog goes — but not how well it walks.
This is a real limitation. Free Energy as implemented in MH-FLOCKE is primarily a navigation framework, not a locomotion framework. The actual walking comes from CPGs and the cerebellar forward model. Free Energy tells the dog where to go, not how to move its legs.
In biological systems, these aren’t separate — the motivation to move and the mechanics of movement are deeply intertwined through spinal-cortical loops. MH-FLOCKE’s current architecture treats them as modular, which is both its engineering strength and its biological weakness.
What’s Next
The next step is closing the loop: letting Free Energy modulate not just navigation but gait selection. When prediction error is high (ball is far, terrain is rough), the system should shift to a more cautious gait. When PE is low (ball is close, ground is flat), it should accelerate. The CPG already supports multiple gaits — the missing piece is using prediction error to select between them.
But the core insight stands: you don’t need to tell a system what’s good. You need to give it the ability to predict, and the drive to minimize the gap between prediction and reality. Everything else — approach, avoidance, exploration, caution — emerges from the dynamics of a system that hates being surprised.
MH-FLOCKE is an independent research project by Marc Hesse in Potsdam, Germany. Read the full technical details in our research paper or watch the latest results on YouTube.

Ball Contact — What 4 Changes Made It Work
Dev Log #1 · March 22, 2026 · MH-FLOCKE Level 15 v0.4.x
For weeks, the dog walked beautifully but ignored the ball completely. It would stroll past it, around it, occasionally bump into it by accident — but never pursue it. The spiking neural network was learning to walk. It just had no reason to care about a red sphere sitting on the grass.
Then, in a single 100k-step training run, everything changed. The Go2 quadruped turned toward the ball, approached it deliberately, and made contact — 294 frames of sustained ball interaction, with a minimum distance of 0.8 centimeters.
No reward shaping. No hardcoded “go to ball” command. Four architectural changes made a biologically grounded system do something that PPO with dense rewards still struggles with.
Here’s what happened.
The Problem: Walking Without Purpose
MH-FLOCKE’s brain runs a 15-step cognitive cycle every simulation tick. Spiking neurons fire. The cerebellum predicts motor outcomes. Central pattern generators produce rhythmic gaits. Neuromodulators shift between exploration and exploitation.
But all of this was happening in a closed loop. The SNN received sensory input that included ball distance and angle — the information was there. The network just had no gradient to follow. Ball distance was one of 80+ input dimensions, buried in proprioceptive noise. The R-STDP learning rule couldn’t distinguish “getting closer to the ball” from random fluctuation.
The system needed a way to feel that the ball matters.
Change 1: Task-Specific Prediction Error
Instead of using a generic reward signal, I introduced a task-specific prediction error (TPE) that directly encodes “how far am I from where I should be”:
TPE = (ball_dist - 3.0) / 3.0
When the dog is 3 meters from the ball, TPE is 0, which is neutral. Closer than 3 meters, TPE goes negative: the world is better than expected. Farther away, TPE grows positive: something is wrong.
This is not a reward. It’s a prediction error in the Free Energy sense: the system expects to be near the ball (because that’s where interesting things happen), and any deviation from that expectation creates a signal to act.
The critical difference from reward shaping: TPE doesn’t tell the dog what to do. It tells the dog how surprised it should be.
Change 2: Vision Boost
The TPE signal alone wasn’t enough. The SNN has 80+ input neurons, and the ball-related inputs (distance, angle) were getting drowned out by proprioceptive signals — joint angles, velocities, IMU readings. The network couldn’t hear the ball over the noise of its own body.
The fix: when TPE exceeds a threshold (0.05), the last 16 input neurons — the ones carrying sensory/environmental information — get amplified by TPE × 0.5. Higher prediction error means louder sensory input.
This mirrors how biological attention works: when something is unexpected, sensory cortex activity increases. The salience of the stimulus goes up proportional to how wrong your predictions are.
The effect was immediate. The SNN started responding to ball distance changes within the first 10k steps.
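One way to read the Vision Boost rule is sketched here. The threshold (0.05), the gain (0.5), and the 16 sensory inputs come from the text. Two points are assumptions: that "amplified by TPE × 0.5" means scaling by a factor of 1 + 0.5 × TPE, and that the threshold applies to the signed TPE rather than its magnitude. The function name is also hypothetical.

```python
from typing import List


def apply_vision_boost(inputs: List[float], tpe: float,
                       threshold: float = 0.05, gain: float = 0.5,
                       n_sensory: int = 16) -> List[float]:
    """Amplify the last n_sensory inputs when the prediction error exceeds the threshold."""
    boosted = list(inputs)                  # leave the caller's list untouched
    if tpe > threshold:
        factor = 1.0 + gain * tpe           # assumed interpretation of "TPE x 0.5"
        start = len(boosted) - n_sensory
        for i in range(max(start, 0), len(boosted)):
            boosted[i] *= factor
    return boosted


inputs = [1.0] * 80                         # illustrative: 80 input neurons
out = apply_vision_boost(inputs, tpe=0.4)
print(out[0], out[-1])
```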
Change 3: R-STDP Sign Fix
This was the most embarrassing bug. The R-STDP learning rule combines reward and prediction error:
combined = 0.1 × reward + 0.9 × (−PE)
The minus sign on PE is critical. When the dog approaches the ball, PE decreases (less surprise), so −PE increases. Once the dog is closer than expected, PE is negative and −PE is positive, which means approaching creates positive reinforcement for the synapses that were active during that movement.
The original code had the sign flipped. Approaching the ball was punishing the very synapses that caused the approach. The SNN was literally learning to avoid the ball.
One minus sign. Weeks of debugging.
Change 4: Ball Curriculum
Even with correct gradients, dropping a ball 3 meters away at a random angle is too hard for a system that just learned to walk. The solution: a 5-stage curriculum.
Stage 1 starts the ball at 1.5 meters, directly ahead (0° angle). The dog barely has to turn — just walk forward. When ball_dist_min drops below 0.5 meters, the curriculum advances.
Each stage increases distance and angle: (1.5m, 0°) → (2.0m, 17°) → (2.5m, 23°) → (2.7m, 28°) → (3.0m, 34°).
In the 100k-step run, the dog advanced through two stages. It mastered straight-ahead approach, then learned to turn slightly before approaching. The curriculum let the SNN build on what it already knew.
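The curriculum logic is simple enough to sketch directly. The five (distance, angle) stages and the 0.5 meter advancement threshold are from the description above. The class and method names are hypothetical, and applying the same 0.5 meter threshold at every stage is an assumption.

```python
STAGES = [(1.5, 0.0), (2.0, 17.0), (2.5, 23.0), (2.7, 28.0), (3.0, 34.0)]  # (meters, degrees)
ADVANCE_DISTANCE_M = 0.5   # assumed to apply to every stage


class BallCurriculum:
    def __init__(self) -> None:
        self.stage = 0

    def current(self) -> tuple:
        """Return the (distance, angle) used to place the ball at the current stage."""
        return STAGES[self.stage]

    def report(self, ball_dist_min: float) -> bool:
        """Advance one stage if the dog got close enough; return True on advancement."""
        if ball_dist_min < ADVANCE_DISTANCE_M and self.stage < len(STAGES) - 1:
            self.stage += 1
            return True
        return False


curriculum = BallCurriculum()
print(curriculum.current())
curriculum.report(ball_dist_min=0.3)
print(curriculum.current())
```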
Results
The numbers from the run:
- 0.8 cm minimum ball distance — the dog essentially touched it
- 294 contact frames — sustained interaction, not a single bump
- 0 falls in 100k steps — stable locomotion throughout
- 47 ball contact episodes across 5 curriculum stages
- CPG at 40% — the dog was trotting, not sprinting
The 10-seed ablation study confirmed this wasn’t a fluke. Configuration B (SNN + Cerebellum) outperforms the PPO baseline by 3.5× on ball approach metrics, with significantly lower variance.
What This Means
This is not a robot dog playing fetch. It’s a proof of concept for something deeper: a biologically grounded system that develops goal-directed behavior through prediction error minimization, not through reward engineering.
The dog doesn’t get a treat for touching the ball. It touches the ball because touching the ball reduces prediction error. The ball is interesting because the system expects it to be interesting — and the Free Energy framework turns that expectation into action.
Four changes. One minus sign. A robot dog that learned to care about a ball.
MH-FLOCKE is an independent research project by Marc Hesse in Potsdam, Germany. The system runs on a Unitree Go2 quadruped in MuJoCo simulation, using spiking neural networks, a cerebellar forward model, and central pattern generators.
Watch the full run: YouTube Video #3 · Read the paper: aiXiv
