Neural networks learning spatial reasoning through optimal path imitation
This project implements several custom techniques to address unique challenges in teaching neural networks to navigate mazes. The focus is on practical solutions that emerged from iterative development across multiple versions.
The standard approach would be a single-channel maze representation. This project instead uses a dual-channel input: Channel 0 encodes the grid structure, and Channel 1 provides the normalized Manhattan distance to the goal at every position. This gives the model explicit directional guidance without hard-coding pathfinding logic.
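As a rough illustration, a dual-channel encoder in TypeScript could look like the sketch below; the function and field names are illustrative, and the /5.0 and /20 normalization constants follow the normalization notes later in this document.

```typescript
// Minimal sketch of the dual-channel encoder (names are illustrative).
// Channel 0: normalized grid value; Channel 1: normalized Manhattan distance to the goal.
const SIZE = 10;

function encodeMaze(
  grid: number[][],                      // 10x10 grid with cell codes in 0..5
  goal: { row: number; col: number }
): number[][][] {
  const input: number[][][] = [];
  for (let r = 0; r < SIZE; r++) {
    const row: number[][] = [];
    for (let c = 0; c < SIZE; c++) {
      const gridValue = grid[r][c] / 5.0;                               // channel 0
      const distance = Math.abs(r - goal.row) + Math.abs(c - goal.col);
      row.push([gridValue, distance / 20]);                             // channel 1
    }
    input.push(row);
  }
  return input; // shape [10, 10, 2]; wrap with tf.tensor4d([input]) before inference
}
```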
Rather than filtering invalid actions after prediction, we apply -10.0 penalties to the logits before softmax for any action that would hit a wall or exit the grid. This guarantees the selected action is always physically valid, without any post-hoc filtering or retraining.
Models exhibit oscillation in corridors (UP→DOWN→UP→DOWN) because they don't track recent history. We apply a -3.0 penalty to the opposite of the last action taken. The penalty is smaller than the wall penalty because backtracking is sometimes necessary (dead ends).
A* paths from top-left to bottom-right naturally contain ~40-50% DOWN/RIGHT moves but only ~5-10% UP/LEFT. Naive training on this distribution causes severe directional bias. Our solution is to rebalance the training set so every direction appears with equal frequency (see the data balancing section below).
Visualizing CNN activations requires thresholding, but absolute thresholds fail across different maze configurations. We use 95th and 85th percentile thresholds computed per-inference, highlighting the top 5% and top 15% of activations regardless of absolute magnitude.
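One way this per-inference thresholding could look (helper names are hypothetical, not the project's code):

```typescript
// Compute a percentile cutoff from the current activation map only,
// so the highlight adapts to each maze instead of using a fixed absolute threshold.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

function highlightLevels(activations: number[]): ("strong" | "moderate" | "none")[] {
  const strongCut = percentile(activations, 95);   // top 5% of activations
  const moderateCut = percentile(activations, 85); // top 15% of activations
  return activations.map(a =>
    a >= strongCut ? "strong" : a >= moderateCut ? "moderate" : "none"
  );
}
```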
Training in the browser requires different tradeoffs than server-side training:
The same codebase supports both CNN and Transformer models. Key implementation details:
Both architectures solve the same problem but use completely different approaches to understanding spatial structure.
How it thinks: "I see images of mazes and learn local spatial patterns."
How it thinks: "Each cell is a token, and I learn which cells relate to each other."
We use 3×3 convolutional kernels specifically because that's the minimum size to detect corridors and turns. A 1×1 kernel can't see neighbors (no spatial context). A 5×5 kernel sees too much at once on a 10×10 grid (half the maze in the receptive field of the first layer). 3×3 provides exactly one step of lookahead in each direction.
One layer can only see 3×3 neighborhoods. Two layers see 5×5 neighborhoods (3×3 of 3×3s). This is critical for detecting patterns like "T-junctions" where the best move depends on context beyond immediate neighbors.
Standard practice is to double the filter count at each layer, which gives 16→32 here. We stick with this not for theoretical reasons but because it works empirically: 16→16 lacked capacity, 16→64 had too many parameters and overfit, and 8→16 underfit.
We use padding='same' on both conv layers to maintain 10×10 spatial dimensions throughout. Alternative 'valid' padding shrinks to 8×8 then 6×6, losing edge information where many interesting decisions happen (corners of the maze).
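Putting these choices together, a TensorFlow.js sketch of the model might look like the following; only the convolution settings come from the text, while the flatten/dense head, optimizer, and loss are assumptions.

```typescript
import * as tf from '@tensorflow/tfjs';

// Sketch of the CNN described above: two 3x3 conv layers (16 -> 32 filters),
// 'same' padding to keep the 10x10 spatial size, dual-channel input, 4 action outputs.
function buildCnn(): tf.Sequential {
  const model = tf.sequential();
  model.add(tf.layers.conv2d({
    inputShape: [10, 10, 2],  // channel 0: grid, channel 1: distance-to-goal
    filters: 16,
    kernelSize: 3,            // 3x3: one step of lookahead in each direction
    padding: 'same',
    activation: 'relu',
    name: 'conv1',
  }));
  model.add(tf.layers.conv2d({
    filters: 32,              // doubled filter count
    kernelSize: 3,
    padding: 'same',
    activation: 'relu',
    name: 'conv2',
  }));
  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({ units: 4, activation: 'softmax' })); // UP/DOWN/LEFT/RIGHT
  model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });
  return model;
}
```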
Grid values (0-5) are normalized by dividing by 5.0, putting them in [0, 1] range. Distance values are already normalized during generation (distance / 20). This ensures both channels have similar magnitudes.
We extract from conv2 (not conv1) because conv1 activations are too low-level (just edges), while conv2 shows "decision-relevant features." The extraction process is sketched below.
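Assuming the second conv layer is named 'conv2' (as in the model sketch above) and that the per-cell value shown is the mean over its feature maps, one way to probe it in TensorFlow.js:

```typescript
import * as tf from '@tensorflow/tfjs';

// Build a probe model that exposes conv2's output, then average it per cell.
function conv2ActivationMap(model: tf.LayersModel, input: tf.Tensor4D): number[][] {
  const probe = tf.model({
    inputs: model.inputs,
    outputs: model.getLayer('conv2').output,
  });
  return tf.tidy(() => {
    const activations = probe.predict(input) as tf.Tensor4D;      // [1, 10, 10, 32]
    const perCell = activations.mean(3).squeeze() as tf.Tensor2D;  // [10, 10]
    return perCell.arraySync();
  });
}
```

The resulting 10×10 map is what gets run through the percentile thresholding described earlier.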
What you'll typically see highlighted:
This visualization makes the "black box" neural network more interpretable. Instead of just seeing "the model chose RIGHT," you can see exactly which parts of the maze influenced that decision.
Before the neural network can learn, we need a "teacher" to show it what good navigation looks like. We use the A* (A-star) pathfinding algorithm, a classic AI search algorithm that is guaranteed to find the shortest path as long as its heuristic never overestimates the remaining distance (Manhattan distance on a grid satisfies this).
Imagine you're exploring a maze with a cost tracker and a compass:
A* always expands the position with the lowest f-score next, where f = g + h: the cost of the path so far plus the estimated distance remaining to the goal. This balances paths that are already short against paths that point toward the goal, and it guarantees finding the optimal path.
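For reference, a compact A* sketch over the 10×10 grid; walls are passed in as a boolean walkability grid because the project's exact cell encoding isn't reproduced here, and a simple sorted list stands in for a priority queue.

```typescript
type Pos = { row: number; col: number };

// Returns the shortest path from start to goal (inclusive), or null if none exists.
function astar(walkable: boolean[][], start: Pos, goal: Pos): Pos[] | null {
  const size = walkable.length;
  const key = (p: Pos) => p.row * size + p.col;
  const h = (p: Pos) => Math.abs(p.row - goal.row) + Math.abs(p.col - goal.col);

  const open: Pos[] = [start];
  const cameFrom = new Map<number, Pos>();
  const g = new Map<number, number>([[key(start), 0]]);

  while (open.length > 0) {
    // Expand the open position with the lowest f = g + h.
    open.sort((a, b) => (g.get(key(a))! + h(a)) - (g.get(key(b))! + h(b)));
    const current = open.shift()!;
    if (current.row === goal.row && current.col === goal.col) {
      const path: Pos[] = [current];
      while (cameFrom.has(key(path[0]))) path.unshift(cameFrom.get(key(path[0]))!);
      return path;
    }
    for (const [dr, dc] of [[-1, 0], [1, 0], [0, -1], [0, 1]]) {
      const next = { row: current.row + dr, col: current.col + dc };
      if (next.row < 0 || next.row >= size || next.col < 0 || next.col >= size) continue;
      if (!walkable[next.row][next.col]) continue;
      const tentative = g.get(key(current))! + 1; // every step costs 1
      if (tentative < (g.get(key(next)) ?? Infinity)) {
        g.set(key(next), tentative);
        cameFrom.set(key(next), current);
        open.push(next);
      }
    }
  }
  return null;
}
```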
Once A* finds the optimal path from start to goal, we create one training sample for each step along the path. For example, if the path goes from position (2,3) to (2,4), we create a sample with input = maze state at (2,3) and output = RIGHT action.
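A sketch of that conversion; the action indices (0=UP, 1=DOWN, 2=LEFT, 3=RIGHT) and the encodeState callback are illustrative conventions rather than the project's actual interfaces.

```typescript
type Sample = { input: number[][][]; action: number };

// Turn each step of an A* path into one (state, action) training pair.
// `encodeState(agent)` should return the dual-channel encoding with the agent at `agent`.
function pathToSamples(
  path: { row: number; col: number }[],
  encodeState: (agent: { row: number; col: number }) => number[][][]
): Sample[] {
  const samples: Sample[] = [];
  for (let i = 0; i < path.length - 1; i++) {
    const from = path[i];
    const to = path[i + 1];
    let action: number;
    if (to.row < from.row) action = 0;       // UP
    else if (to.row > from.row) action = 1;  // DOWN
    else if (to.col < from.col) action = 2;  // LEFT
    else action = 3;                         // RIGHT
    samples.push({ input: encodeState(from), action });
  }
  return samples;
}
```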
We generate 200 random mazes and run A* on each one, typically producing around 3,000-4,000 training samples total.
Since mazes are generated with the start in the top-left and goal in the bottom-right, optimal paths naturally contain far more DOWN and RIGHT moves than UP and LEFT. If we train on this imbalanced data, the model learns a simple but wrong strategy: "Just always go down-right and you'll usually be correct!"
Before balancing:

| Direction | Training samples |
|-----------|------------------|
| UP | 200 (5%) |
| DOWN | 1500 (40%) |
| LEFT | 300 (8%) |
| RIGHT | 1700 (47%) |

Heavily biased toward DOWN/RIGHT.

After balancing:

| Direction | Training samples |
|-----------|------------------|
| UP | 900 (25%) |
| DOWN | 900 (25%) |
| LEFT | 900 (25%) |
| RIGHT | 900 (25%) |

Perfectly balanced distribution.
Result: The model now sees each direction with equal frequency during training, learning to navigate in all directions rather than just memorizing "go down-right."
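A sketch of one way to do the balancing; whether the real pipeline oversamples, undersamples, or both is not spelled out here, so this version resamples every direction to a common target count and then shuffles.

```typescript
// Resample each action class to the same target count, then shuffle.
function balanceByAction<T extends { action: number }>(samples: T[], numActions = 4): T[] {
  const byAction: T[][] = Array.from({ length: numActions }, () => []);
  for (const s of samples) byAction[s.action].push(s);

  const target = Math.round(samples.length / numActions); // e.g. ~900 per direction
  const balanced: T[] = [];
  for (const group of byAction) {
    if (group.length === 0) continue;
    for (let i = 0; i < target; i++) {
      balanced.push(group[i % group.length]); // repeats samples from under-represented classes
    }
  }
  // Fisher-Yates shuffle so batches mix directions.
  for (let i = balanced.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [balanced[i], balanced[j]] = [balanced[j], balanced[i]];
  }
  return balanced;
}
```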
With our balanced dataset ready, we train the model using supervised learning. The network sees thousands of examples of "in this maze state, the optimal action is X" and gradually learns the patterns.
This process repeats for all batches across 20 epochs. You'll see the loss decrease and accuracy increase as the model learns the navigation patterns. Typically, the CNN converges around epoch 15-20 with 85-95% accuracy.
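In TensorFlow.js terms, the training step might look roughly like this; the epoch count comes from the text, while the batch size and tensor handling are assumptions.

```typescript
import * as tf from '@tensorflow/tfjs';

// Supervised training on the balanced (state, action) pairs.
async function train(model: tf.LayersModel, inputs: number[][][][], actions: number[]) {
  const xs = tf.tensor4d(inputs);                          // [N, 10, 10, 2]
  const ys = tf.oneHot(tf.tensor1d(actions, 'int32'), 4);  // [N, 4]
  await model.fit(xs, ys, {
    epochs: 20,
    batchSize: 32,   // assumption
    shuffle: true,
    callbacks: {
      onEpochEnd: (epoch, logs) => console.log(`epoch ${epoch + 1}`, logs),
    },
  });
  xs.dispose();
  ys.dispose();
}
```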
After training, when we test the model on a new maze, it repeats a simple loop at each position: encode the current state as the dual-channel input, run a forward pass to score the four actions, apply the safety adjustments described below, and take the highest-scoring valid move.
Raw neural network predictions aren't perfect. Sometimes the model predicts a move into a wall or suggests backtracking. We add two intelligent layers that fix these issues without retraining:
Problem: The model sometimes predicts moving into a wall or off the grid edge.
Solution: Before selecting an action, we check which moves are physically valid. For any action that would hit a wall or go out of bounds, we apply a large penalty (-10.0) to its probability score.
Example:
Raw: [UP: 0.25, DOWN: 0.40, LEFT: 0.20, RIGHT: 0.15]
Wall to the left detected!
After: [UP: 0.25, DOWN: 0.40, LEFT: -9.80, RIGHT: 0.15]
→ DOWN selected (highest valid score)
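A sketch of this masking step; the isValidMove check stands in for the project's actual collision test.

```typescript
const WALL_PENALTY = -10.0;

// Apply the wall penalty to every action that is not physically possible.
function maskInvalidActions(
  scores: number[],                        // [UP, DOWN, LEFT, RIGHT]
  isValidMove: (action: number) => boolean
): number[] {
  return scores.map((s, action) => (isValidMove(action) ? s : s + WALL_PENALTY));
}

// Example from above: a wall to the left drops LEFT from 0.20 to -9.80.
const masked = maskInvalidActions([0.25, 0.40, 0.20, 0.15], action => action !== 2);
const best = masked.indexOf(Math.max(...masked)); // 1 -> DOWN
```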
Problem: In narrow corridors, the model sometimes oscillates back and forth (UP → DOWN → UP → DOWN...).
Solution: We track the last action taken. If the model considers the opposite direction (UP↔DOWN or LEFT↔RIGHT), we apply a moderate penalty (-3.0) to discourage but not forbid backtracking.
Example:
Last move: UP
Raw: [UP: 0.10, DOWN: 0.45, LEFT: 0.25, RIGHT: 0.20]
After: [UP: 0.10, DOWN: -2.55, LEFT: 0.25, RIGHT: 0.20]
→ LEFT selected (avoids backtracking to DOWN)
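And a matching sketch for the anti-backtracking penalty:

```typescript
const BACKTRACK_PENALTY = -3.0;
const OPPOSITE = [1, 0, 3, 2]; // opposite of UP, DOWN, LEFT, RIGHT

// Penalize the action that would undo the previous move, without forbidding it.
function penalizeBacktracking(scores: number[], lastAction: number | null): number[] {
  if (lastAction === null) return scores;
  return scores.map((s, action) =>
    action === OPPOSITE[lastAction] ? s + BACKTRACK_PENALTY : s
  );
}

// Example from above: last move UP, so DOWN drops from 0.45 to -2.55 and LEFT wins.
const adjusted = penalizeBacktracking([0.10, 0.45, 0.25, 0.20], 0);
```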
These two techniques dramatically improve navigation success rate without requiring additional training data or model changes. They're like "safety rails" that guide the neural network's decisions.
Note: The Transformer is experimental and trains much slower than the CNN. However, its attention mechanism provides fascinating insights into how models learn spatial relationships.
Unlike CNNs that use fixed 3×3 windows, transformers use attention mechanisms to dynamically decide which parts of the input to focus on. Think of attention as answering the question: "When making a decision at position A, which other positions should I pay attention to?"
Imagine each maze cell as a student in a classroom. The attention mechanism works like this:
The attention score between two positions measures their relevance: high scores mean "these positions should influence each other," low scores mean "these positions are unrelated for this decision."
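A single-head, scaled dot-product sketch of that score computation; the shapes and names are illustrative, and the multi-head version described next simply runs several of these in parallel.

```typescript
import * as tf from '@tensorflow/tfjs';

// q, k, v: [numCells, dim] matrices, one row per maze cell (100 cells for a 10x10 grid).
function attention(q: tf.Tensor2D, k: tf.Tensor2D, v: tf.Tensor2D) {
  return tf.tidy(() => {
    const dim = q.shape[1];
    // scores[i][j]: how relevant cell j is when making a decision at cell i.
    const scores = tf.matMul(q, k, false, true).div(Math.sqrt(dim));
    const weights = tf.softmax(scores);    // each row sums to 1
    const output = tf.matMul(weights, v);  // weighted mix of value vectors
    return { weights, output };
  });
}
```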
Instead of one attention mechanism, we use four parallel "heads", each learning different types of relationships. For example, one head might focus on the current agent position, another on the goal location, another on nearby obstacles, and another on corridor structure. These specializations are not explicitly programmed but emerge during training.
During testing with the Transformer, the yellow and blue highlights show attention weights averaged across all positions and heads. This reveals which cells the model considers important for navigation decisions.
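A sketch of that averaging step, assuming the attention weights come out with shape [heads, cells, cells]:

```typescript
import * as tf from '@tensorflow/tfjs';

// Collapse per-head attention into one 10x10 heatmap for the overlay.
function attentionHeatmap(attnWeights: tf.Tensor3D, size = 10): number[][] {
  return tf.tidy(() => {
    const perCell = attnWeights.mean([0, 1]); // average over heads and query positions
    return perCell.reshape([size, size]).arraySync() as number[][];
  });
}
```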
High attention usually concentrates on the agent, goal, and key decision points. Unlike CNN activations (which are feature-based), attention weights directly show relational importance between positions.