Neural networks learning spatial reasoning through optimal path imitation
This project implements several custom techniques to address unique challenges in teaching neural networks to navigate mazes. The focus is on practical solutions that emerged from iterative development across multiple versions.
The standard approach would be a single-channel maze representation. This project instead uses a dual-channel input: Channel 0 encodes the grid structure, and Channel 1 provides the normalized Manhattan distance to the goal at every position. This gives the model explicit directional guidance without hard-coding pathfinding logic.
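As a rough illustration, a dual-channel encoder in TypeScript could look like the sketch below; the function and field names are illustrative, and the /5.0 and /20 normalization constants follow the normalization notes later in this document.

```typescript
// Minimal sketch of the dual-channel encoder (names are illustrative).
// Channel 0: normalized grid value; Channel 1: normalized Manhattan distance to the goal.
const SIZE = 10;

function encodeMaze(
  grid: number[][],                      // 10x10 grid with cell codes in 0..5
  goal: { row: number; col: number }
): number[][][] {
  const input: number[][][] = [];
  for (let r = 0; r < SIZE; r++) {
    const row: number[][] = [];
    for (let c = 0; c < SIZE; c++) {
      const gridValue = grid[r][c] / 5.0;                               // channel 0
      const distance = Math.abs(r - goal.row) + Math.abs(c - goal.col);
      row.push([gridValue, distance / 20]);                             // channel 1
    }
    input.push(row);
  }
  return input; // shape [10, 10, 2]; wrap with tf.tensor4d([input]) before inference
}
```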
Rather than filtering invalid actions after prediction, we apply -10.0 penalties to the logits before softmax for any action that would hit a wall or exit the grid. This guarantees the selected action is always physically valid, without any post-hoc filtering or retraining.
Models exhibit oscillation in corridors (UP→DOWN→UP→DOWN) because they don't track recent history. We apply a -3.0 penalty to the opposite of the last action taken. The penalty is smaller than the wall penalty because backtracking is sometimes necessary (dead ends).
A* paths from top-left to bottom-right naturally contain ~40-50% DOWN/RIGHT moves but only ~5-10% UP/LEFT. Naive training on this distribution causes severe directional bias. Our solution is to rebalance the training set so every direction appears with equal frequency (see the data balancing section below).
Visualizing CNN activations requires thresholding, but absolute thresholds fail across different maze configurations. We use 95th and 85th percentile thresholds computed per-inference, highlighting the top 5% and top 15% of activations regardless of absolute magnitude.
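One way this per-inference thresholding could look (helper names are hypothetical, not the project's code):

```typescript
// Compute a percentile cutoff from the current activation map only,
// so the highlight adapts to each maze instead of using a fixed absolute threshold.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

function highlightLevels(activations: number[]): ("strong" | "moderate" | "none")[] {
  const strongCut = percentile(activations, 95);   // top 5% of activations
  const moderateCut = percentile(activations, 85); // top 15% of activations
  return activations.map(a =>
    a >= strongCut ? "strong" : a >= moderateCut ? "moderate" : "none"
  );
}
```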
Training in the browser requires different tradeoffs than server-side training:
The same codebase supports both CNN and Transformer models. Key implementation details:
Both architectures solve the same problem but use completely different approaches to understanding spatial structure.
How it thinks: "I see images of mazes and learn local spatial patterns."
How it thinks: "Each cell is a token, and I learn which cells relate to each other."
We use 3×3 convolutional kernels specifically because that's the minimum size to detect corridors and turns. A 1×1 kernel can't see neighbors (no spatial context). A 5×5 kernel sees too much at once on a 10×10 grid (half the maze in the receptive field of the first layer). 3×3 provides exactly one step of lookahead in each direction.
One layer can only see 3×3 neighborhoods. Two layers see 5×5 neighborhoods (3×3 of 3×3s). This is critical for detecting patterns like "T-junctions" where the best move depends on context beyond immediate neighbors.
Standard practice is to double the filter count at each layer, which gives 16→32 here. We stick with this not for theoretical reasons but because it works empirically: 16→16 lacked capacity, 16→64 had too many parameters and overfit, and 8→16 underfit.
We use padding='same' on both conv layers to maintain 10×10 spatial dimensions throughout. Alternative 'valid' padding shrinks to 8×8 then 6×6, losing edge information where many interesting decisions happen (corners of the maze).
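Putting these choices together, a TensorFlow.js sketch of the model might look like the following; only the convolution settings come from the text, while the flatten/dense head, optimizer, and loss are assumptions.

```typescript
import * as tf from '@tensorflow/tfjs';

// Sketch of the CNN described above: two 3x3 conv layers (16 -> 32 filters),
// 'same' padding to keep the 10x10 spatial size, dual-channel input, 4 action outputs.
function buildCnn(): tf.Sequential {
  const model = tf.sequential();
  model.add(tf.layers.conv2d({
    inputShape: [10, 10, 2],  // channel 0: grid, channel 1: distance-to-goal
    filters: 16,
    kernelSize: 3,            // 3x3: one step of lookahead in each direction
    padding: 'same',
    activation: 'relu',
    name: 'conv1',
  }));
  model.add(tf.layers.conv2d({
    filters: 32,              // doubled filter count
    kernelSize: 3,
    padding: 'same',
    activation: 'relu',
    name: 'conv2',
  }));
  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({ units: 4, activation: 'softmax' })); // UP/DOWN/LEFT/RIGHT
  model.compile({ optimizer: 'adam', loss: 'categoricalCrossentropy', metrics: ['accuracy'] });
  return model;
}
```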
Grid values (0-5) are normalized by dividing by 5.0, putting them in [0, 1] range. Distance values are already normalized during generation (distance / 20). This ensures both channels have similar magnitudes.
We extract from conv2 (not conv1) because conv1 activations are too low-level (just edges), while conv2 shows "decision-relevant features." The extraction process is sketched below.
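Assuming the second conv layer is named 'conv2' (as in the model sketch above) and that the per-cell value shown is the mean over its feature maps, one way to probe it in TensorFlow.js:

```typescript
import * as tf from '@tensorflow/tfjs';

// Build a probe model that exposes conv2's output, then average it per cell.
function conv2ActivationMap(model: tf.LayersModel, input: tf.Tensor4D): number[][] {
  const probe = tf.model({
    inputs: model.inputs,
    outputs: model.getLayer('conv2').output,
  });
  return tf.tidy(() => {
    const activations = probe.predict(input) as tf.Tensor4D;      // [1, 10, 10, 32]
    const perCell = activations.mean(3).squeeze() as tf.Tensor2D;  // [10, 10]
    return perCell.arraySync();
  });
}
```

The resulting 10×10 map is what gets run through the percentile thresholding described earlier.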
What you'll typically see highlighted:
This visualization makes the "black box" neural network more interpretable. Instead of just seeing "the model chose RIGHT," you can see exactly which parts of the maze influenced that decision.
Before the neural network can learn, we need a "teacher" to show it what good navigation looks like. We use the A* (A-star) pathfinding algorithm, a classic AI search algorithm that is guaranteed to find the shortest path as long as its heuristic never overestimates the remaining distance (Manhattan distance on a grid satisfies this).
Imagine you're exploring a maze with a cost tracker and a compass:
A* always expands the position with the lowest f-score next, where f = g + h: the cost of the path so far plus the estimated distance remaining to the goal. This balances paths that are already short against paths that point toward the goal, and it guarantees finding the optimal path.
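For reference, a compact A* sketch over the 10×10 grid; walls are passed in as a boolean walkability grid because the project's exact cell encoding isn't reproduced here, and a simple sorted list stands in for a priority queue.

```typescript
type Pos = { row: number; col: number };

// Returns the shortest path from start to goal (inclusive), or null if none exists.
function astar(walkable: boolean[][], start: Pos, goal: Pos): Pos[] | null {
  const size = walkable.length;
  const key = (p: Pos) => p.row * size + p.col;
  const h = (p: Pos) => Math.abs(p.row - goal.row) + Math.abs(p.col - goal.col);

  const open: Pos[] = [start];
  const cameFrom = new Map<number, Pos>();
  const g = new Map<number, number>([[key(start), 0]]);

  while (open.length > 0) {
    // Expand the open position with the lowest f = g + h.
    open.sort((a, b) => (g.get(key(a))! + h(a)) - (g.get(key(b))! + h(b)));
    const current = open.shift()!;
    if (current.row === goal.row && current.col === goal.col) {
      const path: Pos[] = [current];
      while (cameFrom.has(key(path[0]))) path.unshift(cameFrom.get(key(path[0]))!);
      return path;
    }
    for (const [dr, dc] of [[-1, 0], [1, 0], [0, -1], [0, 1]]) {
      const next = { row: current.row + dr, col: current.col + dc };
      if (next.row < 0 || next.row >= size || next.col < 0 || next.col >= size) continue;
      if (!walkable[next.row][next.col]) continue;
      const tentative = g.get(key(current))! + 1; // every step costs 1
      if (tentative < (g.get(key(next)) ?? Infinity)) {
        g.set(key(next), tentative);
        cameFrom.set(key(next), current);
        open.push(next);
      }
    }
  }
  return null;
}
```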
Once A* finds the optimal path from start to goal, we create one training sample for each step along the path. For example, if the path goes from position (2,3) to (2,4), we create a sample with input = maze state at (2,3) and output = RIGHT action.
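A sketch of that conversion; the action indices (0=UP, 1=DOWN, 2=LEFT, 3=RIGHT) and the encodeState callback are illustrative conventions rather than the project's actual interfaces.

```typescript
type Sample = { input: number[][][]; action: number };

// Turn each step of an A* path into one (state, action) training pair.
// `encodeState(agent)` should return the dual-channel encoding with the agent at `agent`.
function pathToSamples(
  path: { row: number; col: number }[],
  encodeState: (agent: { row: number; col: number }) => number[][][]
): Sample[] {
  const samples: Sample[] = [];
  for (let i = 0; i < path.length - 1; i++) {
    const from = path[i];
    const to = path[i + 1];
    let action: number;
    if (to.row < from.row) action = 0;       // UP
    else if (to.row > from.row) action = 1;  // DOWN
    else if (to.col < from.col) action = 2;  // LEFT
    else action = 3;                         // RIGHT
    samples.push({ input: encodeState(from), action });
  }
  return samples;
}
```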
We generate 200 random mazes and run A* on each one, typically producing around 3,000-4,000 training samples total.
Since mazes are generated with the start in the top-left and goal in the bottom-right, optimal paths naturally contain far more DOWN and RIGHT moves than UP and LEFT. If we train on this imbalanced data, the model learns a simple but wrong strategy: "Just always go down-right and you'll usually be correct!"
Before balancing:

| Direction | Training samples |
|-----------|------------------|
| UP | 200 (5%) |
| DOWN | 1500 (40%) |
| LEFT | 300 (8%) |
| RIGHT | 1700 (47%) |

Heavily biased toward DOWN/RIGHT.

After balancing:

| Direction | Training samples |
|-----------|------------------|
| UP | 900 (25%) |
| DOWN | 900 (25%) |
| LEFT | 900 (25%) |
| RIGHT | 900 (25%) |

Perfectly balanced distribution.
Result: The model now sees each direction with equal frequency during training, learning to navigate in all directions rather than just memorizing "go down-right."
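A sketch of one way to do the balancing; whether the real pipeline oversamples, undersamples, or both is not spelled out here, so this version resamples every direction to a common target count and then shuffles.

```typescript
// Resample each action class to the same target count, then shuffle.
function balanceByAction<T extends { action: number }>(samples: T[], numActions = 4): T[] {
  const byAction: T[][] = Array.from({ length: numActions }, () => []);
  for (const s of samples) byAction[s.action].push(s);

  const target = Math.round(samples.length / numActions); // e.g. ~900 per direction
  const balanced: T[] = [];
  for (const group of byAction) {
    if (group.length === 0) continue;
    for (let i = 0; i < target; i++) {
      balanced.push(group[i % group.length]); // repeats samples from under-represented classes
    }
  }
  // Fisher-Yates shuffle so batches mix directions.
  for (let i = balanced.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [balanced[i], balanced[j]] = [balanced[j], balanced[i]];
  }
  return balanced;
}
```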
With our balanced dataset ready, we train the model using supervised learning. The network sees thousands of examples of "in this maze state, the optimal action is X" and gradually learns the patterns.
This process repeats for all batches across 20 epochs. You'll see the loss decrease and accuracy increase as the model learns the navigation patterns. Typically, the CNN converges around epoch 15-20 with 85-95% accuracy.
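In TensorFlow.js terms, the training step might look roughly like this; the epoch count comes from the text, while the batch size and tensor handling are assumptions.

```typescript
import * as tf from '@tensorflow/tfjs';

// Supervised training on the balanced (state, action) pairs.
async function train(model: tf.LayersModel, inputs: number[][][][], actions: number[]) {
  const xs = tf.tensor4d(inputs);                          // [N, 10, 10, 2]
  const ys = tf.oneHot(tf.tensor1d(actions, 'int32'), 4);  // [N, 4]
  await model.fit(xs, ys, {
    epochs: 20,
    batchSize: 32,   // assumption
    shuffle: true,
    callbacks: {
      onEpochEnd: (epoch, logs) => console.log(`epoch ${epoch + 1}`, logs),
    },
  });
  xs.dispose();
  ys.dispose();
}
```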
After training, when we test the model on a new maze, it repeats a simple loop at each position: encode the current state as the dual-channel input, run a forward pass to score the four actions, apply the safety adjustments described below, and take the highest-scoring valid move.
Raw neural network predictions aren't perfect. Sometimes the model predicts a move into a wall or suggests backtracking. We add two intelligent layers that fix these issues without retraining:
Problem: The model sometimes predicts moving into a wall or off the grid edge.
Solution: Before selecting an action, we check which moves are physically valid. For any action that would hit a wall or go out of bounds, we apply a large penalty (-10.0) to its probability score.
Example:
Raw: [UP: 0.25, DOWN: 0.40, LEFT: 0.20, RIGHT: 0.15]
Wall to the left detected!
After: [UP: 0.25, DOWN: 0.40, LEFT: -9.80, RIGHT: 0.15]
→ DOWN selected (highest valid score)
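A sketch of this masking step; the isValidMove check stands in for the project's actual collision test.

```typescript
const WALL_PENALTY = -10.0;

// Apply the wall penalty to every action that is not physically possible.
function maskInvalidActions(
  scores: number[],                        // [UP, DOWN, LEFT, RIGHT]
  isValidMove: (action: number) => boolean
): number[] {
  return scores.map((s, action) => (isValidMove(action) ? s : s + WALL_PENALTY));
}

// Example from above: a wall to the left drops LEFT from 0.20 to -9.80.
const masked = maskInvalidActions([0.25, 0.40, 0.20, 0.15], action => action !== 2);
const best = masked.indexOf(Math.max(...masked)); // 1 -> DOWN
```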
Problem: In narrow corridors, the model sometimes oscillates back and forth (UP → DOWN → UP → DOWN...).
Solution: We track the last action taken. If the model considers the opposite direction (UP↔DOWN or LEFT↔RIGHT), we apply a moderate penalty (-3.0) to discourage but not forbid backtracking.
Example:
Last move: UP
Raw: [UP: 0.10, DOWN: 0.45, LEFT: 0.25, RIGHT: 0.20]
After: [UP: 0.10, DOWN: -2.55, LEFT: 0.25, RIGHT: 0.20]
→ LEFT selected (avoids backtracking to DOWN)
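And a matching sketch for the anti-backtracking penalty:

```typescript
const BACKTRACK_PENALTY = -3.0;
const OPPOSITE = [1, 0, 3, 2]; // opposite of UP, DOWN, LEFT, RIGHT

// Penalize the action that would undo the previous move, without forbidding it.
function penalizeBacktracking(scores: number[], lastAction: number | null): number[] {
  if (lastAction === null) return scores;
  return scores.map((s, action) =>
    action === OPPOSITE[lastAction] ? s + BACKTRACK_PENALTY : s
  );
}

// Example from above: last move UP, so DOWN drops from 0.45 to -2.55 and LEFT wins.
const adjusted = penalizeBacktracking([0.10, 0.45, 0.25, 0.20], 0);
```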
These two techniques dramatically improve navigation success rate without requiring additional training data or model changes. They're like "safety rails" that guide the neural network's decisions.
Note: The Transformer is experimental and trains much slower than the CNN. However, its attention mechanism provides fascinating insights into how models learn spatial relationships.
Unlike CNNs that use fixed 3×3 windows, transformers use attention mechanisms to dynamically decide which parts of the input to focus on. Think of attention as answering the question: "When making a decision at position A, which other positions should I pay attention to?"
Imagine each maze cell as a student in a classroom. The attention mechanism works like this:
The attention score between two positions measures their relevance: high scores mean "these positions should influence each other," low scores mean "these positions are unrelated for this decision."
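A single-head, scaled dot-product sketch of that score computation; the shapes and names are illustrative, and the multi-head version described next simply runs several of these in parallel.

```typescript
import * as tf from '@tensorflow/tfjs';

// q, k, v: [numCells, dim] matrices, one row per maze cell (100 cells for a 10x10 grid).
function attention(q: tf.Tensor2D, k: tf.Tensor2D, v: tf.Tensor2D) {
  return tf.tidy(() => {
    const dim = q.shape[1];
    // scores[i][j]: how relevant cell j is when making a decision at cell i.
    const scores = tf.matMul(q, k, false, true).div(Math.sqrt(dim));
    const weights = tf.softmax(scores);    // each row sums to 1
    const output = tf.matMul(weights, v);  // weighted mix of value vectors
    return { weights, output };
  });
}
```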
Instead of one attention mechanism, we use four parallel "heads", each learning different types of relationships. For example, one head might focus on the current agent position, another on the goal location, another on nearby obstacles, and another on corridor structure. These specializations are not explicitly programmed but emerge during training.
During testing with the Transformer, the yellow and blue highlights show attention weights averaged across all positions and heads. This reveals which cells the model considers important for navigation decisions.
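A sketch of that averaging step, assuming the attention weights come out with shape [heads, cells, cells]:

```typescript
import * as tf from '@tensorflow/tfjs';

// Collapse per-head attention into one 10x10 heatmap for the overlay.
function attentionHeatmap(attnWeights: tf.Tensor3D, size = 10): number[][] {
  return tf.tidy(() => {
    const perCell = attnWeights.mean([0, 1]); // average over heads and query positions
    return perCell.reshape([size, size]).arraySync() as number[][];
  });
}
```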
High attention usually concentrates on the agent, goal, and key decision points. Unlike CNN activations (which are feature-based), attention weights directly show relational importance between positions.