Multi-Armed Bandit: Step Size Comparison
Problem: In reinforcement learning, how should an agent update its value estimates?
This project compares constant and non-constant step sizes in the multi-armed bandit problem.
The Update Rule
Q(a) ← Q(a) + α × [Reward - Q(a)]
Where α is the step size that determines how much weight we give to new information.
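The rule above can be sketched in a few lines. This is a minimal illustration (the function name and values are mine, not the project's code): each update moves the estimate a fraction α of the way toward the new reward.

```python
# Minimal sketch of the update rule Q(a) <- Q(a) + alpha * (reward - Q(a)).
def update(q, reward, alpha):
    """Move the estimate q a fraction alpha of the gap toward reward."""
    return q + alpha * (reward - q)

q = 0.0
for r in [1.0, 1.0, 1.0]:       # three identical rewards of 1.0
    q = update(q, r, 0.1)
print(round(q, 3))              # 0.271 — the estimate closes 10% of the gap per step
```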
📊 Non-Constant (Sample Average)
α = 1/n
where n is the number of times the action has been taken

- Decreases over time
- All observations equally weighted
- Computes true average
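The "computes true average" claim is easy to verify: with α = 1/n, the incremental rule reproduces the arithmetic mean exactly. A small check (reward values are illustrative):

```python
# With alpha = 1/n, the incremental update equals the running sample mean.
rewards = [2.0, 4.0, 9.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (1.0 / n) * (r - q)

print(q)                            # ~5.0
print(sum(rewards) / len(rewards))  # 5.0 — the true average
```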
🎯 Constant Step Size
α = 0.1 (or another fixed value)
- Stays fixed over time
- Recent rewards weighted more
- Exponential recency-weighted average
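"Exponential recency-weighted average" means a reward k steps in the past carries weight α(1−α)^k. A quick check that the incremental rule and the closed-form weights agree (α and rewards here are illustrative; the initial estimate is 0):

```python
# Constant alpha weights the reward from k steps ago by alpha * (1 - alpha)**k.
alpha = 0.5
rewards = [1.0, 2.0, 4.0]

q = 0.0
for r in rewards:                   # incremental updates
    q += alpha * (r - q)

# Same value from the closed-form exponential weights (most recent reward first):
weights = [alpha * (1 - alpha) ** k for k in range(len(rewards))]
q_closed = sum(w * r for w, r in zip(weights, reversed(rewards)))

print(q, q_closed)                  # both 2.625 — recent rewards dominate
```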
Experimental Setup
- Bandit: 10 arms with Gaussian rewards
- Strategy: ε-greedy (ε = 0.1) action selection
- Trials: 2000 independent runs × 1000 steps each
- Conditions: Stationary and non-stationary environments
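The setup above can be sketched as a single run of the stationary testbed. This is a simplified illustration, not the project's actual code: the arm means are drawn once from a standard normal, and the agent is ε-greedy with a constant step size.

```python
import random

# One run of a 10-armed Gaussian bandit with epsilon-greedy selection.
# Constants and arm means are illustrative, not the project's exact values.
random.seed(0)
K, EPS, ALPHA, STEPS = 10, 0.1, 0.1, 1000
true_means = [random.gauss(0, 1) for _ in range(K)]   # fixed (stationary) arm means
q = [0.0] * K                                         # value estimates

for _ in range(STEPS):
    if random.random() < EPS:
        a = random.randrange(K)                       # explore: random arm
    else:
        a = max(range(K), key=q.__getitem__)          # exploit: greedy arm
    reward = random.gauss(true_means[a], 1)           # noisy reward from arm a
    q[a] += ALPHA * (reward - q[a])                   # constant step-size update

print("best arm:", max(range(K), key=true_means.__getitem__),
      "| greedy pick:", max(range(K), key=q.__getitem__))
```

The full experiment repeats this loop for 2000 independent runs and averages the reward curves.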
Key Findings
✅ Stationary Problems (Rewards Don't Change)
- Sample average performs best: it converges to the true values
- Constant step sizes work well but may converge slightly more slowly
- All methods eventually reach similar performance
🔄 Non-Stationary Problems (Rewards Drift Over Time)
- Constant α = 0.1 significantly outperforms sample average
- Sample average struggles because old data drags down estimates
- Higher constant step sizes adapt faster to changes
- The "forgetting" property becomes an advantage
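The gap is easy to reproduce with a single drifting arm. In this sketch (drift rate and noise level are mine, chosen for illustration) the true value climbs steadily; the constant-α estimate tracks it while the sample average is pulled down by stale early observations.

```python
import random

# A single arm whose true value drifts upward over time.
# Constant alpha tracks the drift; the 1/n sample average lags far behind.
random.seed(1)
q_const, q_avg = 0.0, 0.0
for n in range(1, 2001):
    mean = 0.001 * n                      # true value drifts from ~0 to ~2
    r = random.gauss(mean, 0.1)           # noisy observed reward
    q_const += 0.1 * (r - q_const)        # constant step size: forgets old data
    q_avg += (1.0 / n) * (r - q_avg)      # sample average: weights all data equally

print(f"true~{0.001 * 2000:.2f}  const~{q_const:.2f}  avg~{q_avg:.2f}")
```

The sample average ends up near the mean of the *entire* drift history (about 1.0) rather than the current value (about 2.0).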
Conclusion
The choice of step size matters! For stationary problems, sample averaging provides
unbiased estimates. However, in non-stationary environments—which are common in real-world
applications—constant step sizes excel by giving more weight to recent observations and
naturally tracking changes over time.
Interested in This Research?
I'm passionate about advancing the field of activity recognition and wearable sensing technologies.
If you'd like to discuss this research, explore potential collaborations, or have thoughtful comments and questions,
I'd love to hear from you!
View My GitHub
Contact Me