Multi-Armed Bandit: Step Size Comparison
Problem: In reinforcement learning, how should an agent update its value estimates?
This project compares constant and non-constant step sizes in the multi-armed bandit problem.
The Update Rule
Q(a) ← Q(a) + α × [Reward - Q(a)]
Where α is the step size that determines how much weight we give to new information.
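The rule above can be sketched in a few lines. This is a minimal illustration (the function name and values are mine, not the project's code): each update moves the estimate a fraction α of the way toward the new reward.

```python
# Minimal sketch of the update rule Q(a) <- Q(a) + alpha * (reward - Q(a)).
def update(q, reward, alpha):
    """Move the estimate q a fraction alpha of the gap toward reward."""
    return q + alpha * (reward - q)

q = 0.0
for r in [1.0, 1.0, 1.0]:       # three identical rewards of 1.0
    q = update(q, r, 0.1)
print(round(q, 3))              # 0.271 — the estimate closes 10% of the gap per step
```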
📊 Non-Constant (Sample Average)
α = 1/n
where n is the number of times the action has been taken

- Decreases over time
- All observations equally weighted
- Computes true average
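The "computes true average" claim is easy to verify: with α = 1/n, the incremental rule reproduces the arithmetic mean exactly. A small check (reward values are illustrative):

```python
# With alpha = 1/n, the incremental update equals the running sample mean.
rewards = [2.0, 4.0, 9.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q += (1.0 / n) * (r - q)

print(q)                            # ~5.0
print(sum(rewards) / len(rewards))  # 5.0 — the true average
```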
🎯 Constant Step Size
α = 0.1 (or another fixed value)
- Stays fixed over time
- Recent rewards weighted more
- Exponential recency-weighted average
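"Exponential recency-weighted average" means a reward k steps in the past carries weight α(1−α)^k. A quick check that the incremental rule and the closed-form weights agree (α and rewards here are illustrative; the initial estimate is 0):

```python
# Constant alpha weights the reward from k steps ago by alpha * (1 - alpha)**k.
alpha = 0.5
rewards = [1.0, 2.0, 4.0]

q = 0.0
for r in rewards:                   # incremental updates
    q += alpha * (r - q)

# Same value from the closed-form exponential weights (most recent reward first):
weights = [alpha * (1 - alpha) ** k for k in range(len(rewards))]
q_closed = sum(w * r for w, r in zip(weights, reversed(rewards)))

print(q, q_closed)                  # both 2.625 — recent rewards dominate
```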
Experimental Setup
- Bandit: 10 arms with Gaussian rewards
- Strategy: ε-greedy (ε = 0.1) action selection
- Trials: 2000 independent runs × 1000 steps each
- Conditions: Stationary and non-stationary environments
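The setup above can be sketched as a single run of the stationary testbed. This is a simplified illustration, not the project's actual code: the arm means are drawn once from a standard normal, and the agent is ε-greedy with a constant step size.

```python
import random

# One run of a 10-armed Gaussian bandit with epsilon-greedy selection.
# Constants and arm means are illustrative, not the project's exact values.
random.seed(0)
K, EPS, ALPHA, STEPS = 10, 0.1, 0.1, 1000
true_means = [random.gauss(0, 1) for _ in range(K)]   # fixed (stationary) arm means
q = [0.0] * K                                         # value estimates

for _ in range(STEPS):
    if random.random() < EPS:
        a = random.randrange(K)                       # explore: random arm
    else:
        a = max(range(K), key=q.__getitem__)          # exploit: greedy arm
    reward = random.gauss(true_means[a], 1)           # noisy reward from arm a
    q[a] += ALPHA * (reward - q[a])                   # constant step-size update

print("best arm:", max(range(K), key=true_means.__getitem__),
      "| greedy pick:", max(range(K), key=q.__getitem__))
```

The full experiment repeats this loop for 2000 independent runs and averages the reward curves.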
Key Findings
✅ Stationary Problems (Rewards Don't Change)
- Sample average performs best: it converges to the true values
- Constant step sizes work well but may converge slightly more slowly
- All methods eventually reach similar performance
🔄 Non-Stationary Problems (Rewards Drift Over Time)
- Constant α = 0.1 significantly outperforms sample average
- Sample average struggles because old data drags down estimates
- Higher constant step sizes adapt faster to changes
- The "forgetting" property becomes an advantage
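The gap is easy to reproduce with a single drifting arm. In this sketch (drift rate and noise level are mine, chosen for illustration) the true value climbs steadily; the constant-α estimate tracks it while the sample average is pulled down by stale early observations.

```python
import random

# A single arm whose true value drifts upward over time.
# Constant alpha tracks the drift; the 1/n sample average lags far behind.
random.seed(1)
q_const, q_avg = 0.0, 0.0
for n in range(1, 2001):
    mean = 0.001 * n                      # true value drifts from ~0 to ~2
    r = random.gauss(mean, 0.1)           # noisy observed reward
    q_const += 0.1 * (r - q_const)        # constant step size: forgets old data
    q_avg += (1.0 / n) * (r - q_avg)      # sample average: weights all data equally

print(f"true~{0.001 * 2000:.2f}  const~{q_const:.2f}  avg~{q_avg:.2f}")
```

The sample average ends up near the mean of the *entire* drift history (about 1.0) rather than the current value (about 2.0).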
Conclusion
The choice of step size matters! For stationary problems, sample averaging provides
unbiased estimates. However, in non-stationary environments—which are common in real-world
applications—constant step sizes excel by giving more weight to recent observations and
naturally tracking changes over time.
Interested in This Research?
I'm passionate about advancing the field of activity recognition and wearable sensing technologies.
If you'd like to discuss this research, explore potential collaborations, or have thoughtful comments and questions,
I'd love to hear from you!
View My GitHub
Contact Me