Multi-Armed Bandit: Step Size Comparison

Problem: In reinforcement learning, how should an agent update its value estimates? This project compares constant and non-constant step sizes in the multi-armed bandit problem.

The Update Rule

Q(a) ← Q(a) + α × [Reward - Q(a)]

Where α is the step size that determines how much weight we give to new information.
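The update rule above can be sketched as a one-line Python function (a minimal illustration, not taken from the project's code):

```python
def update(q, reward, alpha):
    """Incremental update: move the estimate toward the new reward by step size alpha."""
    return q + alpha * (reward - q)

# Starting from Q = 0, a reward of 1 with alpha = 0.1 moves the estimate to 0.1.
q = update(0.0, 1.0, 0.1)
```

Note that the same function covers both strategies: the only difference is whether alpha is fixed or shrinks as 1/n.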

📊 Non-Constant (Sample Average)

α = 1/n

where n is the number of times the action has been taken

  • Decreases over time
  • All observations equally weighted
  • Computes true average
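With α = 1/n the incremental update reduces to the ordinary running mean, which is why every observation ends up equally weighted. A minimal sketch (illustrative, not the project's implementation):

```python
class SampleAverage:
    """Running mean of rewards for one action, via the incremental rule with alpha = 1/n."""
    def __init__(self):
        self.q = 0.0  # current value estimate
        self.n = 0    # number of times the action has been taken

    def update(self, reward):
        self.n += 1
        self.q += (reward - self.q) / self.n  # alpha = 1/n
        return self.q
```

Feeding in rewards [1, 0, 1, 0] leaves the estimate at their arithmetic mean, 0.5.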

🎯 Constant Step Size

α = 0.1 (or another fixed value)

  • Stays fixed over time
  • Recent rewards weighted more
  • Exponential recency-weighted average
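Unrolling the constant-α update shows why it is an exponential recency-weighted average: after n updates, reward R_i carries weight α(1 − α)^(n − i), so more recent rewards get geometrically more weight. A short sketch (hypothetical helper names, not from the project):

```python
def constant_step(q, rewards, alpha=0.1):
    """Apply the constant-alpha update for each reward in sequence."""
    for r in rewards:
        q += alpha * (r - q)
    return q

def recency_weights(n, alpha=0.1):
    """Weight placed on reward R_i (i = 1..n) after n constant-alpha updates."""
    return [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
```

For α = 0.1 and n = 3 the weights are [0.081, 0.09, 0.1]: the most recent reward counts most, and the weighted sum of rewards (plus the decayed initial estimate) reproduces `constant_step` exactly.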

Experimental Setup
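The write-up doesn't spell out the testbed details, but a standard setup for this comparison (assumed here, following the usual k-armed bandit formulation) is a set of arms with Gaussian rewards, optionally given a random-walk drift to make the problem non-stationary:

```python
import random

class GaussianBandit:
    """k-armed bandit: pulling arm a yields a reward ~ Normal(means[a], 1).
    An assumed testbed sketch, not necessarily the author's exact setup."""
    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.means = [self.rng.gauss(0, 1) for _ in range(k)]

    def pull(self, a):
        return self.rng.gauss(self.means[a], 1)

    def drift(self, sigma=0.01):
        """Random-walk drift of all arm means: makes the problem non-stationary."""
        self.means = [m + self.rng.gauss(0, sigma) for m in self.means]
```

Calling `drift()` after every step turns the stationary testbed into the non-stationary variant used in the findings below.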

Key Findings

✅ Stationary Problems (Rewards Don't Change)

  • Sample averaging (α = 1/n) converges to the true action values, providing unbiased estimates

🔄 Non-Stationary Problems (Rewards Drift Over Time)

  • A constant step size tracks the drifting rewards by weighting recent observations more heavily

Conclusion

The choice of step size matters! For stationary problems, sample averaging provides unbiased estimates. However, in non-stationary environments—which are common in real-world applications—constant step sizes excel by giving more weight to recent observations and naturally tracking changes over time.
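The whole comparison can be condensed into one small simulation (a hypothetical demo under assumed parameters, not the project's actual experiment): a two-armed bandit whose means follow a random walk, with ε-greedy action selection, where `alpha=None` selects the sample-average rule.

```python
import random

def run(alpha=None, steps=2000, seed=1):
    """Average reward of an epsilon-greedy agent on a drifting two-armed bandit.
    alpha=None uses the sample-average step size 1/n; a float uses a constant alpha."""
    rng = random.Random(seed)
    means = [0.0, 0.0]   # true arm means, drifting over time
    q = [0.0, 0.0]       # value estimates
    n = [0, 0]           # pull counts per arm
    total = 0.0
    for _ in range(steps):
        # epsilon-greedy action selection (epsilon = 0.1)
        a = rng.randrange(2) if rng.random() < 0.1 else max(range(2), key=lambda i: q[i])
        r = rng.gauss(means[a], 1)
        total += r
        n[a] += 1
        step = alpha if alpha is not None else 1 / n[a]
        q[a] += step * (r - q[a])
        # random-walk drift makes the problem non-stationary
        means = [m + rng.gauss(0, 0.05) for m in means]
    return total / steps
```

Comparing `run(alpha=0.1)` against `run(alpha=None)` over many seeds illustrates the conclusion above: on the drifting problem, the constant step size typically earns more reward because its estimates keep up with the moving means.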

Interested in This Research?

I'm passionate about advancing the field of activity recognition and wearable sensing technologies. If you'd like to discuss this research, explore potential collaborations, or have thoughtful comments and questions, I'd love to hear from you!
