Ensure proper epsilon decay by verifying the division by 1.1, the initialization, the data types, and the episode-end triggers. Adjust the decay rate if necessary.
The intention is to refine the state-action values of an epsilon-greedy policy toward the optimal policy (it won't become optimal, because it's a soft policy). The requirement is to use a soft policy that approximates the optimal greedy policy over its state-action values. The epsilon-greedy policy satisfies that requirement, even with a constant epsilon.
Although in a real-world scenario an epsilon value with decay would normally be better (especially in stationary environments, like the blackjack environment used in the exercise), there's no need to use decay in this exercise. In fact, I think it's better not to include decay here: the book (Chapter 5) specifies just an epsilon-greedy policy without decay, so omitting it conforms more closely with the book and keeps the focus on the control algorithm itself rather than on the possible exploration policies (decay schedules for ε, Upper Confidence Bound (UCB), Boltzmann/softmax exploration, etc.), even if those would be a better fit and converge faster to the optimal policy.
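For reference, a constant-epsilon epsilon-greedy action selection can be sketched like this (a minimal illustration, not the exercise's actual code; the `Q` dictionary layout and action names are assumptions):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Greedy choice; ties broken by the first max-value action.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example with a toy Q-table for a single blackjack-like state.
Q = {(0, "hit"): 1.0, (0, "stick"): 0.5}
action = epsilon_greedy_action(Q, 0, ["hit", "stick"], epsilon=0.1)
```

With `epsilon=0.1` the policy stays soft forever, which is exactly the property the control algorithm in Chapter 5 requires.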
At the end of each episode, the update should be epsilon = epsilon / 1.1.