Machine Learning methodology to optimize actions $a$ taken over states $s$ by learning the $Q$-function $Q(s, a)$. This is done by:
- Initialising the $Q$-table $Q(s, a)$, possibly with random values
- Updating with $Q^{\text{new}}(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_a Q(s_{t+1}, a) \right)$ by taking an exploratory action $a_t$ (see the sketch after this list), where:
  - $\alpha$ is our learning rate
  - $r_t$ is our immediate reward, built as an estimate of how far we are from our target.
  - $\gamma$ is the time-based discount applied to our rewards.
  - $Q(s_t, a_t)$ is our previous estimate of the $Q$-function.
  - $s_{t+1}$ is our state after taking a potentially probabilistic action $a_t$, weighted towards actions with higher expected $Q$-value.
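A minimal sketch of this update in a tabular setting, assuming a small discrete environment; the sizes `n_states`, `n_actions` and the hyperparameter values are illustrative, not from the note:

```python
import numpy as np

# Hypothetical sizes and hyperparameters for a small discrete environment.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99  # learning rate and time-based discount

# Initialise the Q-table, possibly with random values.
rng = np.random.default_rng(0)
Q = rng.random((n_states, n_actions))

def update(s_t, a_t, r_t, s_next):
    """One Q-learning update: blend the previous estimate Q(s_t, a_t)
    with the bootstrapped target r_t + gamma * max_a Q(s_next, a)."""
    target = r_t + gamma * Q[s_next].max()
    Q[s_t, a_t] = (1 - alpha) * Q[s_t, a_t] + alpha * target
```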
It is optimal to let our choice of $a_t$ have high entropy at first and, as our estimate of the $Q$-values improves, to bias our choice of $a_t$ towards high-reward actions.
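One way to realise this schedule, continuing the sketch above, is a softmax (Boltzmann) policy whose temperature decays with the step count; the note does not name a specific exploration scheme, so this choice and its parameters are assumptions:

```python
def choose_action(s_t, step, tau_start=5.0, tau_end=0.1, decay=1e-3):
    """Sample an action from a softmax over Q-values. A high temperature
    early on gives a high-entropy (exploratory) choice; as the temperature
    decays, probability concentrates on actions with higher Q-value."""
    tau = tau_end + (tau_start - tau_end) * np.exp(-decay * step)
    logits = Q[s_t] / tau
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(n_actions, p=probs)
```

An $\varepsilon$-greedy policy with a decaying $\varepsilon$ is a common alternative that achieves the same shift from exploration to exploitation.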