Machine Learning methodology to optimize actions taken over states by learning the $Q$-function $Q(s, a)$. This is done by:

  1. Initialising the $Q$-table $Q(s, a)$, possibly with random values
  2. Updating $Q(s, a)$ with $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$ by taking an exploratory action $a$, where:
    1. $\alpha$ is our learning rate
    2. $r$ is our immediate reward, built as an estimate of how far we are from our target.
    3. $\gamma$ is the time-based discount applied to our rewards.
    4. $Q(s, a)$ is our previous estimate of the $Q$-function.
    5. $s'$ is our state after we took a potentially probabilistic action $a$, weighted towards actions with higher expected $Q$-value.
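The steps above can be sketched as a minimal tabular Q-learning loop. The toy environment (a short chain of states with a reward at the end), the state/action counts, and the hyperparameter values are illustrative assumptions, not part of the notes above:

```python
import random

random.seed(0)

# Hypothetical toy setup: 5 states in a chain, 2 actions (stay / move right).
N_STATES, N_ACTIONS = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate

# 1. Initialise the Q-table, here with small random values.
Q = [[random.uniform(-0.01, 0.01) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]

def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 stays; reward 1 at the last state."""
    next_state = min(state + action, N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(state):
    """Epsilon-greedy: random exploratory action with probability epsilon, else greedy."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

# 2. Update Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        a = choose_action(s)
        s_next, r = step(s, a)
        td_target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (td_target - Q[s][a])
        s = s_next
```

After training, the entries of `Q` along the chain prefer the action that moves towards the rewarding state.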

It is optimal to let our choice of $a$ have high entropy at first and, as we obtain a better estimate of the $Q$-values, bias our choice of $a$ towards high-reward actions.
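One common way to realise this high-entropy-to-greedy schedule is softmax (Boltzmann) action selection with a decaying temperature; this sketch, including the temperature parameter, is an assumption about how the idea could be implemented, not something specified above:

```python
import math
import random

def softmax_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q / T).

    High temperature -> near-uniform (high-entropy) choice;
    low temperature  -> near-greedy, biased towards high-Q actions.
    Decaying T over training moves from exploration to exploitation.
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, cum = random.random(), 0.0
    for action, p in enumerate(probs):
        cum += p
        if r < cum:
            return action
    return len(probs) - 1
```

With a very low temperature this collapses to the greedy choice, while a very high temperature samples actions almost uniformly.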