Machine Learning methodology to optimize actions taken over states by learning the $Q$-function $Q(s, a)$, which estimates the expected cumulative reward of taking action $a$ in state $s$. This is done by:

  1. Initialising the $Q$-table $Q(s, a)$, possibly with random values
  2. Updating $Q(s, a)$ with $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$ by taking an exploratory action $a$ where:
    1. $\alpha$ is our learning rate
    2. $r$ is our immediate reward, built as an estimate of how far we are from our target.
    3. $\gamma$ is the time-based discount applied to our rewards.
    4. $Q(s, a)$ is our previous estimate of the $Q$-function.
    5. $s'$ is our state after we took a potentially probabilistic action $a$, weighted towards actions with higher expected $Q$-value.
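The steps above can be sketched as a tabular update in Python. The state/action counts and the helper name `q_update` are illustrative, not part of the original note:

```python
import numpy as np

# Hypothetical toy setup: 4 states, 2 actions.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
Q = rng.random((n_states, n_actions))  # step 1: initialise the Q-table randomly

alpha, gamma = 0.1, 0.9  # learning rate and discount

def q_update(s, a, r, s_next):
    """One tabular Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))"""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Apply one update after observing transition (s=0, a=1) -> (r=1.0, s'=2)
q_update(s=0, a=1, r=1.0, s_next=2)
```

Each call nudges one table entry a fraction `alpha` of the way towards the bootstrapped target.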

It is optimal to let our choice of $a$ have high entropy at first; as our estimate of the $Q$-values improves, we bias our choice of $a$ towards high-reward actions.
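One common way to realise this schedule is a softmax (Boltzmann) policy with a decaying temperature; the sketch below, with made-up Q-values, shows how high temperature yields a near-uniform (high-entropy) distribution while low temperature concentrates on the best action:

```python
import numpy as np

q_values = np.array([1.0, 2.0, 0.5])  # illustrative Q-values for one state

def softmax_policy(q, temperature):
    """High temperature -> near-uniform (high entropy); low -> near-greedy."""
    z = (q - q.max()) / temperature  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

hot = softmax_policy(q_values, temperature=10.0)   # early training: exploratory
cold = softmax_policy(q_values, temperature=0.1)   # late training: exploitative
```

Annealing the temperature over episodes moves the policy smoothly from exploration to exploitation.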

Info

Q-learning can be seen as dynamic programming with recursive dependencies. Each entry in the dynamic-programming table is refined through multiple passes over the table, with the estimates converging to the true values.
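This convergence through repeated sweeps can be seen on a toy problem. The 3-state chain below is a made-up example: each state has a single "move right" action, reaching the terminal state pays reward 1, and the true values ($Q_1 = 1$, $Q_0 = \gamma = 0.9$) can be checked by hand:

```python
import numpy as np

# Hypothetical 3-state chain: from state i the only action moves to i+1;
# state 2 is terminal. Reward 1.0 on reaching the terminal state, else 0.
gamma, alpha = 0.9, 0.5
Q = np.zeros(3)  # one action per state, so the Q-table is just a vector

for sweep in range(200):  # multiple passes over the table
    for s in (0, 1):
        r = 1.0 if s == 1 else 0.0
        bootstrap = 0.0 if s == 1 else Q[s + 1]  # no future reward past terminal
        Q[s] += alpha * (r + gamma * bootstrap - Q[s])

# Q converges towards the true values Q[1] = 1.0 and Q[0] = 0.9
```

Early sweeps propagate the terminal reward backwards one state at a time, which is exactly the recursive-dependency picture above.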