I'm having a little trouble doing the last part of the problem set.
I read through the Tesauro reading, and I am still confused about
how to code the TD backpropagation.
You should also read the Sutton paper, since that actually tries
to explain TD in more detail. In particular it tells you what TD(0)
is, which is just using the plain old gradient, with no lambda
weighting.
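To make the TD(0) case concrete, here is a minimal sketch on a hypothetical linear model (sigmoid of a weighted sum, standing in for the net): with lambda = 0 the eligibility trace is just the current gradient, so the update is one gradient step moving the current estimate toward the target. The function names and the linear model are illustrative, not part of the assignment.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Current win-probability estimate P(t) for features x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def td0_update(w, x, target, alpha=0.1):
    """One TD(0) step: nudge P(t) toward the target (P(t+1), or 1/0)."""
    p = predict(w, x)
    # (target - p) is the TD error; p*(1-p)*x is the sigmoid gradient.
    scale = (target - p) * p * (1.0 - p)
    return [wi + alpha * scale * xi for wi, xi in zip(w, x)]
```

With a real net you would replace the hand-written gradient with one backprop pass, but the shape of the update is the same.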
Are we supposed to train it with a target of 0 every time it loses,
and a target of 1 every time it wins?
Yes. But you also need to train it at every move. That's the whole
point of TD: at every board position, running the net on that
position gives you the current probability estimate (call that
P(t)). You then use the current net to pick the move that has the
best probability of winning as estimated by the net (or 1 or 0 if
the best move immediately wins or loses); that estimate is P(t+1).
Then you do one step of backprop using P(t+1) as the target for the
net. This has the effect of moving the net's estimate for the
current board closer to P(t+1).
This is in fact very simple to code (approx 10-15 lines).
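Here is a toy sketch of that per-move loop, on a tiny 1-D "race" game instead of backgammon (the game, the table of values standing in for the net, and all names are illustrative, not part of the assignment): states 0..4, reaching 4 wins (target 1), reaching 0 loses (target 0), and the backprop step becomes a one-line tabular update.

```python
import random

N = 5                # states 0..4; state 4 wins, state 0 loses
V = [0.5] * N        # value estimates, stand-in for the net
ALPHA = 0.1

def value(s):
    """Estimated win probability of state s (1/0 at the terminals)."""
    if s == N - 1:
        return 1.0
    if s == 0:
        return 0.0
    return V[s]

def td_step(s, s_next):
    """One 'backprop' step: move V[s] toward P(t+1) = value(s_next)."""
    V[s] += ALPHA * (value(s_next) - V[s])

random.seed(0)
for _ in range(2000):            # self-play episodes
    s = N // 2
    while 0 < s < N - 1:
        # pick the move with the best estimated win probability,
        # breaking ties randomly
        s_next = max([s - 1, s + 1],
                     key=lambda c: (value(c), random.random()))
        td_step(s, s_next)       # train at every move
        s = s_next
```

After training, the values along the greedy path climb toward 1, which is the behavior the per-move updates are meant to produce.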
Also, what should I use for the learning rate and momentum?
Is that what alpha is for?
Alpha is intended to be the learning rate (e.g. 0.1); use zero momentum.
Let me know if that's obscure.
Tomas