Algorithm 2: Stochastic-approximation Q-learning algorithm

1) Initialize the matrix so that $Q_2(0) = I_d$

2) for $t = 0, \dots, T$:

3) $Q_2(t+1) = Q_2(t) + \alpha \left( N_{t+1} + \gamma \Lambda_{t+1}^{T} \Pi(Q_2(t)) \Lambda_{t+1} - Q_2(t) \right)$

4) $K_0(t+1) = \frac{1}{1-\gamma} \Gamma_0(Q_2(t+1), 0)$

5) $t = t + 1$

6) end for
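
The loop above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation. The listing does not define how $N_{t+1}$ and $\Lambda_{t+1}$ are sampled or what $\Pi$ and $\Gamma_0$ compute, so the callables `sample`, `project`, and `gain` are hypothetical placeholders the caller must supply. The sketch also assumes that step 3 has the usual stochastic-approximation form (target minus current estimate), that the factor in step 4 is 1/(1 - γ), and that all matrices are d × d.

```python
import numpy as np


def stochastic_approximation_q(sample, project, gain, d, alpha, gamma, steps):
    """Run the Algorithm 2 recursion for `steps` iterations.

    All three callables are hypothetical stand-ins (assumptions):
      sample(t)  -> (N, Lam)  draws N_{t+1} and Lambda_{t+1}
      project(Q) -> matrix    plays the role of the operator Pi
      gain(Q)    -> matrix    plays the role of Gamma_0(Q, 0)
    """
    Q = np.eye(d)                        # step 1: Q_2(0) = I_d
    K = None
    for t in range(steps):               # step 2
        N, Lam = sample(t)
        # Sampled target: N_{t+1} + gamma * Lam^T Pi(Q_2(t)) Lam
        target = N + gamma * Lam.T @ project(Q) @ Lam
        Q = Q + alpha * (target - Q)     # step 3 (assumed "target - Q" form)
        K = gain(Q) / (1.0 - gamma)      # step 4 (assumed 1/(1 - gamma) factor)
    return Q, K
```

A constant step size `alpha` is used here because the listing writes a plain α; a decaying schedule could be substituted by replacing `alpha` with a function of `t`.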