Algorithm 2: Stochastic-approximation Q-learning algorithm
1) Initialize the matrix $Q_2(0) = I_d$
2) for $t = 0, \dots, T$:
3) $Q_2(t+1) = Q_2(t) + \alpha \left( N_{t+1} + \gamma \Lambda_{t+1}^{\top} \Pi(Q_2(t)) \Lambda_{t+1} - Q_2(t) \right)$
4) $K_0(t+1) = \frac{1}{1-\gamma} \Gamma_0(Q_2(t+1), 0)$
5) t = t + 1
6) end for
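
The loop above can be sketched in Python with NumPy. This is only a minimal illustration, not the authors' implementation: the function name `sa_q_learning` and the callables `sample`, `Pi`, and `Gamma0` are hypothetical placeholders for the sampled pair $(N_{t+1}, \Lambda_{t+1})$ and the maps $\Pi$ and $\Gamma_0$, whose definitions are not given in this listing and must be supplied by the caller. The sketch also assumes a constant step size $\alpha$ and that the factor in step 4 is $1/(1-\gamma)$.

```python
import numpy as np


def sa_q_learning(d, T, alpha, gamma, sample, Pi, Gamma0):
    """Sketch of the stochastic-approximation Q-learning loop.

    All callables are hypothetical stand-ins supplied by the caller:
      sample(t)    -> (N, Lam), the sampled matrices N_{t+1}, Lambda_{t+1}
      Pi(Q)        -> matrix used in the quadratic term Lam^T Pi(Q) Lam
      Gamma0(Q, x) -> matrix from which the gain K_0 is obtained
    """
    Q = np.eye(d)  # step 1: Q_2(0) = I_d
    K = None
    for t in range(T + 1):  # step 2: t = 0, ..., T (the loop advances t)
        N, Lam = sample(t)
        # step 3: stochastic-approximation update of Q_2
        Q = Q + alpha * (N + gamma * Lam.T @ Pi(Q) @ Lam - Q)
        # step 4: gain update; the 1/(1 - gamma) scaling is an assumption
        K = Gamma0(Q, 0) / (1.0 - gamma)
    return Q, K
```

As a sanity check with deliberately trivial placeholders (`sample` returning $N = I$ and $\Lambda = 0.5\,I$, and `Pi`, `Gamma0` both returning `Q` unchanged), the update contracts toward the fixed point of $Q = I + \gamma \cdot 0.25\, Q$. These placeholders only exercise the loop and carry no meaning for the underlying control problem.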