Algorithm 1: Monte Carlo Q-learning algorithm

1) Initialize the matrix so that $Q_2(0) = I_d$

2) for $t = 0, \dots, T$:

3) $Q_2(t+1) = Q_2(t) + \alpha \left[ \frac{1}{s} \sum_{k=1}^{s} \left[ N_k + \gamma \Lambda_k^{\top} \Pi(Q_2(t)) \Lambda_k \right] - Q_2(t) \right]$

4) $K_0(t+1) = \frac{1}{1-\gamma} \Gamma_0(Q_2(t+1), 0)$

5) $t = t + 1$

6) end for
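
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the operators $\Pi$ and $\Gamma_0$ are not defined in this section, so they are passed in as caller-supplied functions (`Pi`, `Gamma0`); the $s$ samples $(N_k, \Lambda_k)$ are assumed to be drawn once and reused at every iteration; the update is assumed to subtract $Q_2(t)$ inside the outer bracket; and the prefactor in step 4 is assumed to be $1/(1-\gamma)$. The function name `mc_q_learning` and all argument names are hypothetical.

```python
import numpy as np

def mc_q_learning(N, Lam, Pi, Gamma0, alpha, gamma, T):
    """Sketch of Algorithm 1 (all names here are illustrative assumptions).

    N      : array of shape (s, d, d), the sampled matrices N_k
    Lam    : array of shape (s, m, d), the sampled matrices Lambda_k
    Pi     : callable mapping a (d, d) matrix to an (m, m) matrix (assumed)
    Gamma0 : callable taking (Q, 0) and returning an array (assumed)
    alpha  : step size
    gamma  : discount factor, assumed to satisfy 0 <= gamma < 1
    T      : last iteration index, so the loop runs for t = 0, ..., T
    """
    s, d, _ = N.shape
    Q = np.eye(d)  # step 1: Q_2(0) = I_d
    K = None
    for t in range(T + 1):  # step 2
        P = Pi(Q)
        # Monte Carlo average of N_k + gamma * Lambda_k^T Pi(Q) Lambda_k
        target = np.mean(
            [N[k] + gamma * Lam[k].T @ P @ Lam[k] for k in range(s)],
            axis=0,
        )
        Q = Q + alpha * (target - Q)      # step 3
        K = Gamma0(Q, 0) / (1.0 - gamma)  # step 4
    return Q, K
```

Passing `Pi` and `Gamma0` as callables keeps the sketch independent of their definitions elsewhere in the paper; the explicit increment in step 5 is handled by the `for` loop itself.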