Algorithm 1: Monte Carlo Q-learning algorithm
1) Initialize the matrix so that $Q_2(0) = I_d$
2) for $t = 0, \dots, T$:
3) $Q_2(t+1) = Q_2(t) + \alpha \left[ \frac{1}{s} \sum_{k=1}^{s} \left[ N_k + \gamma \Lambda_k^{\top} \Pi(Q_2(t)) \Lambda_k \right] - Q_2(t) \right]$
4) $K_0(t+1) = \frac{1}{1-\gamma} \Gamma_0(Q_2(t+1), 0)$
5) $t = t + 1$
6) end for
6) end for
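
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The operators $\Pi$ and $\Gamma_0$ are not defined in this excerpt, so the sketch takes them as caller-supplied functions (`Pi`, `Gamma0`). The sampler `sample_fn`, which returns one pair $(N_k, \Lambda_k)$ per call, is a hypothetical interface. The factor $1/(1-\gamma)$ in step 4 and the transpose on $\Lambda_k$ follow one reading of the formulas. The toy check at the end uses trivial stand-ins ($\Pi(Q) = Q$, $\Gamma_0(Q, 0) = -Q$, $N_k = \Lambda_k = I_d$) chosen only so that the fixed point is known in closed form.

```python
import numpy as np


def mc_q_learning(sample_fn, Pi, Gamma0, d, alpha, gamma, s, T, rng):
    """Averaged Monte Carlo Q-update, following steps 1-6 of Algorithm 1.

    sample_fn(rng) -> (N_k, Lambda_k): one Monte Carlo sample (hypothetical interface).
    Pi, Gamma0: stand-ins for the operators Pi and Gamma_0, supplied by the caller.
    """
    Q = np.eye(d)                          # step 1: Q_2(0) = I_d
    K = None
    for _ in range(T + 1):                 # step 2: t = 0, ..., T
        PQ = Pi(Q)                         # Pi(Q_2(t)), shared by all s samples
        target = np.zeros((d, d))
        for _ in range(s):
            N_k, Lam_k = sample_fn(rng)
            target += N_k + gamma * Lam_k.T @ PQ @ Lam_k
        target /= s                        # empirical mean over the s samples
        Q = Q + alpha * (target - Q)       # step 3: relaxed update with step size alpha
        K = Gamma0(Q, 0) / (1.0 - gamma)   # step 4 (assumed 1/(1 - gamma) scaling)
    return Q, K


# Toy sanity check with placeholder operators (assumptions, not the paper's definitions).
# With N_k = Lambda_k = I and Pi(Q) = Q, the update has fixed point Q = I / (1 - gamma).
d = 2
rng = np.random.default_rng(0)
Q, K = mc_q_learning(
    sample_fn=lambda rng: (np.eye(d), np.eye(d)),
    Pi=lambda Q: Q,
    Gamma0=lambda Q, z: -Q,
    d=d, alpha=0.5, gamma=0.5, s=4, T=200, rng=rng,
)
print(Q)
print(K)
```

In the toy setting the samples are deterministic, so the iteration reduces to a scalar contraction on the diagonal and `Q` settles at $I_d / (1-\gamma)$. With genuinely random samples, the sample size `s` and step size `alpha` trade off variance against convergence speed.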