DP, Q-learning, SARSA, and MC Algorithms in a Small Gridworld

1 The Gridworld MDP and Dynamic Programming

1.1 The gridworld MDP model

# Model adapted from《强化学习入门-从原理到实践》by 叶强 et al.
# States 0 and 15 are terminal with reward 0; every other state has reward -1
S = [i for i in range(16)]
A = ['上', '下', '左', '右']                       # up, down, left, right
ds_action = {'上': -4, '下': 4, '左': -1, '右': 1}

def dynamics(s, a):
    s_prime = s
    if (s < 4 and a == '上') or (s > 11 and a == '下') or (s % 4 == 0 and a == '左') \
            or ((s + 1) % 4 == 0 and a == '右') or s in [0, 15]:
        pass
    else:
        s_prime = s + ds_action[a]
    reward = 0 if s in [0, 15] else -1    # note: the reward is received on leaving state s
    done = True if s in [0, 15] else False
    return s_prime, reward, done

def P(s1, a, s2):
    s_prime, _, _ = dynamics(s1, a)
    return s2 == s_prime

def R(s, a):
    _, reward, _ = dynamics(s, a)
    return reward

gamma = 0.95
MDP = (S, A, P, R, gamma)

def display_V(V):
    for i in range(len(V)):
        print(f'{V[i]:10.2f}', end=' ')
        if (i + 1) % 4 == 0:
            print('')
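
A quick sanity check of this model (a minimal sketch; the expected results in the comments follow directly from the functions above):

print(dynamics(1, '左'))   # moving left from state 1 reaches terminal state 0: (0, -1, False)
print(P(1, '左', 0))       # True: the transition is deterministic
print(R(0, '上'))          # 0: terminal states yield reward 0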

1.2 Dynamic programming (value iteration)

Bellman expectation equations:
$$v^{\pi}(s) = \sum_{a \in A}\pi(a|s)\,q^{\pi}(s,a)$$
$$q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\,v^{\pi}(s')$$
Bellman optimality equations:
$$v^{*}(s) = \max_{a} q^{*}(s,a)$$
$$q^{*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\,v^{*}(s')$$

# Value iteration using the Bellman optimality equation

# compute q(s, a)
def Qsa(MDP, V, s, a):
    _, _, _, _, gamma = MDP
    s_prime, reward, done = dynamics(s, a)
    q_sa = reward + gamma * V[s_prime]    # the next state is unique given (s, a)
    return q_sa

# compute v(s)
def Vs(MDP, V, s):
    _, A, _, _, _ = MDP
    q_sa = -float('inf')
    for a in A:
        q_sa = max(q_sa, Qsa(MDP, V, s, a))
    vs = q_sa
    return vs

# iteration loop
def V_dp(MDP, V, display=True, iters=1):
    S, _, _, _, _ = MDP
    for i in range(iters):
        for s in S:
            vs = Vs(MDP, V, s)
            V[s] = vs
        if display:
            print(f'Iteration {i+1}:')
            display_V(V)

# training
V = [0 for _ in range(16)]
V_dp(MDP, V, iters=5)

Output:
Iteration 1:
0.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 -1.00
-1.00 -1.00 -1.00 0.00
Iteration 2:
0.00 -1.00 -1.95 -1.95
-1.00 -1.95 -1.95 -1.95
-1.95 -1.95 -1.95 -1.00
-1.95 -1.95 -1.00 0.00
Iteration 3:
0.00 -1.00 -1.95 -2.85
-1.00 -1.95 -2.85 -1.95
-1.95 -2.85 -1.95 -1.00
-2.85 -1.95 -1.00 0.00
Iteration 4:
0.00 -1.00 -1.95 -2.85
-1.00 -1.95 -2.85 -1.95
-1.95 -2.85 -1.95 -1.00
-2.85 -1.95 -1.00 0.00
Iteration 5:
0.00 -1.00 -1.95 -2.85
-1.00 -1.95 -2.85 -1.95
-1.95 -2.85 -1.95 -1.00
-2.85 -1.95 -1.00 0.00
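
The values stop changing after the third sweep, as expected: each state's value is the negated, discounted length of its shortest path to a terminal corner. As a worked check for state 3 (top-right), the optimality equation gives
$$v^{*}(3) = -1 + 0.95\,v^{*}(2) = -1 + 0.95\big(-1 + 0.95\,v^{*}(1)\big) = -1 - 0.95 - 0.95^{2} \approx -2.85,$$
which matches the grid above.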

2 Q-learning, SARSA, and MC (Monte Carlo) Algorithms for the MDP

State 0 is the start state; states 0-22 have reward -1, and state 23 is the terminal state with reward 0. The grid is laid out as:
$$\begin{bmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \\ 16 & 17 & 18 & 19 \\ 20 & 21 & 22 & 23 \end{bmatrix}$$
Problem: find the shortest path from state 0 to state 23.

import random
import numpy as np

# Environment model. S is the state set [0, 1, 2, ..., 23], i.e. 24 states
S = [i for i in range(24)]

# Actions: 0 = up, 1 = down, 2 = left, 3 = right
A = [0, 1, 2, 3]
ds_action = {0: -4, 1: 4, 2: -1, 3: 1}

def step(s, a):
    s_prime = s
    if (s < 4 and a == 0) or (s > 19 and a == 1) or (s % 4 == 0 and a == 2) \
            or ((s + 1) % 4 == 0 and a == 3) or s in [23]:
        pass
    else:
        s_prime = s + ds_action[a]
    reward = 0 if s in [23] else -1
    done = True if s in [23] else False
    return s_prime, reward, done

def display_V(V):
    for i in range(len(V)):
        print(f'{V[i]:10.2f}', end=' ')
        if (i + 1) % 4 == 0:
            print('')

"""
A = [0, 1, 2, 3]:
move up:    0   "↑"
move down:  1   "↓"
move left:  2   "←"
move right: 3   "→"
"""

# Take each state's greedy action index, Q_argmax = Q.argmax(axis=1).tolist(),
# and print the policy as arrows
def arrow_Q(Q_argmax):
    for i in range(len(Q_argmax)):
        if Q_argmax[i] == 0:
            print("↑", end='  ')
        if Q_argmax[i] == 1:
            print("↓", end='  ')
        if Q_argmax[i] == 2:
            print("←", end='  ')
        if Q_argmax[i] == 3:
            print("→", end='  ')
        if (i + 1) % 4 == 0:
            print('')
    print('')
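
A quick smoke test of the environment (a minimal sketch; the expected tuples in the comments follow from the step function above):

print(step(0, 0))    # moving up from the top row is blocked: (0, -1, False)
print(step(22, 3))   # moving right from state 22 reaches the terminal state: (23, -1, False)
print(step(23, 1))   # the terminal state is absorbing: (23, 0, True)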

2.1 Q-learning

The Q-learning procedure is shown in the figure below (source: laswT, https://www.zhihu.com/question/26408259/answer/123230350?utm_source=zhihu&utm_medium=social&utm_oi=1161946958133719040):
[Figure: Q-Learning algorithm flow]

# epsilon-greedy policy: returns one action from A
def epsilon_greedy_sa(Q, A, s, epsilon=0.1):
    rand_value = random.random()
    if rand_value < epsilon:
        return random.choice(A)        # explore: random action
    else:
        return greedy_sa(Q, A, s)      # exploit: greedy action

# greedy policy: returns an action with the largest Q(s, a), ties broken at random
def greedy_sa(Q, A, s):
    max_q = -float('inf')
    max_a = []
    for a in A:
        q = Q[s][a]
        if q > max_q:
            max_q = q
            max_a = [a]
        elif q == max_q:
            max_a.append(a)
    return random.choice(max_a)
# One episode of learning
def q_learning(Q, S, A, gamma=0.9, alpha=0.1, epsilon=0.1):
    steps = 0
    s0 = S[0]
    is_done = False
    while not is_done:
        a0 = epsilon_greedy_sa(Q, A, s0, epsilon)
        s1, r1, is_done = step(s0, a0)   # is_done is judged on state s0
        a1 = greedy_sa(Q, A, s1)         # off-policy: bootstrap with the greedy action
        q = Q[s0][a0]
        q_prime = Q[s1][a1]
        td_target = r1 + gamma * q_prime
        q_new = q + alpha * (td_target - q)
        Q[s0][a0] = q_new
        s0 = s1
        steps += 1
    return steps

Q = np.zeros((24, 4))
all_steps = []
for i in range(1000):
    steps = q_learning(Q, S, A)
    all_steps.append(steps)

V = Q.max(axis=1).tolist()
display_V(V)

Output:
-5.69 -5.22 -4.69 -4.10
-5.22 -4.69 -4.10 -3.44
-4.69 -4.10 -3.44 -2.71
-4.09 -3.44 -2.71 -1.90
-3.44 -2.71 -1.90 -1.00
-2.71 -1.90 -1.00 0.00
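
The start-state value is close to the exact shortest-path value under $\gamma = 0.9$: the optimal route takes 8 steps, so $-(1 - 0.9^{8})/(1 - 0.9) \approx -5.70$, consistent with the learned $-5.69$. To visualize the learned greedy policy with the arrow_Q helper defined earlier, a minimal sketch (the printed arrows depend on the random run):

Q_argmax = Q.argmax(axis=1).tolist()   # greedy action index for each state
arrow_Q(Q_argmax)                      # prints the policy as a 6x4 grid of arrows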

2.2 SARSA

The SARSA procedure is shown in the figure below (source: laswT, https://www.zhihu.com/question/26408259/answer/123230350?utm_source=zhihu&utm_medium=social&utm_oi=1161946958133719040):
[Figure: SARSA algorithm flow]

# One episode of learning
def sarsa(Q, S, A, gamma=0.9, alpha=0.1, epsilon=0.1):
    steps = 0
    s0 = S[0]
    a0 = epsilon_greedy_sa(Q, A, s0, epsilon)
    done = False
    while not done:
        s1, r1, done = step(s0, a0)
        a1 = epsilon_greedy_sa(Q, A, s1)   # on-policy: a1 is the action actually taken next
        q = Q[s0][a0]
        q_prime = Q[s1][a1]
        td_target = r1 + gamma * q_prime
        q_new = q + alpha * (td_target - q)
        Q[s0][a0] = q_new
        s0 = s1
        a0 = a1
        steps += 1
    return steps

Q = np.zeros((24, 4))
all_steps = []
for i in range(1000):
    steps = sarsa(Q, S, A)
    all_steps.append(steps)

V = Q.max(axis=1).tolist()
display_V(V)

Output:
-5.82 -5.33 -4.82 -4.23
-5.33 -4.79 -4.29 -3.73
-4.78 -4.20 -3.58 -2.95
-4.20 -3.52 -2.83 -1.94
-3.55 -2.76 -1.93 -1.00
-2.75 -1.97 -1.00 0.00
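
SARSA evaluates the $\epsilon$-greedy behaviour policy rather than the greedy target policy, which is why its values are slightly lower than the Q-learning values above. The per-episode step counts collected in all_steps can be used to inspect convergence; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

plt.plot(all_steps)              # steps per episode; should approach 8, the shortest path length
plt.xlabel('episode')
plt.ylabel('steps to terminal')
plt.show()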

2.3 The MC algorithm

(1) Algorithm flow:

a. Following a given policy $\pi$, sample the $k$-th complete episode $s_0, a_1, r_1, \ldots, s_t$.
b. For every state-action pair $(s_t, a_t)$ appearing in the episode, update its count $N$ and the action-value function $Q$:
$$N(s_t,a_t) \gets N(s_t,a_t) + 1$$
$$Q(s_t,a_t) \gets Q(s_t,a_t) + \frac{1}{N(s_t,a_t)}\big(G_t - Q(s_t,a_t)\big)$$
where $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1}R_T$.
c. Based on the updated action-value function $Q$, improve the policy (the action selection) in an $\epsilon$-greedy way:
$$\epsilon \gets 1/k$$
$$\pi \gets \epsilon\text{-greedy}(Q)$$
(2) About $\epsilon$: in practice $\epsilon$ is not restricted to $1/k$; any schedule satisfying the GLIE property converges to the optimal policy (a sketch of such a schedule follows this list).
(3) GLIE (Greedy in the Limit with Infinite Exploration):
a. Every state-action pair is explored infinitely many times:
$$\lim_{k \rightarrow \infty} N_{k}(s,a) = \infty$$
b. As the number of sampled episodes goes to infinity, the policy converges to a greedy policy:
$$\lim_{k \rightarrow \infty} \pi_{k}(a|s) = \mathbf{1}\big(a = \underset{a' \in A}{\arg\max}\, Q_{k}(s,a')\big)$$
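
As a concrete illustration of step c, a minimal sketch of the $1/k$ schedule (the helper epsilon_schedule is hypothetical and is not used by the training code below, which applies a linear decay with a floor instead):

# Hypothetical helper: the 1/k GLIE schedule from step c above
def epsilon_schedule(k):
    return 1.0 / k

print([round(epsilon_schedule(k), 3) for k in (1, 2, 10, 100)])   # [1.0, 0.5, 0.1, 0.01]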

# epsilon-greedy policy: returns one action from A (default epsilon=1.0, i.e. fully random)
def epsilon_greedy_sa(Q, A, s, epsilon=1.0):
    rand_value = random.random()
    if rand_value < epsilon:
        return random.choice(A)
    else:
        return greedy_sa(Q, A, s)

def greedy_sa(Q, A, s):
    max_q = -float('inf')
    max_a = []
    for a in A:
        q = Q[s][a]
        if q > max_q:
            max_q = q
            max_a = [a]
        elif q == max_q:
            max_a.append(a)
    return random.choice(max_a)

def display_V(V):
    for i in range(len(V)):
        print(f'{V[i]:10.2f}', end=' ')
        if (i + 1) % 4 == 0:
            print('')

# Sample one episode from the start state until the terminal state
def mc_search(Q, A, S, epsilon=1.0):
    episode = []
    is_done = False
    s0 = S[0]
    while not is_done:
        a0 = epsilon_greedy_sa(Q, A, s0, epsilon=epsilon)
        s1, r1, is_done = step(s0, a0)
        trans = [s0, a0, r1]
        episode.append(trans)
        s0 = s1
    return episode
2.3.1 Every-visit MC

# Learn from a single episode (every occurrence of (s, a) is updated)
def mc_learning(Q, Ns, episode, gamma=0.99):
    for i in range(len(episode)):
        s, a, _ = episode[i]
        Ns[s][a] += 1
        retrn = 0.0
        power = 0
        for j in range(i, len(episode)):
            retrn += np.power(gamma, power) * episode[j][2]
            power += 1
        q = Q[s][a]
        Q[s][a] = q + (1 / Ns[s][a]) * (retrn - q)
    return Q, Ns

# Same update, computing the returns backwards in a single pass
def mc_learning_reversed(Q, Ns, episode, gamma=0.99):
    retrn = 0
    for i in reversed(range(0, len(episode))):
        s, a, _ = episode[i]
        Ns[s][a] += 1
        retrn = episode[i][2] + gamma * retrn
        q = Q[s][a]
        Q[s][a] = q + (1 / Ns[s][a]) * (retrn - q)
    return Q, Ns

# Sample one episode, then learn from it
def mc_train(Q, Ns, S, A, epsilon=1.0, max_epsilon=1.0, epsilon_decay=1/100,
             min_epsilon=0.0001, gamma=0.99, iters=1):
    all_steps = []
    epsilons = []
    for i in range(iters):
        epsilons.append(epsilon)
        epsilon = max(min_epsilon, epsilon - (max_epsilon - min_epsilon) * epsilon_decay)
        episode = mc_search(Q, A, S, epsilon=epsilon)
        Q, Ns = mc_learning_reversed(Q, Ns, episode, gamma)
        all_steps.append(len(episode))
    return Q, Ns, all_steps, epsilons

Q = np.zeros((24, 4))
Ns = np.zeros((24, 4))
Q, Ns, all_steps, epsilons = mc_train(Q, Ns, S, A, iters=1000)
V = Q.max(axis=1).tolist()
display_V(V)

Output:
-11.46 -15.30 -33.15 -29.29
-32.48 -7.24 -26.45 -24.35
-23.72 -5.60 -19.16 -16.52
-23.44 -4.27 -7.76 -7.34
-12.81 -3.28 -2.13 -1.00
-12.93 -9.86 -1.00 0.00
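
The forward and reversed implementations compute the same returns; a quick check on a hypothetical toy episode of [s, a, r] triples (made up purely for illustration):

toy = [[0, 3, -1], [1, 3, -1], [2, 3, -1]]
Qf, Nf = mc_learning(np.zeros((24, 4)), np.zeros((24, 4)), toy, gamma=0.99)
Qr, Nr = mc_learning_reversed(np.zeros((24, 4)), np.zeros((24, 4)), toy, gamma=0.99)
print(round(Qf[0][3], 4), round(Qr[0][3], 4))   # both: -2.9701, i.e. -1 - 0.99 - 0.99**2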

2.3.2 First-visit MC

# Learn from a single episode (first-visit: each (s, a) pair is updated once per episode)
def mc_learning_first(Q, Ns, episode, gamma=0.95):
    sa_pairs = set([(x[0], x[1]) for x in episode])
    for s, a in sa_pairs:
        Ns[s][a] += 1
        retrn = 0.0
        power = 0
        first_idx = next(i for i, x in enumerate(episode) if x[0] == s and x[1] == a)
        for j in range(first_idx, len(episode)):
            retrn += np.power(gamma, power) * episode[j][2]
            power += 1
        q = Q[s][a]
        Q[s][a] = q + (1 / Ns[s][a]) * (retrn - q)
    return Q, Ns

# First-visit MC training loop
def mc_train_first(Q, Ns, S, A, epsilon=1.0, max_epsilon=1.0, epsilon_decay=1/100,
                   min_epsilon=1e-5, gamma=1.0, iters=1):
    all_steps = []
    epsilons = []
    for i in range(iters):
        epsilon = max(min_epsilon, epsilon - (max_epsilon - min_epsilon) * epsilon_decay)
        epsilons.append(epsilon)
        episode = mc_search(Q, A, S, epsilon=epsilon)
        Q, Ns = mc_learning_first(Q, Ns, episode, gamma)
        all_steps.append(len(episode))
    return Q, Ns, all_steps, epsilons

Q = np.zeros((24, 4))   # Q table; could also be initialized with np.random.random (uniform) or np.random.randn (standard normal)
Ns = np.zeros((24, 4))  # visit count for each (s, a)
Q, Ns, all_steps, epsilons = mc_train_first(Q, Ns, S, A, iters=1000)
V = Q.min(axis=1).tolist()   # note: unlike the earlier sections, this displays the minimum over actions
display_V(V)

Output:
-40.67 -56.33 -51.25 -36.83
-44.45 -38.27 -39.86 -41.29
-31.93 -46.86 -41.71 -27.40
-26.26 -23.73 -27.22 -22.12
-25.57 -20.67 -26.29 -13.17
-19.50 -21.83 -16.50 0.00
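
Note that the grid above shows the minimum Q-value per state (the code calls Q.min), unlike the earlier sections, which display the value of the greedy action, so it looks worse than the learned policy actually is. For a grid comparable with the earlier ones, one line suffices:

display_V(Q.max(axis=1).tolist())   # greedy-action values, comparable with the earlier grids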

