Solving a Maze Game with the Sarsa Reinforcement Learning Algorithm: A Line-by-Line Code Walkthrough

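This article is compiled from my study notes on Baidu's 7-day introductory reinforcement learning course.
Thanks to Li Kejiao (李科浇) of the Baidu PARL team for teaching the course.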

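Contents

  • 1. Installing the dependencies
  • 2. Importing the dependencies
  • 3. The agent's algorithm: Sarsa
  • 4. Training and test routines
  • 5. Creating the environment, instantiating the agent, and running training and testing
  • 6. Analyzing the results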

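1. Installing the dependencies

Install Gym, the environment library used for reinforcement learning. The code in this article targets the older Gym API (the "FrozenLake-v0" environment id and a four-value return from env.step()), so a recent Gym release may need adjustments.

pip install gym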

2. Importing the dependencies

import gym
import numpy as np
import time # used to pause the program so the rendered frames are easier to watch

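3. The agent's algorithm: Sarsa

  • The agent is the entity that interacts with the environment
    • It observes the current state
    • It chooses an action based on the current state
    • It updates the Q table based on the outcome of that choice
  • predict() method: takes an observation (in other words, the state) and returns the "predicted" action (the best action)
    • Look up the Q values of all actions available in the current state
    • Find the largest Q value and collect every action that attains it into a list
    • That list holds the candidate best actions
    • Pick one action from that list at random
  • sample() method: builds on predict() and adds ε-greedy exploration, returning the action that is "actually" taken
    • Uses the epsilon-greedy algorithm
    • With 90% probability, take the best action
    • With 10% probability, take a random action
  • learn() method: takes one transition of training data and performs one update of the Q table
    • It updates the Q value of taking action in the previous state obs
    • If the episode has ended, the target Q value is simply reward
    • If the episode has not ended, the target combines reward with the discounted Q value of the next step
    • The learning rate lr limits how far each update moves the stored Q value toward the target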
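The 90%/10% split describes which branch sample() takes, not the final probability of each action, because the random branch can also land on the best action. As a rough sketch (the helper name epsilon_greedy_probs is made up for illustration and is not part of the course code), the resulting per-action probabilities can be computed like this:

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Probability of each action under the sample() rule described above."""
    q_row = np.asarray(q_row, dtype=float)
    n_actions = q_row.size
    best = (q_row == q_row.max())                    # all actions tied for the largest Q value
    probs = np.full(n_actions, epsilon / n_actions)  # exploration: uniform over all actions
    probs[best] += (1.0 - epsilon) / best.sum()      # greedy part: split evenly among ties
    return probs

# Two actions tie for the best Q value in this made-up row.
probs = epsilon_greedy_probs([0.2, 0.5, 0.1, 0.5], epsilon=0.1)
print(probs)
```

With four actions and a single best action, the best action ends up being taken with probability 0.9 + 0.1 / 4 = 0.925, and each of the others with probability 0.025.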
class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of available actions
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # discount factor: how much later Q values influence earlier ones
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation (with 10% exploration by default)
    def sample(self, obs):
        if np.random.uniform(0, 1) < 1 - self.epsilon: # 90% of the time
            action = self.predict(obs) # take the best action
        else: # 10% of the time
            action = np.random.choice(self.act_n) # take a random action
        return action

    # Predict the best action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :] # Q values of all actions in the current state
        maxQ = np.max(Q_list) # largest Q value in the list
        action_list = np.where(Q_list == maxQ)[0] # actions with the largest Q value, i.e. the best actions
        action = np.random.choice(action_list) # pick one of the best actions at random
        return action

    # Learning step, i.e. the Q-table update
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward received for this action, r
            next_obs: observation after the interaction, s_t+1
            next_action: action chosen for next_obs from the current Q table, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs, action] # Q value of the chosen action in the state before the interaction
        if done: # the episode is over
            target_Q = reward # the target Q value is just the reward
        else: # the episode is not over
            # combine the reward with the Q value of the next action in the next state
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]
        self.Q[obs, action] += self.lr * (target_Q - predict_Q) # lr scales the size of the update

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')
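To make the update in learn() concrete, here is a small worked sketch on an all-zero Q table. The standalone function sarsa_update is a hypothetical helper that mirrors the rule in learn(), and the two transitions are made up for illustration, using lr = 0.1 and gamma = 0.9.

```python
import numpy as np

def sarsa_update(Q, obs, action, reward, next_obs, next_action, done,
                 lr=0.1, gamma=0.9):
    """Apply one Sarsa update to Q in place and return the target value."""
    if done:
        target = reward                                      # terminal step: no bootstrapping
    else:
        target = reward + gamma * Q[next_obs, next_action]   # bootstrap from the next (state, action)
    Q[obs, action] += lr * (target - Q[obs, action])
    return target

Q = np.zeros((16, 4))

# Terminal step: action 2 in state 14 ends the episode with reward 1.
# Q[14, 2] moves from 0 toward 1 by a fraction lr: 0 + 0.1 * (1 - 0) = 0.1
sarsa_update(Q, obs=14, action=2, reward=1.0, next_obs=15, next_action=0, done=True)

# Non-terminal step: action 2 in state 13 leads to state 14, where action 2 comes next.
# target = 0 + 0.9 * Q[14, 2] = 0.09, so Q[13, 2] = 0 + 0.1 * (0.09 - 0) = 0.009
sarsa_update(Q, obs=13, action=2, reward=0.0, next_obs=14, next_action=2, done=False)

print(Q[14, 2], Q[13, 2])
```

The second update shows how the reward received at the end of an episode propagates backwards one state-action pair at a time, shrunk by gamma and lr on the way.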

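4. Training and test routines

For each episode, record the number of steps total_steps and the total reward total_reward

The Q table is updated at every step

def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0 # total reward collected in this episode

    obs = env.reset() # reset the environment and start a new episode
    action = agent.sample(obs) # choose an action with the algorithm

    while True:
        next_obs, reward, done, _ = env.step(action) # interact with the environment by executing the action
        next_action = agent.sample(next_obs) # choose the next action with the algorithm
        # Train the Sarsa algorithm
        agent.learn(obs, action, reward, next_obs, next_action, done)
        # obs (state before the action) and action give the predicted Q value
        # reward, next_obs (state after the action) and next_action give the target used to update that Q value
        # done indicates whether the episode has ended

        action = next_action # move on to the next action
        obs = next_obs  # keep the new observation as the current state
        total_reward += reward # accumulate the reward
        total_steps += 1 # count the step
        if render: # check whether the frame should be rendered
            env.render() # render a new frame
        if done: # the episode is over
            break # leave the loop, which ends this episode
    return total_reward, total_steps # return the total reward and the number of steps


def test_episode(env, agent):
    total_reward = 0 # total reward collected
    obs = env.reset() # reset the environment; obs is the initial observation, i.e. the initial state
    while True:
        action = agent.predict(obs) # greedy: always take the best action
        next_obs, reward, done, _ = env.step(action) # get the new state, the reward, and whether the episode ended
        total_reward += reward # accumulate the reward
        obs = next_obs # move on to the new state
        time.sleep(0.5) # pause so that the rendered frame can be observed
        env.render() # render the frame
        if done: # the episode is over
            break # leave the loop
    return total_reward # return the accumulated reward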
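Both loops above unpack env.reset() into a single observation and env.step() into four values. Newer Gym releases and Gymnasium instead return (obs, info) from reset() and five values from step(). The sketch below shows one possible way to bridge the two; the helper names reset_compat and step_compat and the two stub environments are hypothetical, and the tuple check assumes the observation itself is not a 2-tuple (true for FrozenLake, where it is an integer).

```python
def reset_compat(env):
    """Return only the observation, whether reset() gives obs or (obs, info)."""
    result = env.reset()
    if isinstance(result, tuple) and len(result) == 2:
        return result[0]
    return result

def step_compat(env, action):
    """Return (next_obs, reward, done, info) for 4- or 5-value step() results."""
    result = env.step(action)
    if len(result) == 5:
        next_obs, reward, terminated, truncated, info = result
        return next_obs, reward, terminated or truncated, info
    return result

# Tiny stand-ins that only mimic the two return shapes (not real Gym environments).
class OldStyleEnv:
    def reset(self):
        return 0
    def step(self, action):
        return 1, 0.0, False, {}

class NewStyleEnv:
    def reset(self):
        return 0, {}
    def step(self, action):
        return 1, 1.0, True, False, {}

for env in (OldStyleEnv(), NewStyleEnv()):
    obs = reset_compat(env)
    next_obs, reward, done, info = step_compat(env, 0)
    print(type(env).__name__, obs, next_obs, reward, done)
```

With helpers like these, obs = env.reset() would become obs = reset_compat(env), and the four-value unpacking of env.step(action) would become step_compat(env, action).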

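5. Creating the environment, instantiating the agent, and running training and testing

Use the Gym library to create the environment we need

Instantiate the SarsaAgent class to create an agent object, setting the hyperparameters at the same time

Train for 500 episodes and look at the result of each one

Run a test once training has finished

# Create the maze environment with gym; is_slippery=False makes the environment easier
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up

# Create an agent instance and pass in the hyperparameters
agent = SarsaAgent(
    obs_n=env.observation_space.n, # 16 states, one for each cell of the 4*4 grid
    act_n=env.action_space.n, # 4 actions: 0 left, 1 down, 2 right, 3 up
    learning_rate=0.1, # learning rate
    gamma=0.9, # discount factor applied to the next step
    e_greed=0.1) # probability of choosing a random action

# Train for 500 episodes and print the score of each episode
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training is finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))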

Output:

Episode 0: steps = 6 , reward = 0.0
Episode 1: steps = 17 , reward = 0.0
Episode 2: steps = 9 , reward = 0.0
Episode 3: steps = 2 , reward = 0.0
Episode 4: steps = 8 , reward = 0.0
Episode 5: steps = 8 , reward = 0.0
Episode 6: steps = 14 , reward = 0.0
Episode 7: steps = 7 , reward = 0.0
Episode 8: steps = 7 , reward = 0.0
Episode 9: steps = 2 , reward = 0.0
Episode 10: steps = 3 , reward = 0.0
Episode 11: steps = 8 , reward = 0.0
Episode 12: steps = 3 , reward = 0.0
Episode 13: steps = 8 , reward = 0.0
Episode 14: steps = 6 , reward = 0.0
Episode 15: steps = 5 , reward = 0.0
Episode 16: steps = 5 , reward = 0.0
Episode 17: steps = 7 , reward = 0.0
Episode 18: steps = 2 , reward = 0.0
Episode 19: steps = 7 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 7 , reward = 0.0
Episode 22: steps = 6 , reward = 0.0
Episode 23: steps = 3 , reward = 0.0
Episode 24: steps = 4 , reward = 0.0
Episode 25: steps = 4 , reward = 0.0
Episode 26: steps = 17 , reward = 0.0
Episode 27: steps = 11 , reward = 0.0
Episode 28: steps = 4 , reward = 0.0
Episode 29: steps = 9 , reward = 0.0
Episode 30: steps = 3 , reward = 0.0
Episode 31: steps = 11 , reward = 0.0
Episode 32: steps = 7 , reward = 0.0
Episode 33: steps = 3 , reward = 0.0
Episode 34: steps = 16 , reward = 0.0
Episode 35: steps = 10 , reward = 0.0
Episode 36: steps = 2 , reward = 0.0
Episode 37: steps = 9 , reward = 0.0
Episode 38: steps = 9 , reward = 0.0
Episode 39: steps = 19 , reward = 1.0
Episode 40: steps = 6 , reward = 0.0
Episode 41: steps = 6 , reward = 0.0
Episode 42: steps = 7 , reward = 0.0
Episode 43: steps = 4 , reward = 0.0
Episode 44: steps = 4 , reward = 0.0
Episode 45: steps = 5 , reward = 0.0
Episode 46: steps = 4 , reward = 0.0
Episode 47: steps = 22 , reward = 1.0
Episode 48: steps = 2 , reward = 0.0
Episode 49: steps = 2 , reward = 0.0
Episode 50: steps = 2 , reward = 0.0
Episode 51: steps = 17 , reward = 0.0
Episode 52: steps = 14 , reward = 0.0
Episode 53: steps = 6 , reward = 0.0
Episode 54: steps = 8 , reward = 0.0
Episode 55: steps = 18 , reward = 0.0
Episode 56: steps = 5 , reward = 0.0
Episode 57: steps = 2 , reward = 0.0
Episode 58: steps = 8 , reward = 0.0
Episode 59: steps = 4 , reward = 0.0
Episode 60: steps = 10 , reward = 0.0
Episode 61: steps = 2 , reward = 0.0
Episode 62: steps = 11 , reward = 0.0
Episode 63: steps = 21 , reward = 0.0
Episode 64: steps = 4 , reward = 0.0
Episode 65: steps = 2 , reward = 0.0
Episode 66: steps = 3 , reward = 0.0
Episode 67: steps = 3 , reward = 0.0
Episode 68: steps = 18 , reward = 1.0
Episode 69: steps = 6 , reward = 0.0
Episode 70: steps = 8 , reward = 0.0
Episode 71: steps = 8 , reward = 0.0
Episode 72: steps = 4 , reward = 0.0
Episode 73: steps = 13 , reward = 0.0
Episode 74: steps = 3 , reward = 0.0
Episode 75: steps = 7 , reward = 0.0
Episode 76: steps = 8 , reward = 0.0
Episode 77: steps = 3 , reward = 0.0
Episode 78: steps = 7 , reward = 0.0
Episode 79: steps = 8 , reward = 0.0
Episode 80: steps = 7 , reward = 0.0
Episode 81: steps = 10 , reward = 1.0
Episode 82: steps = 6 , reward = 1.0
Episode 83: steps = 9 , reward = 1.0
Episode 84: steps = 6 , reward = 0.0
Episode 85: steps = 6 , reward = 1.0
Episode 86: steps = 3 , reward = 0.0
Episode 87: steps = 7 , reward = 1.0
Episode 88: steps = 6 , reward = 1.0
Episode 89: steps = 7 , reward = 1.0
Episode 90: steps = 6 , reward = 1.0
Episode 91: steps = 6 , reward = 1.0
Episode 92: steps = 10 , reward = 1.0
Episode 93: steps = 6 , reward = 1.0
Episode 94: steps = 8 , reward = 1.0
Episode 95: steps = 6 , reward = 1.0
Episode 96: steps = 7 , reward = 1.0
Episode 97: steps = 6 , reward = 1.0
Episode 98: steps = 6 , reward = 1.0
Episode 99: steps = 8 , reward = 1.0
Episode 100: steps = 6 , reward = 1.0
Episode 101: steps = 8 , reward = 1.0
Episode 102: steps = 6 , reward = 1.0
Episode 103: steps = 6 , reward = 1.0
Episode 104: steps = 6 , reward = 1.0
Episode 105: steps = 8 , reward = 1.0
Episode 106: steps = 6 , reward = 1.0
Episode 107: steps = 6 , reward = 1.0
Episode 108: steps = 6 , reward = 1.0
Episode 109: steps = 6 , reward = 1.0
Episode 110: steps = 4 , reward = 0.0
Episode 111: steps = 6 , reward = 1.0
Episode 112: steps = 6 , reward = 1.0
Episode 113: steps = 6 , reward = 1.0
Episode 114: steps = 6 , reward = 1.0
Episode 115: steps = 7 , reward = 1.0
Episode 116: steps = 7 , reward = 1.0
Episode 117: steps = 10 , reward = 1.0
Episode 118: steps = 5 , reward = 0.0
Episode 119: steps = 6 , reward = 1.0
Episode 120: steps = 3 , reward = 0.0
Episode 121: steps = 6 , reward = 1.0
Episode 122: steps = 6 , reward = 1.0
Episode 123: steps = 9 , reward = 1.0
Episode 124: steps = 6 , reward = 1.0
Episode 125: steps = 5 , reward = 0.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 8 , reward = 1.0
Episode 129: steps = 6 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 8 , reward = 1.0
Episode 132: steps = 8 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 6 , reward = 1.0
Episode 135: steps = 6 , reward = 1.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 6 , reward = 1.0
Episode 138: steps = 6 , reward = 1.0
Episode 139: steps = 4 , reward = 0.0
Episode 140: steps = 6 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 9 , reward = 1.0
Episode 144: steps = 6 , reward = 1.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 6 , reward = 1.0
Episode 147: steps = 7 , reward = 1.0
Episode 148: steps = 7 , reward = 1.0
Episode 149: steps = 6 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 6 , reward = 1.0
Episode 152: steps = 7 , reward = 1.0
Episode 153: steps = 6 , reward = 1.0
Episode 154: steps = 6 , reward = 1.0
Episode 155: steps = 7 , reward = 1.0
Episode 156: steps = 7 , reward = 1.0
Episode 157: steps = 7 , reward = 1.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 6 , reward = 1.0
Episode 161: steps = 4 , reward = 0.0
Episode 162: steps = 6 , reward = 1.0
Episode 163: steps = 5 , reward = 0.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 6 , reward = 1.0
Episode 166: steps = 6 , reward = 1.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 9 , reward = 1.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 8 , reward = 1.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 6 , reward = 1.0
Episode 178: steps = 8 , reward = 1.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 3 , reward = 0.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 8 , reward = 1.0
Episode 186: steps = 10 , reward = 1.0
Episode 187: steps = 8 , reward = 1.0
Episode 188: steps = 6 , reward = 1.0
Episode 189: steps = 6 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 7 , reward = 1.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 8 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 4 , reward = 0.0
Episode 198: steps = 5 , reward = 0.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 6 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 4 , reward = 0.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 8 , reward = 1.0
Episode 205: steps = 7 , reward = 1.0
Episode 206: steps = 6 , reward = 1.0
Episode 207: steps = 6 , reward = 1.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 8 , reward = 1.0
Episode 210: steps = 7 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 10 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 6 , reward = 1.0
Episode 216: steps = 6 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 6 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 7 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 6 , reward = 1.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 6 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 7 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 6 , reward = 1.0
Episode 231: steps = 10 , reward = 1.0
Episode 232: steps = 6 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 8 , reward = 1.0
Episode 236: steps = 6 , reward = 1.0
Episode 237: steps = 6 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 8 , reward = 1.0
Episode 240: steps = 6 , reward = 1.0
Episode 241: steps = 6 , reward = 1.0
Episode 242: steps = 8 , reward = 1.0
Episode 243: steps = 2 , reward = 0.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 6 , reward = 1.0
Episode 246: steps = 6 , reward = 1.0
Episode 247: steps = 6 , reward = 1.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 7 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 2 , reward = 0.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 6 , reward = 1.0
Episode 255: steps = 6 , reward = 1.0
Episode 256: steps = 8 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 6 , reward = 1.0
Episode 259: steps = 7 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 7 , reward = 1.0
Episode 263: steps = 6 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 6 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 7 , reward = 1.0
Episode 268: steps = 6 , reward = 1.0
Episode 269: steps = 6 , reward = 1.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 6 , reward = 1.0
Episode 272: steps = 6 , reward = 1.0
Episode 273: steps = 7 , reward = 1.0
Episode 274: steps = 3 , reward = 0.0
Episode 275: steps = 8 , reward = 1.0
Episode 276: steps = 7 , reward = 1.0
Episode 277: steps = 4 , reward = 0.0
Episode 278: steps = 6 , reward = 1.0
Episode 279: steps = 4 , reward = 0.0
Episode 280: steps = 7 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 6 , reward = 1.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 7 , reward = 1.0
Episode 286: steps = 8 , reward = 1.0
Episode 287: steps = 6 , reward = 1.0
Episode 288: steps = 5 , reward = 0.0
Episode 289: steps = 8 , reward = 1.0
Episode 290: steps = 7 , reward = 1.0
Episode 291: steps = 8 , reward = 1.0
Episode 292: steps = 4 , reward = 0.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 9 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 0.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 6 , reward = 1.0
Episode 301: steps = 5 , reward = 0.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 7 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 8 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 6 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 4 , reward = 0.0
Episode 311: steps = 7 , reward = 1.0
Episode 312: steps = 8 , reward = 1.0
Episode 313: steps = 7 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 7 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 7 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 6 , reward = 1.0
Episode 325: steps = 6 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 6 , reward = 1.0
Episode 333: steps = 6 , reward = 1.0
Episode 334: steps = 3 , reward = 0.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 4 , reward = 0.0
Episode 338: steps = 6 , reward = 1.0
Episode 339: steps = 8 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 7 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 10 , reward = 1.0
Episode 354: steps = 3 , reward = 0.0
Episode 355: steps = 7 , reward = 1.0
Episode 356: steps = 7 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 2 , reward = 0.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 7 , reward = 1.0
Episode 363: steps = 8 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 2 , reward = 0.0
Episode 366: steps = 6 , reward = 1.0
Episode 367: steps = 5 , reward = 0.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 6 , reward = 1.0
Episode 374: steps = 8 , reward = 1.0
Episode 375: steps = 9 , reward = 1.0
Episode 376: steps = 6 , reward = 0.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 8 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 8 , reward = 1.0
Episode 387: steps = 6 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 2 , reward = 0.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 6 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 7 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 6 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 7 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 6 , reward = 1.0
Episode 403: steps = 6 , reward = 1.0
Episode 404: steps = 8 , reward = 1.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 6 , reward = 1.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 9 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 4 , reward = 0.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 7 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 6 , reward = 1.0
Episode 423: steps = 10 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 8 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 9 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 4 , reward = 0.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 6 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 8 , reward = 1.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 8 , reward = 1.0
Episode 440: steps = 2 , reward = 0.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 10 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 8 , reward = 1.0
Episode 446: steps = 6 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 5 , reward = 0.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 8 , reward = 1.0
Episode 451: steps = 6 , reward = 1.0
Episode 452: steps = 8 , reward = 1.0
Episode 453: steps = 8 , reward = 1.0
Episode 454: steps = 7 , reward = 1.0
Episode 455: steps = 5 , reward = 0.0
Episode 456: steps = 6 , reward = 1.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 8 , reward = 1.0
Episode 459: steps = 8 , reward = 1.0
Episode 460: steps = 10 , reward = 1.0
Episode 461: steps = 8 , reward = 1.0
Episode 462: steps = 7 , reward = 1.0
Episode 463: steps = 7 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 6 , reward = 1.0
Episode 466: steps = 6 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 3 , reward = 0.0
Episode 471: steps = 7 , reward = 1.0
Episode 472: steps = 6 , reward = 1.0
Episode 473: steps = 6 , reward = 1.0
Episode 474: steps = 7 , reward = 1.0
Episode 475: steps = 6 , reward = 1.0
Episode 476: steps = 8 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 6 , reward = 1.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 5 , reward = 0.0
Episode 485: steps = 6 , reward = 1.0
Episode 486: steps = 9 , reward = 1.0
Episode 487: steps = 7 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 9 , reward = 1.0
Episode 493: steps = 6 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 9 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 6 , reward = 1.0
Episode 499: steps = 7 , reward = 1.0
(Down)
SFFF
FHFH
FFFH
HFFG
(Down)
SFFF
FHFH
FFFH
HFFG
(Right)
SFFF
FHFH
FFFH
HFFG
(Down)
SFFF
FHFH
FFFH
HFFG
(Right)
SFFF
FHFH
FFFH
HFFG
(Right)
SFFF
FHFH
FFFH
HFFG
test reward = 1.0
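A long per-episode log like this is easier to judge as a success rate over a range of episodes. The following is a minimal sketch, assuming the log lines keep exactly the 'Episode %s: steps = %s , reward = %.1f' format printed by the training loop; parse_log and success_rate are hypothetical helper names, and the four sample lines are copied from the output above.

```python
import re

# Matches the format printed by the training loop; trailing text is ignored.
LINE_RE = re.compile(r"Episode (\d+): steps = (\d+) , reward = ([\d.]+)")

def parse_log(lines):
    """Turn log lines into (episode, steps, reward) tuples, skipping other lines."""
    records = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            records.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))
    return records

def success_rate(records, start, stop):
    """Mean reward over episodes in [start, stop); reward is 1.0 only when the goal is reached."""
    rewards = [reward for episode, _, reward in records if start <= episode < stop]
    return sum(rewards) / len(rewards) if rewards else float("nan")

sample = [
    "Episode 0: steps = 6 , reward = 0.0",
    "Episode 39: steps = 19 , reward = 1.0",
    "Episode 498: steps = 6 , reward = 1.0",
    "Episode 499: steps = 7 , reward = 1.0",
]
records = parse_log(sample)
print(success_rate(records, 0, 100), success_rate(records, 400, 500))
```

Feeding the full log into parse_log and calling success_rate on windows such as 0 to 100 and 400 to 500 would quantify how the share of successful episodes changes over training.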

6. Analyzing the results

We can take a look at the Q table at the end of training:

print(agent.Q)

Output:

[[0.27140285 0.4364344  0.09145568 0.15201279]
 [0.26813138 0.         0.         0.00945424]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.26636559 0.51632351 0.         0.13684245]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.33346755 0.         0.68004322 0.31572772]
 [0.26970648 0.77477987 0.35436455 0.        ]
 [0.04662094 0.73217092 0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.39939922 0.88159607 0.11581402]
 [0.4472322  0.72976712 1.         0.40947544]
 [0.         0.         0.         0.        ]]

The 16 cells are laid out as follows:

SFFF
FHFH
FFFH
HFFG

Here S is the start, F is safe ground, H is a hole (falling in ends the game), and G is the goal (reaching it wins the game)

The index of each cell:

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 15
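The indices run row by row, so a cell index maps to a (row, column) pair by integer division. The sketch below illustrates this together with the deterministic movement that applies when is_slippery=False; the helper names to_row_col and next_state are made up for this example and are not part of Gym.

```python
N_ROWS, N_COLS = 4, 4
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3   # action numbering used throughout this article

def to_row_col(state):
    """Convert a cell index (0..15) into (row, col)."""
    return divmod(state, N_COLS)

def next_state(state, action):
    """Cell reached by taking action in state; moving into a wall leaves the cell unchanged."""
    row, col = to_row_col(state)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N_COLS + col

print(to_row_col(9))          # cell 9 sits in row 2, column 1
print(next_state(0, DOWN))    # from the start cell, moving down
print(next_state(0, LEFT))    # blocked by the left wall
```

Moving down adds 4 to the index and moving right adds 1, which is why the walkthrough below goes from cell 0 to 4 to 8 and then steps right to 9.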

When the test starts, the agent is in cell 0, where the Q values of the 4 actions are:

[0.27140285 0.4364344  0.09145568 0.15201279]

These 4 Q values correspond to: 0 left, 1 down, 2 right, 3 up

The largest value, 0.4364344, belongs to action 1, so the agent moves down one cell

This takes it to cell 4:

[0.26636559 0.51632351 0.         0.13684245]

It chooses 1 (down) and reaches cell 8:

[0.33346755 0.         0.68004322 0.31572772]

It chooses 2 (right) and reaches cell 9:

[0.26970648 0.77477987 0.35436455 0.        ]

It chooses 1 (down) and reaches cell 13:

[0.         0.39939922 0.88159607 0.11581402]

It chooses 2 (right) and reaches cell 14:

[0.4472322  0.72976712 1.         0.40947544]

It chooses 2 (right) and reaches cell 15: the goal!
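The same walk can be reproduced programmatically. The sketch below copies the non-zero rows of the printed Q table and follows the largest Q value from cell 0 until a hole or the goal is reached. The helpers move and greedy_path are hypothetical, movement is assumed to be deterministic as with is_slippery=False, and np.argmax takes the first maximum, whereas predict() breaks ties at random (there are no ties along this path).

```python
import numpy as np

GRID = "SFFF" "FHFH" "FFFH" "HFFG"          # the 16 cells, row by row
ACTIONS = ["left", "down", "right", "up"]   # action indices 0..3

# Non-zero rows of the Q table printed above; every other row stays at zero.
Q = np.zeros((16, 4))
Q[0] = [0.27140285, 0.4364344, 0.09145568, 0.15201279]
Q[1] = [0.26813138, 0.0, 0.0, 0.00945424]
Q[4] = [0.26636559, 0.51632351, 0.0, 0.13684245]
Q[8] = [0.33346755, 0.0, 0.68004322, 0.31572772]
Q[9] = [0.26970648, 0.77477987, 0.35436455, 0.0]
Q[10] = [0.04662094, 0.73217092, 0.0, 0.0]
Q[13] = [0.0, 0.39939922, 0.88159607, 0.11581402]
Q[14] = [0.4472322, 0.72976712, 1.0, 0.40947544]

def move(state, action):
    """Deterministic move on the 4x4 grid; walls keep the agent in place."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    return row * 4 + col

def greedy_path(Q, start=0, max_steps=20):
    """Follow the largest Q value from start until a hole (H) or the goal (G)."""
    state = start
    path, actions = [state], []
    for _ in range(max_steps):
        if GRID[state] in "HG":
            break
        action = int(np.argmax(Q[state]))
        actions.append(ACTIONS[action])
        state = move(state, action)
        path.append(state)
    return path, actions

path, actions = greedy_path(Q)
print(path)     # [0, 4, 8, 9, 13, 14, 15]
print(actions)  # ['down', 'down', 'right', 'down', 'right', 'right']
```

The six actions match the (Down), (Down), (Right), (Down), (Right), (Right) sequence rendered during the test episode.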

