[Hands-on Reinforcement Learning] 2. Solving CartPole-v0 with DQN


If you are new to reinforcement learning, this Zhihu thread is a good starting point:

How to get started with reinforcement learning? (www.zhihu.com)

While recently tidying up some reinforcement learning code I wrote a while ago, I noticed that my PyTorch code was still written against an old version.

PyTorch shipped a major release this year, moving up to 0.4, and a lot of the old code no longer runs on it, so I rewrote the DQN code for the CartPole-v0 environment on the latest version.

  • The code has been simplified; most versions floating around online are either outdated or messy.
  • A live plotting function has been added.
  • With these changes the agent reaches 200 steps quickly, but performance is unstable later in training; the exploration-exploitation trade-off still needs careful tuning (see the sketch of the ε schedule right after this list).
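
That tuning happens in the ε schedule. For reference, here is a minimal sketch of the annealed ε-greedy schedule the agent uses (same formula and default values as in act() below; the printed numbers are just illustrative):

import math

epsi_high, epsi_low, decay = 0.9, 0.05, 200  # defaults from the training script

def epsilon(steps):
    # Anneal exponentially from epsi_high towards epsi_low as the step count grows.
    return epsi_low + (epsi_high - epsi_low) * math.exp(-1.0 * steps / decay)

for steps in (0, 200, 500, 1000):
    print(steps, round(epsilon(steps), 3))
# prints roughly: 0 -> 0.9, 200 -> 0.363, 500 -> 0.12, 1000 -> 0.056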

The CartPole-v0 environment:

Gym: A toolkit for developing and comparing reinforcement learning algorithms (gym.openai.com)

The full DQN CartPole-v0 source code is on GitHub; forks and stars are welcome:

https://github.com/hangsz/reinforcement_learning

You need the gym and pytorch libraries.

To install gym: pip install gym

For pytorch, pick the build that matches your system: https://pytorch.org/get-started/locally/
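
A quick sanity check that both libraries are installed correctly (a minimal sketch; the printed shapes are those of CartPole-v0):

import gym
import torch

env = gym.make('CartPole-v0')
print(torch.__version__)            # installed PyTorch version
print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole tip velocity
print(env.action_space.n)           # 2: push the cart left or right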

Training animation: https://www.zhihu.com/video/1193285883359604736
dqn.py, which defines the Q-network and the agent:

# coding: utf-8
__author__ = 'zhenhang.sun@gmail.com'
__version__ = '1.0.0'

import gym
import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


# A small two-layer MLP mapping a state to one Q-value per action.
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.linear2(x)
        return x


class Agent(object):
    def __init__(self, **kwargs):
        # Hyperparameters (gamma, lr, capacity, batch_size, ...) are passed in as keyword arguments.
        for key, value in kwargs.items():
            setattr(self, key, value)
        self.eval_net = Net(self.state_space_dim, 256, self.action_space_dim)
        self.optimizer = optim.Adam(self.eval_net.parameters(), lr=self.lr)
        self.buffer = []
        self.steps = 0

    def act(self, s0):
        # epsilon-greedy action selection with an exponentially decaying epsilon.
        self.steps += 1
        epsi = self.epsi_low + (self.epsi_high - self.epsi_low) * math.exp(-1.0 * self.steps / self.decay)
        if random.random() < epsi:
            a0 = random.randrange(self.action_space_dim)
        else:
            s0 = torch.tensor(s0, dtype=torch.float).view(1, -1)
            a0 = torch.argmax(self.eval_net(s0)).item()
        return a0

    def put(self, *transition):
        # Store a (s0, a0, r1, s1) transition; drop the oldest one once the buffer is full.
        if len(self.buffer) == self.capacity:
            self.buffer.pop(0)
        self.buffer.append(transition)

    def learn(self):
        if len(self.buffer) < self.batch_size:
            return

        # Sample a minibatch of transitions from the replay buffer.
        samples = random.sample(self.buffer, self.batch_size)
        s0, a0, r1, s1 = zip(*samples)
        s0 = torch.tensor(s0, dtype=torch.float)
        a0 = torch.tensor(a0, dtype=torch.long).view(self.batch_size, -1)
        r1 = torch.tensor(r1, dtype=torch.float).view(self.batch_size, -1)
        s1 = torch.tensor(s1, dtype=torch.float)

        # TD target: r1 + gamma * max_a Q(s1, a); detach() blocks gradients through the target.
        y_true = r1 + self.gamma * torch.max(self.eval_net(s1).detach(), dim=1)[0].view(self.batch_size, -1)
        y_pred = self.eval_net(s0).gather(1, a0)

        loss_fn = nn.MSELoss()
        loss = loss_fn(y_pred, y_true)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
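
For reference, learn() above is a plain DQN update with a single network (no separate target network): each minibatch is fitted to the one-step TD target

    y = r1 + γ · max_a Q(s1, a),    loss = MSE(Q(s0, a0), y)

where the detach() call keeps gradients from flowing through the target. Skipping the target network keeps the code short, but it is one reason the score can oscillate later in training.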
The training script, which imports Agent from dqn.py:

# coding: utf-8
__author__ = 'zhenhang.sun@gmail.com'
__version__ = '1.0.0'

import gym
from IPython import display
import matplotlib.pyplot as plt

from dqn import Agent


def plot(score, mean):
    # Live plot of the episode score and its running mean (intended for a Jupyter notebook).
    display.clear_output(wait=True)
    display.display(plt.gcf())
    plt.figure(figsize=(20, 10))
    plt.clf()
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(score)
    plt.plot(mean)
    plt.text(len(score) - 1, score[-1], str(score[-1]))
    plt.text(len(mean) - 1, mean[-1], str(mean[-1]))


if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    params = {
        'gamma': 0.8,
        'epsi_high': 0.9,
        'epsi_low': 0.05,
        'decay': 200,
        'lr': 0.001,
        'capacity': 10000,
        'batch_size': 64,
        'state_space_dim': env.observation_space.shape[0],
        'action_space_dim': env.action_space.n
    }
    agent = Agent(**params)

    score = []
    mean = []

    for episode in range(1000):
        s0 = env.reset()
        total_reward = 1
        while True:
            env.render()
            a0 = agent.act(s0)
            s1, r1, done, _ = env.step(a0)

            if done:
                r1 = -1  # penalize the terminal transition (pole fell or cart left the track)

            agent.put(s0, a0, r1, s1)

            if done:
                break

            total_reward += r1
            s0 = s1
            agent.learn()  # one gradient step per environment step

        score.append(total_reward)
        mean.append(sum(score[-100:]) / 100)  # running mean over the last 100 episodes
        plot(score, mean)
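Note that plot() relies on IPython.display, so the script above is meant to be run inside a Jupyter notebook. If you run it as a plain Python script instead, one option (a sketch, not part of the original repo) is to drop the IPython calls and use matplotlib's interactive mode:

import matplotlib.pyplot as plt

plt.ion()  # interactive mode: the figure window updates without blocking

def plot(score, mean):
    plt.figure(1, figsize=(20, 10))
    plt.clf()
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(score)
    plt.plot(mean)
    plt.pause(0.001)  # let the GUI event loop redraw the figure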


