RLlib Learning [1] -- Basic RLlib Commands
Table of Contents
- Ray Overview
- Installing RLlib
- RLlib Overview
- The RLlib Framework
- Training with a Trainer
- Trainer Configuration
- Retrieving the Trained Model / Policy
- Policy Model Configuration
- Reinforcement Learning Directly with Tune
- References
Ray Overview
Ray is a fast and simple framework for building and running distributed applications.
Ray accomplishes this by:
- Providing simple primitives for building and running distributed applications.
- Letting end users parallelize single-machine code with almost no code changes (see the short sketch after this list).
- Including a large ecosystem of applications, libraries, and tools on top of Ray Core to support complex applications.
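As a quick illustration of the second point (a minimal sketch of my own, not taken from the post; the `square` function is just an example), an ordinary Python function becomes a parallel Ray task with a single decorator:

```python
import ray

ray.init()

@ray.remote            # turns the plain function into a Ray task
def square(x):
    return x * x

# launch four tasks in parallel, then block until all results are ready
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))   # [0, 1, 4, 9]
```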
Four main libraries are built on top of Ray Core:
- Tune: hyperparameter tuning
- RLlib: reinforcement learning
- RaySGD: wrappers for distributed training
- Ray Serve: serving applications and models
Installing RLlib
If you want to use Atari environments, PyTorch, TensorFlow, etc., you need to install them yourself. If you use a GPU, install the matching GPU build of PyTorch/TensorFlow beforehand, so that the Ray installation does not automatically pull in an unsuitable version.
```bash
pip install -U ray
pip install -U ray[tune]
pip install ray[default]
pip install -U "ray[rllib]"
```
RLlib Overview
(Figure: the RLlib stack, from the Ray engine at the bottom up to supported applications at the top.)
As the figure above shows, the distributed computation at the bottom is supported by the Ray engine. The second layer from the bottom shows that RLlib is an abstraction over specific reinforcement learning tasks. The layer above that is developer-facing: here we can customize algorithms. The top layer is RLlib's support for applications, for example letting agents interact with offline data, Gym environments, or Unity3D environments.
For a pure reinforcement learning algorithm we can directly call the functions/classes already provided by RLlib. But if you need to modify the policy / value-function model, plug in your own experience replay, add imitation learning, add environment dynamics, and so on, you have to modify the corresponding RLlib module. **Only the relevant module needs to change; all other modules stay untouched.** That is why it is important to understand each module in RLlib.
The RLlib Framework

For a single agent in a single environment, the Policy is created inside the trainer, and from the policy we obtain the value function / policy function we need. The samples required for updates are generated in workers. We can create one worker or several at once; each worker gets its action commands from the single trainer and generates samples. The samples from the different workers are merged and passed back to the trainer, which updates the policy or stores the experience (experience replay). During training we specify how many samples each iteration should use; that amount is split evenly across the workers, and each worker sends its samples back to the trainer once it has generated its share. This mode is called "truncate_episodes": a worker is not required to finish its current episode. The other mode is "complete_episodes", where a worker must run episodes to completion; it stops once the number of samples reaches the given size (if one episode is not enough, it runs several).
For a single agent with multiple environments, there is not one Env but several, corresponding to VectorEnv.
For multiple agents in a single environment, there may be several policies, i.e., one trainer holds multiple policies, each controlling different agents.
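These ideas map onto a handful of trainer configuration keys, which are listed in full in the "Trainer Configuration" section below. A rough sketch (the values here are purely illustrative):

```python
# Illustrative values only -- see the full COMMON_CONFIG listing below.
rollout_sketch = {
    "num_workers": 4,                   # number of rollout workers producing samples
    "num_envs_per_worker": 2,           # single agent, several (vectorized) envs per worker
    "rollout_fragment_length": 100,     # each worker returns fragments of this many steps
    "train_batch_size": 800,            # fragments are concatenated to this size before an update
    "batch_mode": "truncate_episodes",  # or "complete_episodes": workers must finish their episodes
}
```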
(Figure: the modules that make up RLlib -- trainer/model, policy, preprocessor, filter, environment, etc.)
This figure shows the individual modules of RLlib. The trainer described above corresponds to the Model part; preprocessors and filters come with predefined implementations and can be extended as needed, while the environment has to be defined by yourself. How to hook into the different modules will be covered later; this post focuses on using a ready-made trainer to run a complete training pipeline.
Training with a Trainer
Here we train CartPole from Gym. How to set up and train your own environment is covered in the next post. Only fixed parameters are used here; see the next section for the available training parameters.
```python
import ray                                # core package
import ray.rllib.agents.ppo as ppo        # provides PPOTrainer
from ray.tune.logger import pretty_print  # pretty-prints result dicts

ray.shutdown()   # in case a Ray instance is already running
ray.init()

# start from the default PPO parameters
ppoconfig = ppo.DEFAULT_CONFIG.copy()

# override some of the defaults
ppoconfig["num_gpus"] = 0      # no GPU
ppoconfig["num_workers"] = 1   # a single rollout worker

# build the trainer on the Gym environment (custom environments: see the next post)
trainer = ppo.PPOTrainer(config=ppoconfig, env="CartPole-v0")

# training loop
MAX_TRAIN_NUM = 50
for i in range(MAX_TRAIN_NUM):
    # sample once, then update the parameters once
    result = trainer.train()
    print(pretty_print(result))   # print the results of this iteration

    # save a checkpoint every 25 iterations and at the end
    if i % 25 == 0 or i == MAX_TRAIN_NUM - 1:
        # trainer.save(log_dir) can store the checkpoint at a chosen location
        checkpoint = trainer.save("checkpoints/cartpole" + str(i))
        print("checkpoint saved at", checkpoint)
```
RLlib's default evaluation metrics, such as the episode length (max, min, mean) and the episode reward (max, min, mean), are automatically written to ray_results. The ray_results directory is created in your home directory (~/ray_results) and can be opened directly with TensorBoard (e.g. `tensorboard --logdir=~/ray_results`).
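If only a few of these metrics are needed inside the training loop, they can also be read directly from the result dict returned by `trainer.train()`. A small sketch using the standard result keys:

```python
# inside the training loop above
result = trainer.train()
print("iter:", result["training_iteration"],
      "| reward mean/min/max:",
      result["episode_reward_mean"],
      result["episode_reward_min"],
      result["episode_reward_max"],
      "| mean episode length:", result["episode_len_mean"])
```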
Trainer Configuration
All trainers share the common options below (COMMON_CONFIG); algorithm-specific configs, such as PPO's DEFAULT_CONFIG, extend this dict.

```python
COMMON_CONFIG: TrainerConfigDict = {
    # === Settings for Rollout Worker processes ===
    # Number of rollout worker actors to create for parallel sampling.
    # Setting this to 0 forces rollouts to be done in the trainer actor.
    "num_workers": 2,
    # Number of environments to evaluate vector-wise per worker. This enables
    # model inference batching, which can improve performance for
    # inference-bottlenecked workloads.
    "num_envs_per_worker": 1,
    # When `num_workers` > 0, the driver (local_worker; worker_index=0) does
    # not need an environment, since it neither samples (done by remote
    # workers) nor evaluates (done by evaluation workers; see below).
    "create_env_on_driver": False,
    # Divide episodes into fragments of this many steps during rollouts.
    # Sample batches of this size are collected from rollout workers and
    # combined into a larger batch of `train_batch_size` for learning.
    # E.g., rollout_fragment_length=100 and train_batch_size=1000 means
    # 10 fragments of 100 steps each are collected and concatenated per epoch
    # of SGD. With multiple envs per worker, the fragment size is multiplied
    # by `num_envs_per_worker`. The dataflow can vary per algorithm; PPO, for
    # example, further divides the train batch into minibatches for
    # multi-epoch SGD.
    "rollout_fragment_length": 200,
    # How to build per-Sampler (RolloutWorker) batches, which are then usually
    # concatenated into the train batch ("steps" can mean env- or agent-steps,
    # depending on the multiagent `count_steps_by` setting below):
    # truncate_episodes: each produced batch contains exactly
    #   `rollout_fragment_length` steps; batches are evenly sized, but returns
    #   must be estimated at truncation boundaries (higher variance). I.e.,
    #   an update does not require complete episodes -- the batch size rules.
    # complete_episodes: each unroll covers exactly one episode from beginning
    #   to end; the batch size acts as a minimum number of steps and thus
    #   determines how many episodes go into one update.
    "batch_mode": "truncate_episodes",

    # === Settings for the Trainer process ===
    # Discount factor of the MDP.
    "gamma": 0.99,
    # The default learning rate.
    "lr": 0.0001,
    # Training batch size, if applicable. Should be >= rollout_fragment_length.
    # Sample batches are concatenated to this size and then passed to SGD.
    "train_batch_size": 200,
    # Arguments to pass to the policy model (see models/catalog.py for the
    # full list of available model options).
    "model": MODEL_DEFAULTS,
    # Arguments to pass to the policy optimizer (vary by optimizer).
    "optimizer": {},

    # === Environment Settings ===
    # Number of steps after which the episode is forced to terminate. Defaults
    # to `env.spec.max_episode_steps` (if present) for Gym envs.
    "horizon": None,
    # Calculate rewards but don't reset the environment when the horizon is
    # hit. This allows value estimation and RNN state to span across logical
    # episodes denoted by horizon. Only has an effect if horizon != inf.
    "soft_horizon": False,
    # Don't set 'done' at the end of the episode. Combined with `soft_horizon`:
    # - no_done_at_end=False, soft_horizon=False: reset env and add done=True
    #   at the end of each episode.
    # - no_done_at_end=True,  soft_horizon=False: reset env, but do NOT add done=True.
    # - no_done_at_end=False, soft_horizon=True: do NOT reset env at the horizon,
    #   but add done=True there (pretending the episode has terminated).
    # - no_done_at_end=True,  soft_horizon=True: neither reset nor add done=True.
    "no_done_at_end": False,
    # The environment specifier: either a tune-registered env, via
    # `tune.register_env([name], lambda env_ctx: [env object])`, or a string
    # RLlib can interpret as an openAI gym env, a PyBullet env, a ViZDoomGym
    # env, or a fully qualified classpath to an Env class, e.g.
    # "ray.rllib.examples.env.random_env.RandomEnv".
    "env": None,
    # The observation and action spaces for the Policies of this Trainer.
    # None = infer them automatically from the given env.
    "observation_space": None,
    "action_space": None,
    # Arguments dict passed to the env creator as an EnvContext object (a dict
    # plus the properties num_workers, worker_index, vector_index and remote).
    "env_config": {},
    # If num_envs_per_worker > 1, whether to create those envs in remote
    # processes instead of in the same worker. This adds overhead, but can
    # make sense if your envs are slow to step/reset (e.g. StarCraft).
    # Use this cautiously; overheads are significant.
    "remote_worker_envs": False,
    # Timeout that remote workers wait when polling environments. 0 (continue
    # as soon as at least one env is ready) is a reasonable default; the
    # optimal value depends on env step/reset and model inference speed.
    "remote_env_batch_wait_ms": 0,
    # A callable taking (last train results, base env, env context) and
    # returning a new task to set the env to. The env must be a
    # `TaskSettableEnv` subclass (see `examples/curriculum_learning.py`).
    "env_task_fn": None,
    # If True, try to render the environment on the local worker or on worker
    # 1 (if num_workers > 0). For vectorized envs usually only the first
    # sub-environment is rendered. The env must implement `render()`, which
    # either handles window generation and rendering itself (returning True)
    # or returns a numpy uint8 image of shape [height x width x 3 (RGB)].
    "render_env": False,
    # If True, store videos in this relative directory inside the default
    # output dir (~/ray_results/...). Alternatively an absolute path (str)
    # where the env recordings should be stored. False = record nothing.
    # Note: this setting replaces the deprecated `monitor` key.
    "record_env": False,
    # Whether to clip rewards during the Policy's postprocessing:
    # None (default): clip for Atari only (r=sign(r)).
    # True: r=sign(r), i.e. fixed rewards -1.0, 1.0 or 0.0.
    # False: never clip.
    # [float value]: clip at -value and +value.
    # Tuple[value1, value2]: clip at value1 and value2.
    "clip_rewards": None,
    # If True, RLlib learns entirely inside a normalized action space
    # (0.0 centered with small stddev; only affects Box components) and
    # unsquashes (and clips, just in case) actions back to the env's bounds
    # before sending them to the env.
    "normalize_actions": True,
    # If True, clip actions according to the env's bounds before sending them
    # back to the env.
    "clip_actions": False,
    # Whether to use "rllib" or "deepmind" preprocessors by default.
    # None = no preprocessor; the model then has to handle possibly complex
    # observations from the environment.
    "preprocessor_pref": "deepmind",

    # === Debug Settings ===
    # ray.rllib.* log level for the agent process and its workers: one of
    # DEBUG, INFO, WARN, ERROR. DEBUG also periodically prints summaries of
    # relevant internal dataflow. With the `rllib train` command, -v / -vv
    # are shorthand for INFO / DEBUG.
    "log_level": "WARN",
    # Callbacks run during various phases of training (see the
    # `DefaultCallbacks` class and `examples/custom_metrics_and_callbacks.py`).
    "callbacks": DefaultCallbacks,
    # Whether to continue training if a worker crashes. The number of
    # currently healthy workers is reported as the "num_healthy_workers" metric.
    "ignore_worker_failures": False,
    # Whether, upon a worker failure, to recreate the lost worker as an
    # identical copy (same `worker_index`, but `self.recreated_worker=True`).
    # If True, `ignore_worker_failures` is ignored.
    "recreate_failed_workers": False,
    # Log system resource metrics to results (requires `psutil` for sys stats
    # and `gputil` for GPU metrics).
    "log_sys_usage": True,
    # Use fake (infinite speed) sampler. For testing only.
    "fake_sampler": False,

    # === Deep Learning Framework Settings ===
    # tf: TensorFlow (static-graph)
    # tf2: TensorFlow 2.x (eager or traced, if eager_tracing=True)
    # tfe: TensorFlow eager (or traced, if eager_tracing=True)
    # torch: PyTorch
    "framework": "tf",
    # Enable tracing in eager mode. Greatly improves performance (~2x speedup),
    # but makes debugging slightly harder since Python code is not evaluated
    # after the initial eager pass. Only possible if framework=[tf2|tfe].
    "eager_tracing": False,
    # Maximum number of tf.function re-traces before a runtime error is
    # raised, to catch unnoticed retraces inside the `..._eager_traced`
    # Policy (which can slow execution by a factor of 4). Only necessary for
    # framework=[tf2|tfe]. None = never throw an error for re-traces.
    "eager_max_retraces": 20,

    # === Exploration Settings ===
    # Default exploration behavior, iff `explore`=None is passed into
    # compute_action(s). False = no exploration (e.g., for evaluation).
    "explore": True,
    # Dict specifying the Exploration object's config.
    "exploration_config": {
        # The Exploration class to use: the name (str) of any class in the
        # `rllib.utils.exploration` package, the python class itself, or its
        # full location, e.g.
        # "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy".
        "type": "StochasticSampling",
        # Add constructor kwargs here (if any).
    },

    # === Evaluation Settings ===
    # Evaluate with every `evaluation_interval` training iterations. The stats
    # are reported under the "evaluation" metric key. Note that for Ape-X,
    # metrics are already only reported for the lowest-epsilon (least random)
    # workers. None (or 0) = no evaluation.
    "evaluation_interval": None,
    # Duration to run evaluation for each `evaluation_interval`, counted in
    # `evaluation_duration_unit` ("episodes" by default, or "timesteps").
    # With multiple evaluation workers (evaluation_num_workers > 1) the load
    # is split among them. "auto": with evaluation_parallel_to_training=True,
    # run as many episodes/timesteps as fit into the parallel training step
    # (error otherwise).
    "evaluation_duration": 10,
    # The unit in which to count the evaluation duration: "episodes" or "timesteps".
    "evaluation_duration_unit": "episodes",
    # Whether to run evaluation in parallel to the Trainer.train() call using
    # threading. E.g. evaluation_interval=2 -> every other training iteration,
    # Trainer.train() and Trainer.evaluate() run in parallel. Experimental:
    # possible race conditions for weight synching at the beginning of the
    # evaluation loop.
    "evaluation_parallel_to_training": False,
    # Internal flag that is set to True for evaluation workers.
    "in_evaluation": False,
    # Typical usage: pass extra args to the evaluation env creator and disable
    # exploration by computing deterministic actions. IMPORTANT: policy
    # gradient algorithms can find the optimal policy even if it is
    # stochastic; setting "explore": False here means the evaluation workers
    # will not use that optimal policy!
    "evaluation_config": {
        # Example: "env_config": {...}, "explore": False
    },

    # === Replay Buffer Settings ===
    # Dict specifying the ReplayBuffer's config, e.g.:
    # "replay_buffer_config": {
    #     # Any class obeying the ReplayBuffer API: the name (str) of a class
    #     # in the `rllib.utils.replay_buffers` package, the python class, or
    #     # its full location, e.g.
    #     # "ray.rllib.utils.replay_buffers.replay_buffer.ReplayBuffer".
    #     "type": "ReplayBuffer",
    #     # Capacity (in storage units) before eviction.
    #     "capacity": 10000,
    #     # How experiences are stored: "sequences" or "timesteps".
    #     "storage_unit": "timesteps",
    #     # Add constructor kwargs here (if any).
    # },

    # Number of parallel workers to use for evaluation. 0 (default) means
    # evaluation runs in the trainer process (only if evaluation_interval is
    # not None). Increasing this increases the Ray resource usage of the
    # trainer, since evaluation workers are created separately from the
    # rollout workers used to sample training data.
    "evaluation_num_workers": 0,
    # Customize the evaluation method: a function of signature
    # (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict
    # (see Trainer.evaluate() for the default implementation). The Trainer
    # guarantees all eval workers have the latest policy state beforehand.
    "custom_eval_function": None,
    # Always attach the latest available evaluation results to step results,
    # e.g. if Tune or another meta controller needs them at all times.
    "always_attach_evaluation_results": False,
    # Store raw custom metrics without calculating max, min, mean.
    "keep_per_episode_custom_metrics": False,

    # === Advanced Rollout Settings ===
    # Use a background thread for sampling (slightly off-policy; usually not
    # advisable unless your env specifically requires it).
    "sample_async": False,
    # The SampleCollector class used to collect and retrieve environment-,
    # model- and sampler data. Override the base class to implement your own
    # collection/buffering/retrieval logic.
    "sample_collector": SimpleListCollector,
    # Element-wise observation filter: "NoFilter" or "MeanStdFilter".
    "observation_filter": "NoFilter",
    # Whether to synchronize the statistics of remote filters.
    "synchronize_filters": True,
    # Configures TF for single-process operation by default.
    "tf_session_args": {
        # note: overridden by `local_tf_session_args`
        "intra_op_parallelism_threads": 2,
        "inter_op_parallelism_threads": 2,
        "gpu_options": {
            "allow_growth": True,
        },
        "log_device_placement": False,
        "device_count": {"CPU": 1},
        # Required by multi-GPU (num_gpus > 1).
        "allow_soft_placement": True,
    },
    # Override the following tf session args on the local worker.
    "local_tf_session_args": {
        # Allow a higher level of parallelism by default, but not unlimited,
        # since that can cause crashes with many concurrent drivers.
        "intra_op_parallelism_threads": 8,
        "inter_op_parallelism_threads": 8,
    },
    # Whether to LZ4-compress individual observations.
    "compress_observations": False,
    # Wait for metric batches for at most this many seconds; batches that have
    # not returned in time are collected in the next train iteration.
    "metrics_episode_collection_timeout_s": 180,
    # Smooth metrics over this many episodes.
    "metrics_num_episodes_for_smoothing": 100,
    # Minimum time to run one `train()` call for: if this limit has not been
    # reached after one `step_attempt()`, perform n more `step_attempt()`
    # calls until it has. None or 0 = no minimum time.
    "min_time_s_per_reporting": None,
    # Minimum train/sample timesteps to optimize for per `train()` call. Does
    # not affect learning, only the length of train iterations: additional
    # `step_attempt()` calls are made until the minimum timesteps have been
    # executed. None or 0 = no minimum timesteps.
    "min_train_timesteps_per_reporting": None,
    "min_sample_timesteps_per_reporting": None,
    # This argument, together with worker_index, sets the random seed of each
    # worker, so that identically configured trials have identical results.
    # This makes experiments reproducible.
    "seed": None,
    # Any extra python env vars to set in the trainer process,
    # e.g. {"OMP_NUM_THREADS": "16"}.
    "extra_python_environs_for_driver": {},
    # The extra python environments to set for worker processes.
    "extra_python_environs_for_worker": {},

    # === Resource Settings ===
    # Number of GPUs to allocate to the trainer process. Not all algorithms
    # can take advantage of trainer GPUs; multi-GPU support is currently only
    # available for tf-[PPO/IMPALA/DQN/PG]. Can be fractional (e.g., 0.3 GPUs).
    "num_gpus": 0,
    # Set to True for debugging (multi-)GPU functionality on a CPU machine.
    # GPU towers are simulated by graphs located on CPUs; use `num_gpus` to
    # test different numbers of fake GPUs.
    "_fake_gpus": False,
    # Number of CPUs to allocate per worker.
    "num_cpus_per_worker": 1,
    # Number of GPUs to allocate per worker (can be fractional). Usually only
    # needed if the env itself requires a GPU (e.g. a GPU-intensive video
    # game) or model inference is unusually expensive.
    "num_gpus_per_worker": 0,
    # Any custom Ray resources to allocate per worker.
    "custom_resources_per_worker": {},
    # Number of CPUs to allocate for the trainer. Only takes effect when
    # running in Tune; otherwise the trainer runs in the main program.
    "num_cpus_for_driver": 1,
    # The strategy for the placement group factory returned by
    # `Trainer.default_resource_request()`. A PlacementGroup defines which
    # devices (resources) should always be co-located on the same node.
    # E.g., a Trainer with 2 rollout workers and num_gpus=1 requests a
    # placement group with bundles
    # [{"gpu": 1, "cpu": 1}, {"cpu": 1}, {"cpu": 1}] (first bundle: driver;
    # other two: workers), placed according to:
    # "PACK": packs bundles into as few nodes as possible.
    # "SPREAD": places bundles across distinct nodes as evenly as possible.
    # "STRICT_PACK": packs bundles into one node (no spanning).
    # "STRICT_SPREAD": packs bundles across distinct nodes.
    "placement_strategy": "PACK",

    # === Offline Datasets ===
    # TODO(jungong, sven): potentially unify all input types under the
    # input and input_config keys, e.g.
    #   input: sampler,     input_config {env: CartPole-v0}
    #   input: json_reader, input_config {path: /tmp/}
    #   input: dataset,     input_config {format: parquet, path: /tmp/}
    # Specify how to generate experiences:
    # - "sampler": generate experiences via online (env) simulation (default).
    # - A local directory or file glob expression (e.g., "/tmp/*.json").
    # - A list of individual file paths/URIs
    #   (e.g., ["/tmp/1.json", "s3://bucket/2.json"]).
    # - A dict with string keys and sampling probabilities as values (e.g.,
    #   {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
    # - A callable taking an `IOContext` object as its only arg and returning
    #   a ray.rllib.offline.InputReader.
    # - A string key that indexes a callable with tune.registry.register_input.
    "input": "sampler",
    # Arguments accessible from the IOContext for configuring custom input.
    "input_config": {},
    # True if the actions in a given offline "input" are already normalized
    # (between -1.0 and 1.0), as is usually the case when the offline file was
    # generated by another RLlib algorithm (e.g. PPO or SAC) with
    # "normalize_actions" set to True.
    "actions_in_input_normalized": False,
    # How to evaluate the current policy when reading offline experiences
    # ("input" is not "sampler"):
    # - "wis": the weighted step-wise importance sampling estimator.
    # - "is": the step-wise importance sampling estimator.
    # - "simulation": run the environment in the background, but use this data
    #   for evaluation only and not for learning.
    "input_evaluation": ["is", "wis"],
    # Whether to run postprocess_trajectory() on trajectory fragments from
    # offline inputs. Postprocessing uses the *current* policy, not the
    # *behavior* policy, which is typically undesirable for on-policy algorithms.
    "postprocess_inputs": False,
    # If positive, input batches are shuffled via a sliding window buffer of
    # this many batches (use if the input data is not in random enough order;
    # input is delayed until the shuffle buffer is filled).
    "shuffle_buffer_size": 0,
    # Where experiences should be saved:
    # - None: don't save any experiences
    # - "logdir": save to the agent log dir
    # - a path/URI to a custom output directory (e.g., "s3://bucket/")
    # - a function that returns an rllib.offline.OutputWriter
    "output": None,
    # Arguments accessible from the IOContext for configuring custom output.
    "output_config": {},
    # What sample batch columns to LZ4-compress in the output data.
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size (in bytes) before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,

    # === Settings for Multi-Agent Environments ===
    "multiagent": {
        # Map of type MultiAgentPolicyConfigDict from policy ids to tuples of
        # (policy_cls, obs_space, act_space, config), defining the observation
        # and action spaces of the policies plus any extra config.
        "policies": {},
        # Keep this many policies in the "policy_map" (before writing
        # least-recently used ones to disk/S3).
        "policy_map_capacity": 100,
        # Where to store overflowing (least-recently used) policies: a
        # directory (str) or an S3 location. None = default output dir.
        "policy_map_cache": None,
        # Function mapping agent ids to policy ids.
        "policy_mapping_fn": None,
        # Which policies to update:
        # - None: all policies.
        # - An iterable of PolicyIDs that should be updated.
        # - A callable taking a PolicyID and a SampleBatch or MultiAgentBatch
        #   and returning a bool, so a policy can be trained only on certain
        #   data (e.g. when playing against a certain opponent).
        "policies_to_train": None,
        # Optional function to enhance the local agent observations with more
        # state (see rllib/evaluation/observation_function.py).
        "observation_fn": None,
        # replay_mode=lockstep: replay all agent transitions of a particular
        # timestep together in one batch, allowing differentiable shared
        # computations between the agents a policy controls at that timestep.
        # replay_mode=independent: replay transitions independently per policy.
        "replay_mode": "independent",
        # Which metric to use as the "batch size" when building a MultiAgentBatch:
        # env_steps: count each time the env is "stepped" (regardless of how
        #   many multi-agent actions/observations are involved).
        # agent_steps: count each individual agent step as one step.
        "count_steps_by": "env_steps",
    },

    # === Logger ===
    # Logger-specific configuration used inside the Logger.
    # Default None allows overwriting with nested dicts.
    "logger_config": None,

    # === API deprecations/simplifications/changes ===
    # Experimental: if True, TFPolicy handles more than one loss/optimizer
    # (return several loss terms from `loss_fn` and an equal number of
    # optimizers from `optimizer_fn`). Will default to True in the future.
    "_tf_policy_handles_more_than_one_loss": False,
    # Experimental: if True, no (observation) preprocessor is created and
    # observations arrive in the model as returned by the env.
    # Will default to True in the future.
    "_disable_preprocessor_api": False,
    # Experimental: if True, RLlib no longer flattens policy-computed actions
    # into a single tensor (for storage in SampleCollectors/output files etc.)
    # but leaves (possibly nested) actions as-is. This affects SampleCollectors,
    # models that take the previous action(s) as input, and algorithms reading
    # from offline files (incl. action information).
    "_disable_action_flattening": False,
    # Experimental: if True, the execution plan API is not used; instead, the
    # Trainer's `training_iteration` method is called as-is each iteration.
    "_disable_execution_plan_api": False,
    # If True, disable the environment pre-checking module.
    "disable_env_checking": False,

    # === Deprecated keys ===
    # Use the sync samples optimizer instead of the multi-GPU one (usually
    # slower; set automatically from now on).
    "simple_optimizer": DEPRECATED_VALUE,
    # Whether to write episode stats and videos to the agent log dir
    # (typically ~/ray_results).
    "monitor": DEPRECATED_VALUE,
    # Replaced by `evaluation_duration=10` and `evaluation_duration_unit=episodes`.
    "evaluation_num_episodes": DEPRECATED_VALUE,
    # Use `metrics_num_episodes_for_smoothing` instead.
    "metrics_smoothing_episodes": DEPRECATED_VALUE,
    # Use `min_[env|train]_timesteps_per_reporting` instead.
    "timesteps_per_iteration": 0,
    # Use `min_time_s_per_reporting` instead.
    "min_iter_time_s": DEPRECATED_VALUE,
    # Use `metrics_episode_collection_timeout_s` instead.
    "collect_metrics_timeout": DEPRECATED_VALUE,
}
```
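As a quick example of how these common options are used in practice (a sketch of my own; the particular values are arbitrary), they are simply merged into the config dict passed to the trainer:

```python
import ray.rllib.agents.ppo as ppo

config = ppo.DEFAULT_CONFIG.copy()
config.update({
    "framework": "torch",        # use PyTorch instead of the default "tf"
    "num_workers": 2,            # two rollout workers
    "evaluation_interval": 5,    # evaluate every 5 training iterations
    "evaluation_duration": 10,   # ... for 10 episodes each time
    "evaluation_config": {
        "explore": False,        # deterministic actions during evaluation
    },
})
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
```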
Retrieving the Trained Model / Policy
```python
import ray                                # core package
import ray.rllib.agents.ppo as ppo        # provides PPOTrainer
from ray.tune.logger import pretty_print  # pretty-prints result dicts

ray.shutdown()   # in case a Ray instance is already running
ray.init()

# start from the default PPO parameters
ppoconfig = ppo.DEFAULT_CONFIG.copy()

# override some of the defaults
ppoconfig["num_gpus"] = 0      # no GPU
ppoconfig["num_workers"] = 1   # a single rollout worker

# build the trainer on the Gym environment (custom environments: see the next post)
trainer = ppo.PPOTrainer(config=ppoconfig, env="CartPole-v0")

# load a previously saved checkpoint
trainer.restore("./checkpoints/cartpole25/checkpoint_000026/checkpoint-26")

##### option 1: use the trainer directly
trainer.compute_action(obs)           # obs: an observation from the environment

##### option 2: extract the policy from the trainer
policy = trainer.get_policy()
policy.compute_single_action(obs)     # compute the result for a single observation
```
I originally planned to construct a TFPolicy directly, but ran into problems doing so. The workaround is to build the trainer first and then extract the policy from it to compute results.
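For completeness, here is a minimal sketch (not part of the original post) of rolling the restored agent out for one episode, using the classic Gym API (`reset()` returns the observation, `step()` returns four values) that this RLlib version expects:

```python
import gym

env = gym.make("CartPole-v0")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = trainer.compute_action(obs)        # query the restored trainer
    obs, reward, done, info = env.step(action)  # step the environment
    total_reward += reward
print("episode reward:", total_reward)
```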
Policy Model Configuration
Fully connected layers, convolutional layers, RNNs, and so on can be configured via ModelConfigDict.
```python
MODEL_DEFAULTS: ModelConfigDict = {
    # Experimental: if True, try to use a native (tf.keras.Model or
    # torch.Module) default model instead of the built-in ModelV2 defaults.
    # Currently only works for 1) framework != torch AND 2) fully connected
    # and CNN default networks as well as auto-wrapped LSTM and attention nets.
    "_use_default_native_models": False,
    # Experimental: if True, no preprocessor is created (the user set
    # config._disable_preprocessor_api=True) and observations arrive in the
    # model as returned by the env.
    "_disable_preprocessor_api": False,
    # Experimental: if True, RLlib no longer flattens policy-computed actions
    # into a single tensor but leaves (possibly nested) actions as-is. This
    # affects SampleCollectors, models that take the previous action(s) as
    # input, and algorithms reading from offline files (incl. action information).
    "_disable_action_flattening": False,

    # === Built-in options ===
    # FullyConnectedNetwork (tf and torch): rllib.models.tf|torch.fcnet.py
    # Used if no custom model is specified and the input space is 1D.
    # Number (and sizes) of hidden layers.
    "fcnet_hiddens": [256, 256],
    # Activation function: "tanh", "relu", "swish" (or "silu"), "linear" (or None).
    "fcnet_activation": "tanh",

    # VisionNetwork (tf and torch): rllib.models.tf|torch.visionnet.py
    # Used if no custom model is specified and the input space is 2D.
    # Filter config: list of [out_channels, kernel, stride] per filter.
    # None = let RLlib try to find a default filter setup for the observation space.
    "conv_filters": None,
    # Activation function: "tanh", "relu", "swish" (or "silu"), "linear" (or None).
    "conv_activation": "relu",

    # Some default models support a final FC stack of n Dense layers with a
    # given activation:
    # - Complex observation spaces: image components go through VisionNets,
    #   flat Boxes are left as-is, Discrete are one-hot'd, then everything is
    #   concatenated and pushed through this final FC stack.
    # - VisionNets (CNNs) may have additional Dense layers after the CNN stack.
    # - FullyConnectedNetworks also get this additional FC stack
    #   (which is why it is empty by default).
    "post_fcnet_hiddens": [],
    "post_fcnet_activation": "relu",
    # For DiagGaussian action distributions, make the second half of the model
    # outputs floating bias variables instead of state-dependent. Only has an
    # effect when using the default fully connected net.
    "free_log_std": False,
    # Whether to skip the final linear layer used to resize the hidden layer
    # outputs to size `num_outputs`. If True, the last hidden layer should
    # already match num_outputs.
    "no_final_linear": False,
    # Whether layers should be shared for the value function.
    "vf_share_layers": True,

    # == LSTM ==
    # Whether to wrap the model with an LSTM.
    "use_lstm": False,
    # Max seq len for training the LSTM, defaults to 20.
    "max_seq_len": 20,
    # Size of the LSTM cell.
    "lstm_cell_size": 256,
    # Whether to feed a_{t-1} to the LSTM (one-hot encoded if discrete).
    "lstm_use_prev_action": False,
    # Whether to feed r_{t-1} to the LSTM.
    "lstm_use_prev_reward": False,
    # Whether the LSTM is time-major (TxBx..) or batch-major (BxTx..).
    "_time_major": False,

    # == Attention Nets (experimental: torch version is untested) ==
    # Whether to use a GTrXL ("Gru transformer XL"; attention net) as the
    # wrapper model around the default model.
    "use_attention": False,
    # The number of transformer units within GTrXL; a transformer unit
    # consists of a) a MultiHeadAttention module and b) a position-wise MLP.
    "attention_num_transformer_units": 1,
    # The input and output size of each transformer unit.
    "attention_dim": 64,
    # The number of attention heads within the MultiHeadAttention units.
    "attention_num_heads": 1,
    # The dim of a single head (within the MultiHeadAttention units).
    "attention_head_dim": 32,
    # The memory sizes for inference and training.
    "attention_memory_inference": 50,
    "attention_memory_training": 50,
    # The output dim of the position-wise MLP.
    "attention_position_wise_mlp_dim": 32,
    # The initial bias values for the 2 GRU gates within a transformer unit.
    "attention_init_gru_gate_bias": 2.0,
    # Whether to feed a_{t-n:t-1} to GTrXL (one-hot encoded if discrete).
    "attention_use_n_prev_actions": 0,
    # Whether to feed r_{t-n:t-1} to GTrXL.
    "attention_use_n_prev_rewards": 0,

    # == Atari ==
    # Set to True to enable 4x frame stacking.
    "framestack": True,
    # Final resized frame dimension.
    "dim": 84,
    # (deprecated) Converts the Atari frame to a 1-channel grayscale image.
    "grayscale": False,
    # (deprecated) Changes the frame to the range [-1, 1] if True.
    "zero_mean": True,

    # === Options for custom models ===
    # Name of a custom model to use.
    "custom_model": None,
    # Extra options to pass to the custom classes. These are available to the
    # Model's constructor in the model_config field and are also attempted to
    # be passed as **kwargs to ModelV2 models (for an example, see
    # rllib/models/[tf|torch]/attention_net.py).
    "custom_model_config": {},
    # Name of a custom action distribution to use.
    "custom_action_dist": None,
    # Custom preprocessors are deprecated; use a wrapper class around your
    # environment instead to preprocess observations.
    "custom_preprocessor": None,

    # Deprecated keys:
    # Use `lstm_use_prev_action` or `lstm_use_prev_reward` instead.
    "lstm_use_prev_action_reward": DEPRECATED_VALUE,
}
```

In the trainer config, these model settings are passed through the `model` key:

```python
algo_config = {
    # All model-related settings go into this sub-dict.
    "model": {
        # By default, the MODEL_DEFAULTS dict above will be used.
        # Change individual keys in that dict by overriding them, e.g.:
        "fcnet_hiddens": [512, 512, 512],
        "fcnet_activation": "relu",
    },
    # ... other Trainer config keys, e.g. "lr" ...
    "lr": 0.00001,
}
```
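For example, a sketch of my own (assuming an image observation of roughly 84x84 pixels) that configures a custom conv stack plus an LSTM wrapper through these keys:

```python
model_override = {
    "model": {
        # one [out_channels, kernel, stride] entry per conv layer
        "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [256, [11, 11], 1]],
        "conv_activation": "relu",
        # dense layers applied after the conv stack
        "post_fcnet_hiddens": [256],
        "post_fcnet_activation": "relu",
        # wrap the whole model with an LSTM
        "use_lstm": True,
        "lstm_cell_size": 256,
        "max_seq_len": 20,
    },
}
```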
Reinforcement Learning Directly with Tune
A Tune run combines: a base algorithm + algorithm parameters + an environment definition + stopping criteria.
```python
import ray
import ray.tune as tune

algo_config = {
    # --- environment ---
    "env": "CartPole-v0",   # a custom "my_env" must be registered first (see the sketch after this block)
    "env_config": {},       # arguments for the env constructor
    "log_level": "INFO",

    # --- model ---
    "model": {
        # CNN
        "conv_filters": [],         # [[out_channels, kernel, stride]], e.g. [[16, [4, 4], 2], [128, [6, 6], 3]]
        "conv_activation": "relu",
        # fully connected layers
        "fcnet_hiddens": [256, 256],
        "fcnet_activation": "tanh",
        # post-FC net: useful when the observation is a composite type such as matrix + vector.
        # The matrix goes through the CNN, is concatenated with the vector, and the result is
        # pushed through these final fully connected layers (set fcnet to None in that case).
        "post_fcnet_hiddens": [],           # e.g. [256, 256]
        "post_fcnet_activation": "linear",  # e.g. "relu"
        # whether the value and policy networks share part of the model
        "vf_share_layers": True,
        # LSTM settings
        "use_lstm": False,              # whether to wrap the model with an LSTM
        "max_seq_len": 20,              # max sequence length for training the LSTM
        "lstm_cell_size": 256,          # size of the LSTM cell
        "lstm_use_prev_action": False,  # feed a_{t-1} to the LSTM (one-hot encoded if discrete)
        "lstm_use_prev_reward": False,  # feed r_{t-1} to the LSTM
        "_time_major": False,           # time-major (TxBx..) vs batch-major (BxTx..)
        # preprocessor, attention, action options etc. can also be set here (see the listing above)
    },

    # --- learning parameters ---
    "lr": tune.grid_search([0.0001, 0.005]),  # run separate trials with different learning rates
    "gamma": 0.99,
    # unspecified parameters fall back to their default values

    # --- train batch ---
    "rollout_fragment_length": 200,
    "train_batch_size": 400,
    "batch_mode": "truncate_episodes",  # or "complete_episodes"
}

analysis = tune.run(
    "PPO",
    config=algo_config,
    stop={
        # training stops as soon as either condition is met; the keys refer to
        # fields of the result dict returned by trainer.train()
        "episode_reward_mean": 100,
        "timesteps_total": 4000,
    },
)

print("best config: ",
      analysis.get_best_config(metric="episode_reward_mean", mode="max"))
```
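The "my_env" mentioned in the comments above has to be registered before `tune.run` can find it. A minimal sketch (MyEnv is a placeholder for your own gym.Env subclass; custom environments are covered in detail in the next post):

```python
import ray.tune as tune

def env_creator(env_config):
    # env_config is the dict passed as algo_config["env_config"]
    return MyEnv(env_config)   # MyEnv: placeholder for a gym.Env subclass

tune.register_env("my_env", env_creator)
# afterwards: algo_config["env"] = "my_env"
```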
How to build your own training procedure inside Tune will be covered in a later post.
References
- 强化学习框架RLlib教程 (an RLlib framework tutorial, in Chinese)
