Policies API

class layeredrl.policies.Policy(action_space: Box, device=device(type='cpu'))

Bases: ABC, Module

Abstract base class for policies that map the environment observation, the level input, and the level state to actions.

__init__(action_space: Box, device=device(type='cpu'))

Initialize the policy.

Parameters:
  • action_space – The action space in which the policy's output must lie.

  • device – The device to use.

get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, deterministic: bool) -> Tensor

Get an action for the given observation.

Parameters:
  • mapped_env_obs – The observation from the environment after the env_obs_map has been applied.

  • level_input – The input to this level, i.e., the action from the level above.

  • level_state – The state of the level.

  • deterministic – Whether to return a deterministic action (as opposed to a stochastic one).

Returns:

The action, together with a Batch holding additional information about the action (e.g., logits).
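As a rough illustration of the deterministic flag, the sketch below assumes a diagonal Gaussian policy head, which this reference does not specify. The helper select_action and its mean and std arguments are hypothetical and not part of layeredrl; plain Python lists stand in for tensors to keep the example dependency-free.

```python
import random


def select_action(mean, std, deterministic, rng=random):
    """Hypothetical helper: pick an action from a diagonal Gaussian.

    Assumption: the policy predicts a per-dimension mean and std.
    """
    if deterministic:
        # Deterministic mode: return the mode of the distribution (the mean).
        return list(mean)
    # Stochastic mode: draw one independent Gaussian sample per dimension.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]
```

Passing a seeded random.Random instance as rng makes the stochastic branch reproducible, which can be handy when comparing rollouts.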

get_log_prob(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor) -> Tensor

Get the log probability of the given action under the policy.

Parameters:
  • mapped_env_obs – The observation from the environment after the env_obs_map has been applied.

  • level_input – The input to this level, i.e., the action from the level above.

  • level_state – The state of the level.

  • action – The action.

Returns:

The log probability of the action under the policy.

abstractmethod reset() -> None

Reset the policy, e.g., at the beginning of an episode.

transform_action(action: Tensor) -> Tensor

Transform the raw action to the action space of the environment.

Parameters:
  • action – The raw action.

Returns:

The transformed action (now in the action space of the environment).

untransform_action(action: Tensor) -> Tensor

Undo the transformation of the action.

Parameters:
  • action – The action in the action space of the environment.

Returns:

The action in the raw action space ([-1, 1]^n for continuous action spaces).
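One plausible reading of transform_action and untransform_action for a continuous Box space is an affine rescaling between [-1, 1]^n and the bounds of the space. The sketch below illustrates that reading only; the functions scale_to_bounds and scale_to_unit are hypothetical names, the affine form is an assumption rather than the documented implementation, and plain lists stand in for tensors.

```python
def scale_to_bounds(raw, low, high):
    """Hypothetical: map each raw component from [-1, 1] to [low_i, high_i]."""
    return [l + 0.5 * (r + 1.0) * (h - l) for r, l, h in zip(raw, low, high)]


def scale_to_unit(action, low, high):
    """Hypothetical inverse: map each component from [low_i, high_i] to [-1, 1]."""
    return [2.0 * (a - l) / (h - l) - 1.0 for a, l, h in zip(action, low, high)]


# Usage: a raw action in [-1, 1]^3 mapped into a Box with bounds [0, 10]^3.
low, high = [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]
env_action = scale_to_bounds([-1.0, 0.0, 1.0], low, high)  # [0.0, 5.0, 10.0]
raw_again = scale_to_unit(env_action, low, high)           # [-1.0, 0.0, 1.0]
```

Under this assumption the two functions are exact inverses of each other whenever every dimension has high_i > low_i.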

class layeredrl.policies.UniformPolicy(action_space: Box, device=device(type='cpu'))

Bases: Policy

A policy that randomly samples actions from the action space.

reset() -> None

Reset the policy, e.g., at the beginning of an episode.
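The random sampling that UniformPolicy is described as performing can be pictured with the following sketch, which draws each action dimension independently and uniformly between its lower and upper bound. The function sample_uniform_action is a hypothetical stand-in, not the library's code, and it assumes a bounded Box described by two lists.

```python
import random


def sample_uniform_action(low, high, rng=random):
    """Hypothetical: sample one action uniformly from the box [low, high]."""
    # Each dimension is drawn independently between its own bounds.
    return [rng.uniform(l, h) for l, h in zip(low, high)]


# Usage: a two-dimensional action space with different bounds per dimension.
action = sample_uniform_action([-1.0, 0.0], [1.0, 5.0], rng=random.Random(42))
```

Such a policy has no internal state to clear, which is consistent with reset being a trivial operation for it, although the reference does not say so explicitly.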

class layeredrl.policies.TianshouPolicy(action_space: Space, ts_policy: BasePolicy, device=device(type='cpu'))

Bases: Policy

Wrapper around a tianshou policy.

Note: Currently tested only with SACPolicy and DQNPolicy.

__init__(action_space: Space, ts_policy: BasePolicy, device=device(type='cpu'))

Initialize the policy.

Parameters:
  • action_space – The action space of the environment the policy acts in.

  • ts_policy – The tianshou policy.

  • device – The device to use.

get_log_prob(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor, std: float | None = None) -> Tensor

Get the log probability of the given action under the policy.

Parameters:
  • mapped_env_obs – The observation from the environment after the env_obs_map has been applied.

  • level_input – The input to this level, i.e., the action from the level above.

  • level_state – The state of the level.

  • action – The action.

  • std – The standard deviation of the action distribution. If given, it overrides the standard deviation predicted by the policy.

Returns:

The log probability of the action under the policy.
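To make the effect of the optional std argument concrete, the sketch below evaluates the log density of a diagonal Gaussian and lets a scalar std replace the per-dimension standard deviation. It is an illustration under assumptions: the function gaussian_log_prob and its mean and policy_std arguments are hypothetical, and the sketch omits any correction that a tanh-squashed policy such as SAC would additionally require.

```python
import math


def gaussian_log_prob(action, mean, policy_std, std=None):
    """Hypothetical: log density of a diagonal Gaussian at `action`.

    If `std` is given, it replaces every entry of `policy_std`.
    """
    log_prob = 0.0
    for a, m, s_policy in zip(action, mean, policy_std):
        s = std if std is not None else s_policy
        # Per-dimension Gaussian log density; dimensions are independent,
        # so the joint log probability is the sum over dimensions.
        log_prob += (
            -0.5 * ((a - m) / s) ** 2
            - math.log(s)
            - 0.5 * math.log(2.0 * math.pi)
        )
    return log_prob
```

With the action equal to the mean, a larger std lowers the log probability by log(std) per dimension, which is the behaviour one would expect from a wider distribution.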

get_value(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor) -> Tensor

Get the value that the critic predicts for the given observation and action.

Parameters:
  • mapped_env_obs – The observation from the environment after the env_obs_map has been applied.

  • level_input – The input to this level, i.e., the action from the level above.

  • level_state – The state of the level.

  • action – The action.

Returns:

The value that the critic predicts for the given observation and action.

reset() -> None

Reset the policy, e.g., at the beginning of an episode.

update(sample_size: int, buffer: ReplayBuffer) -> None

Update the policy.

Parameters:
  • sample_size – The number of samples in one batch.

  • buffer – The replay buffer.
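This reference does not describe how update draws its batch. A common pattern in off-policy learning, shown here purely as an assumption, is to sample sample_size transitions uniformly at random (with replacement) from the buffer and run one learning step on them. The function sample_batch and the list-based buffer are hypothetical stand-ins for the real ReplayBuffer.

```python
import random


def sample_batch(buffer, sample_size, rng=random):
    """Hypothetical: draw `sample_size` transitions uniformly with replacement."""
    indices = [rng.randrange(len(buffer)) for _ in range(sample_size)]
    return [buffer[i] for i in indices]


# Usage: a toy buffer of (obs, action, reward) tuples.
toy_buffer = [(0.0, 1, 0.5), (0.1, 0, -0.2), (0.2, 1, 1.0)]
batch = sample_batch(toy_buffer, sample_size=4, rng=random.Random(0))
```

Because the sampling is with replacement in this sketch, sample_size may exceed the number of stored transitions.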