Policies API

class layeredrl.policies.Policy(action_space: Box, device=device(type='cpu'))

Bases: ABC, Module

A policy that maps the environment observation, the level input, and the level state to actions.

__init__(action_space: Box, device=device(type='cpu'))

Initialize the policy.

Parameters:

action_space – The action space the policy output has to lie in.
device – The device to use.

get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, deterministic: bool) → Tensor

Get an action for the given observation.

Parameters:

mapped_env_obs – The observation from the environment after the env_obs_map has been applied.
level_input – The input to this level, i.e., the action from the level above.
level_state – The state of the level.
deterministic – Whether to return a deterministic action (as opposed to a stochastic one).

Returns:

The action, and a Batch with info about the action (logits etc.)
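The effect of the deterministic flag can be illustrated with a minimal sketch. It assumes a diagonal Gaussian action distribution, where the deterministic action is the mean and the stochastic action is a sample; this is an assumption for illustration, not the library's implementation (a discrete policy would typically take an argmax instead). The helper select_action is hypothetical, and plain Python lists stand in for tensors.

```python
import random
from typing import List, Sequence


def select_action(
    mean: Sequence[float],
    std: Sequence[float],
    deterministic: bool,
    rng: random.Random,
) -> List[float]:
    """Hypothetical helper: return the mean if deterministic, else a Gaussian sample."""
    if deterministic:
        # Deterministic mode: no randomness, just the mode of the distribution.
        return list(mean)
    # Stochastic mode: draw each action dimension independently.
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
mean, std = [0.2, -0.4], [0.1, 0.1]
greedy = select_action(mean, std, deterministic=True, rng=rng)
sampled = select_action(mean, std, deterministic=False, rng=rng)
```

Deterministic actions are usually preferred for evaluation, stochastic ones for exploration during training.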

get_log_prob(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor) → Tensor

Get the log probability of the given action under the policy.

Parameters:

mapped_env_obs – The observation from the environment after the env_obs_map has been applied.
level_input – The input to this level, i.e., the action from the level above.
level_state – The state of the level.
action – The action.

Returns:

The log probability of the action under the policy.

abstractmethod reset() → None

Reset the policy, e.g., at the beginning of an episode.
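Because reset() is abstract, every concrete policy has to implement it. The sketch below shows the general pattern with a stand-in base class; PolicySketch and CountingPolicy are hypothetical names and are not part of layeredrl, and the real Policy additionally derives from Module.

```python
from abc import ABC, abstractmethod


class PolicySketch(ABC):
    """Stand-in for an abstract policy base class (NOT layeredrl.policies.Policy)."""

    @abstractmethod
    def reset(self) -> None:
        """Clear any per-episode state."""


class CountingPolicy(PolicySketch):
    """Hypothetical subclass that tracks the number of steps in the current episode."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1

    def reset(self) -> None:
        # Per-episode state goes back to its initial value.
        self.steps = 0


policy = CountingPolicy()
policy.step()
policy.step()
policy.reset()

# The abstract base class itself cannot be instantiated.
try:
    PolicySketch()
    instantiated = True
except TypeError:
    instantiated = False
```

A policy without per-episode state can implement reset() as a no-op.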

transform_action(action: Tensor) → Tensor

Transform the raw action to the action space of the environment.

Parameters:

action – The raw action.

Returns:

The transformed action (now in the action space of the environment).

untransform_action(action: Tensor) → Tensor

Undo the transformation of the action.

Parameters:

action – The action in the action space of the environment.

Returns:

The action in the raw action space ([-1, 1]^n for continuous action spaces).
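The relationship between the two methods can be sketched as a pair of inverse mappings. The example assumes the common affine rescaling between [-1, 1]^n and a box [low, high]; the source does not state which transformation the library applies, so treat the formulas as an illustration. The functions take low and high explicitly (the real methods take only the action), and plain Python lists stand in for tensors.

```python
from typing import List, Sequence


def transform_action(
    raw: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Map a raw action in [-1, 1]^n to the box [low, high] (assumed affine rescaling)."""
    return [l + 0.5 * (r + 1.0) * (h - l) for r, l, h in zip(raw, low, high)]


def untransform_action(
    action: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Inverse mapping: from the box [low, high] back to [-1, 1]^n."""
    return [2.0 * (a - l) / (h - l) - 1.0 for a, l, h in zip(action, low, high)]


low, high = [0.0, -2.0], [10.0, 2.0]
raw = [-1.0, 0.5]
env_action = transform_action(raw, low, high)
recovered = untransform_action(env_action, low, high)
```

Under this assumption, applying untransform_action to the output of transform_action returns the original raw action.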

class layeredrl.policies.UniformPolicy(action_space: Box, device=device(type='cpu'))

Bases: Policy

A policy that randomly samples actions from the action space.

reset() → None

Reset the policy, e.g., at the beginning of an episode.
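Sampling uniformly from a bounded box can be sketched as follows. This shows the general idea only and is not the library's code; sample_uniform_action is a hypothetical helper, and the bounds low and high stand in for the limits of the Box action space.

```python
import random
from typing import List, Sequence


def sample_uniform_action(
    low: Sequence[float], high: Sequence[float], rng: random.Random
) -> List[float]:
    """Hypothetical helper: draw each action dimension uniformly from [low_i, high_i]."""
    return [rng.uniform(l, h) for l, h in zip(low, high)]


rng = random.Random(0)
low, high = [-1.0, 0.0], [1.0, 5.0]
actions = [sample_uniform_action(low, high, rng) for _ in range(1000)]
```

Such a policy ignores the observation entirely, which makes it useful as an exploration baseline or for filling a replay buffer before learning starts.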

class layeredrl.policies.TianshouPolicy(action_space: Space, ts_policy: BasePolicy, device=device(type='cpu'))

Bases: Policy

Wrapper around a tianshou policy.

Note: Only tested with SACPolicy and DQNPolicy at the moment.

__init__(action_space: Space, ts_policy: BasePolicy, device=device(type='cpu'))

Initialize the policy.

Parameters:

action_space – The action space of the environment the policy acts in.
ts_policy – The tianshou policy.
device – The device to use.

get_log_prob(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor, std: float | None = None) → Tensor

Get the log probability of the given action under the policy.

Parameters:

mapped_env_obs – The observation from the environment after the env_obs_map has been applied.
level_input – The input to this level, i.e., the action from the level above.
level_state – The state of the level.
action – The action.
std – The standard deviation of the action distribution. Overwrites the std from the policy if given.

Returns:

The log probability of the action under the policy.
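How an std override changes the result can be illustrated with a diagonal Gaussian. The sketch assumes the action distribution is an unsquashed Gaussian and that a given std replaces the policy's own standard deviation in every dimension; the wrapped tianshou policy may behave differently (for example, SAC applies a tanh squashing correction that is omitted here). gaussian_log_prob is a hypothetical helper, not a library function.

```python
import math
from typing import Optional, Sequence


def gaussian_log_prob(
    action: Sequence[float],
    mean: Sequence[float],
    policy_std: Sequence[float],
    std: Optional[float] = None,
) -> float:
    """Log density of a diagonal Gaussian; a given `std` overrides `policy_std`."""
    stds = [std] * len(mean) if std is not None else list(policy_std)
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, stds)
    )


mean, policy_std = [0.0, 1.0], [0.5, 0.5]
action = [0.0, 1.0]
lp_policy = gaussian_log_prob(action, mean, policy_std)
lp_override = gaussian_log_prob(action, mean, policy_std, std=1.0)
```

Evaluated at the mean, the wider override (std=1.0) yields a lower log probability than the narrower policy std (0.5), since the density is spread over a larger region.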

get_value(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict | None, action: Tensor) → Tensor

Get the value of the given observation and action as predicted by the critic.

Parameters:

mapped_env_obs – The observation from the environment after the env_obs_map has been applied.
level_input – The input to this level, i.e., the action from the level above.
level_state – The state of the level.
action – The action.

Returns:

The value of the given observation and action as predicted by the critic.

reset() → None

Reset the policy, e.g., at the beginning of an episode.

update(sample_size: int, buffer: ReplayBuffer) → None

Update the policy.

Parameters:

sample_size – The number of samples in one batch.
buffer – The replay buffer.