Levels API
- class layeredrl.levels.Level(device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 1000)[source]
Bases: ABC
- __init__(device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 1000)[source]
Create the level of the hierarchy.
- Parameters:
device – The device to use.
writer – The TensorBoard writer to use for logging. If None, no logging is done.
log_interval – The interval at which to log statistics, measured in total (summed) environment steps.
- add_transitions(transitions: Batch) None[source]
Add a batch of transitions to the level.
- Parameters:
transitions – The batch of transitions to add. The first dimension corresponds to the batch dimension (e.g. environments). The following keys must be included:
obs: The observation.
act: The action.
rew: The reward.
terminated: Whether the episode terminated.
truncated: Whether the episode was truncated.
obs_next: The next observation.
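The key requirement above can be checked before a batch is handed to add_transitions. The following is a minimal sketch, assuming the batch behaves like a mapping; the helper missing_transition_keys and the plain dict standing in for a Batch are illustrative and not part of layeredrl.

```python
# Keys that add_transitions expects, as listed in the parameter description.
REQUIRED_KEYS = ("obs", "act", "rew", "terminated", "truncated", "obs_next")


def missing_transition_keys(transitions):
    """Return the required keys that are absent from a mapping-like batch."""
    return [key for key in REQUIRED_KEYS if key not in transitions]


# A plain dict stands in for the Batch here (hypothetical data for two
# environment instances; the first dimension is the batch dimension).
transitions = {
    "obs": [[0.0, 1.0], [0.5, 0.5]],
    "act": [[0.1], [-0.2]],
    "rew": [1.0, 0.0],
    "terminated": [False, False],
    "truncated": [False, True],
    "obs_next": [[0.1, 1.0], [0.4, 0.5]],
}
missing = missing_transition_keys(transitions)
```

An empty result means all required keys are present; any other result names the keys that still have to be filled in.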
- abstractmethod get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tuple[Tensor, Dict][source]
Get an action for the given observation.
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
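The role of active_instances can be illustrated with a boolean mask over the batch dimension. This is a sketch with plain lists in place of tensors; select_active is a hypothetical helper, not a layeredrl function, and it only shows how a caller might restrict a full batch to the instances in which the level is in control.

```python
def select_active(batch, active_instances):
    """Keep only the rows whose mask entry is True."""
    if len(batch) != len(active_instances):
        raise ValueError("mask and batch must have the same length")
    return [row for row, active in zip(batch, active_instances) if active]


# Four environment instances; the level is in control in instances 0 and 2.
env_obs = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
active_instances = [True, False, True, False]

active_obs = select_active(env_obs, active_instances)
# One action per active instance is returned, so the action batch has 2 rows.
actions = [[0.0] for _ in active_obs]
```

With tensors, the same selection is typically written as boolean-mask indexing along the first dimension.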
- get_action_shape_and_type() Tuple[Size, dtype][source]
Get the shape and type of the action.
- Returns:
The shape and type of the action.
- get_copy() Level[source]
Get a copy of the level.
Resets the copied level.
No parameters are copied, only the state of the level. The copy of the level can be used for testing rollouts without influencing the state of the original level, for example.
- Returns:
A copy of the level.
- abstractmethod get_input_space() Space[source]
Get the input space of the level.
Input space denotes everything the level needs except for the environment observation, e.g. a skill vector.
- Returns:
The input space (or None if no input is required).
- get_level_state(mapped_env_obs: Tensor, level_input: Tensor, active_instances: Tensor) Dict[source]
Compute the state of the level.
This contains all information relevant for determining the level action except for the environment observation and the level input, which are handled separately.
- Parameters:
mapped_env_obs – The mapped environment observation of the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
active_instances – The instances for which to return the level state.
- Returns:
The level state.
- get_level_state_dims() Dict[str, int][source]
Get the dimensions of the level state.
Implement this in derived classes.
- Returns:
A dictionary with the dimensions of each item in the level state.
- get_reward(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict, action: Tensor, next_mapped_env_obs: Tensor, next_level_state: Dict, terminated: bool, cum_env_reward: Tensor, elapsed_env_steps: Tensor) Tuple[Tensor, Dict][source]
Get the reward for the given batch of transitions.
Note that everything has a batch dimension. Also note that there is no next_level_input because the level input does not change while the level is in control.
- Parameters:
mapped_env_obs – The mapped environment observation.
level_input – The input to this level.
level_state – The state of the level.
action – The action this level took.
next_mapped_env_obs – The next mapped environment observation.
next_level_state – The next state of the level.
terminated – Whether the episode terminated.
cum_env_reward – The sum of the environment rewards that were received during the transition.
elapsed_env_steps – The number of environment steps that elapsed during the transition.
- Returns:
The reward and the reward info dict.
- initialize(env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor: Predictor, env_obs_map: Callable[[Tensor], Tensor] | None = None, mapped_env_obs_shape: Tuple | None = None, keep_params: bool = False) None[source]
Construct the level.
- Parameters:
env_obs_space – The observation space of the environment.
action_space – The action space of the level. If this is the lowest level, this is the action space of the environment. Otherwise, it is the input space of the level below this one.
n_env_instances – The number of environment instances.
parent_predictor – The predictor of the parent level (or None if there is no parent or the parent does not have a predictor).
env_obs_map – A map that is applied to the environment observation. This can be used to implement information hiding and for moving a trained level from one environment to another with a different observation space. If None, the identity map is used.
mapped_env_obs_shape – Shape of the output of env_obs_map. If None, the shape of the environment observation space is used.
keep_params – Whether to keep the parameters of the level (e.g. the policy) when initializing. If False, the parameters are reset.
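As an illustration of information hiding with env_obs_map, the sketch below builds a map that keeps only selected observation components. It uses nested lists instead of tensors so that it runs without PyTorch; make_index_map is a hypothetical helper, and the choice of components is an assumption made for the example.

```python
def make_index_map(keep_indices):
    """Build a map that keeps only the given observation components."""

    def env_obs_map(env_obs):
        # env_obs has a batch dimension: one row per environment instance.
        return [[row[i] for i in keep_indices] for row in env_obs]

    return env_obs_map


# Hide everything but the first two components (e.g. a position).
env_obs_map = make_index_map([0, 1])
# Shape of one mapped observation, without the batch dimension.
mapped_env_obs_shape = (2,)

mapped = env_obs_map([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
```

Because the map changes the observation size, mapped_env_obs_shape would have to be passed alongside it rather than left at None.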
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process the transition and check whether the level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
- register_env_reward(env_rew: Tensor) None[source]
Register the environment reward with the level.
- Parameters:
env_rew – The environment reward.
- sample_from_action_space(batch_size: int) Tensor[source]
Sample from the action space.
- Parameters:
batch_size – The batch size.
- Returns:
The sampled actions.
- set_n_env_instances(n_env_instances: int) None[source]
Set the number of environment instances.
- Parameters:
n_env_instances – The number of environment instances.
- class layeredrl.levels.RandomLevel(device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 1000)[source]
Bases: Level
A level that samples random actions.
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tuple[Tensor, Dict][source]
Get a random action.
Note that only the action for the active instances is returned.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- class layeredrl.levels.TianshouLevel(tianshou_config: Dict[str, Any], buffer_size: int | None = None, batch_size: int = 64, update_interval: int = 1, n_updates: int = 1, reward_calc_interval: int = 1, **kwargs)[source]
Bases: Level
- __init__(tianshou_config: Dict[str, Any], buffer_size: int | None = None, batch_size: int = 64, update_interval: int = 1, n_updates: int = 1, reward_calc_interval: int = 1, **kwargs)[source]
A level which uses a Tianshou algorithm to learn a policy.
- Parameters:
tianshou_config –
The config containing all parameters of the Tianshou objects. Has to contain the following keys:
n_critics: The number of critic neural networks.
nets: The configs of the neural networks.
    actor_net: The config of the actor neural network.
    critic_net: The config of the critic neural network.
optims: The config of the optimizers.
    actor_optim: The config of the actor optimizer.
    critic_optim: The config of the critic optimizer.
actor: The config of the actor.
critic: The config of the critic.
policy: The config of the policy. In Tianshou policy refers to the whole algorithm, not only to the policy itself.
Note that Hydra’s instantiate function is used to create the objects from the configs. Hence, the configs should specify the full class path of the objects to instantiate. For example, to create an Adam optimizer, the actor_optim item in the config should contain:
    _target_: torch.optim.Adam
    lr: 0.001
buffer_size – The size of the (replay) buffer. If None, no buffer is created.
batch_size – The number of samples to draw from the buffer for one training update.
update_interval – The interval in which the level/the policy is updated.
n_updates – The number of updates per interval.
reward_calc_interval – The interval in which the reward is calculated. Setting this to something larger than 1 can be useful for performance reasons (if the reward is computed later during sampling).
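The _target_ convention used in tianshou_config can be illustrated without Hydra. The function below is a deliberately simplified stand-in for hydra.utils.instantiate that handles flat configs only; the real function does considerably more (for example nested configs). The name instantiate_sketch and the standard-library target are assumptions made so that the example runs without Hydra or PyTorch.

```python
import importlib


def instantiate_sketch(config):
    """Simplified stand-in for Hydra's instantiate (flat configs only)."""
    kwargs = dict(config)
    # Split "package.module.ClassName" into the module path and class name.
    module_path, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    # All remaining items are passed on as keyword arguments.
    return cls(**kwargs)


# In a level config the target would be, e.g., torch.optim.Adam with lr: 0.001.
# A standard-library class is used here instead to keep the sketch runnable.
config = {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 1000}
obj = instantiate_sketch(config)
```

The point of the convention is that the config alone determines which class is built, so swapping an optimizer or network only requires changing the full class path in _target_.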
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation.
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension).
- get_input_space() Space[source]
Get the input space of the level.
Input space denotes everything the level needs except for the environment observation, e.g. a skill vector.
- Returns:
The input space (or None if no input is required).
- initialize(env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor: Predictor, env_obs_map: Callable[[Tensor], Tensor] | None = None, mapped_env_obs_shape: int | None = None, keep_params: bool = False) None[source]
Construct the level.
- Parameters:
env_obs_space – The observation space of the environment.
action_space – The action space of the level. If this is the lowest level, this is the action space of the environment. Otherwise, it is the input space of the level below this one.
n_env_instances – The number of environment instances.
parent_predictor – The predictor of the parent level (or None if there is no parent or the parent does not have a predictor).
env_obs_map – A map that is applied to the environment observation. This can be used to implement information hiding and for moving a trained level from one environment to another with a different observation space. If None, the identity map is used.
mapped_env_obs_shape – Shape of the output of env_obs_map. If None, the shape of the environment observation space is used.
keep_params – Whether to keep the parameters of the level (e.g. the policy) when initializing. If False, the parameters are reset.
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process the transition and check whether the level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
- class layeredrl.levels.DADSLevel(skill_space_dim: int, discrete_skills: bool = False, control_interval: int = 1, termination_probability: float = 0.0, n_skill_samples: int = 100, reward_clipping_low: float | Callable = -50.0, reward_clipping_high: float | Callable = 50.0, reward_offset: float = 1.0, warm_up_steps: int = 2000, reward_scale: float = 1.0, bootstrap: bool = False, log_interval: int = 1000, fixed_std: float | None = 1.0, scale_rew_with_std: bool = False, **kwargs)[source]
Bases: TDMPC2Level
A level that learns skills using DADS.
For the original DADS paper, see https://arxiv.org/abs/1907.01657
While the original DADS paper uses SAC, this implementation uses TD-MPC2 to learn the skills.
- __init__(skill_space_dim: int, discrete_skills: bool = False, control_interval: int = 1, termination_probability: float = 0.0, n_skill_samples: int = 100, reward_clipping_low: float | Callable = -50.0, reward_clipping_high: float | Callable = 50.0, reward_offset: float = 1.0, warm_up_steps: int = 2000, reward_scale: float = 1.0, bootstrap: bool = False, log_interval: int = 1000, fixed_std: float | None = 1.0, scale_rew_with_std: bool = False, **kwargs) None[source]
Initialize the level.
- Parameters:
skill_space_dim – The dimension of the skill space (or the number of skills in case of discrete skills).
discrete_skills – Whether the skills are discrete or continuous. Only continuous skills are supported for now.
control_interval – How long to stay in control before control is returned to the level above.
termination_probability – The probability of terminating the skill after a transition before the control interval has elapsed.
n_skill_samples – The number of skill samples to draw from the skill prior for the ‘denominator’ of the DADS reward.
reward_clipping_low – The value below which to clip the reward.
reward_clipping_high – The value above which to clip the reward.
reward_offset – Offset to add to the reward after standardization. This can be used to make the reward more positive in environments where the agent can terminate the episode early but is not supposed to do so.
warm_up_steps – The number of steps to collect before learning starts.
reward_scale – Factor to multiply onto rewards.
bootstrap – Whether to bootstrap when the skill vector does not change in a transition. This might be necessary for problems that do not allow differentiation in the transitions within one time step.
log_interval – The interval in which to log statistics.
fixed_std – If not None, the standard deviation of the skill dynamics model is set to this value when calculating the intrinsic reward. The model itself is not changed. This allows for tuning the length scale of the mutual ‘repulsion’ of the skills.
scale_rew_with_std – Whether to scale the reward with the squared standard deviation of the model. This is useful for approximately fixing the reward scale when tuning fixed_std and not messing up the exploration bonuses.
**kwargs – Keyword arguments for the TianshouLevel.
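The roles of n_skill_samples and fixed_std are easier to see in the form of the DADS intrinsic reward from the paper linked above: the log-likelihood of the observed next state under the skill that was executed, minus the log of the average likelihood under skills sampled from the prior (the ‘denominator’). The sketch below is a toy version under stated assumptions: toy_mean is a hypothetical skill dynamics model, the density is an isotropic Gaussian with the fixed standard deviation, and the clipping, offset and scaling controlled by the other parameters are not reproduced.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log density of an isotropic Gaussian with standard deviation std."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    dim = len(x)
    return -0.5 * sq_dist / std**2 - dim * math.log(std * math.sqrt(2 * math.pi))


def toy_mean(state, skill):
    """Hypothetical skill dynamics: the skill is the predicted displacement."""
    return [s + z for s, z in zip(state, skill)]


def dads_style_reward(state, skill, next_state, prior_skills, fixed_std=1.0):
    """log q(s'|s,z) minus the log of the mean of q(s'|s,z_i) over prior samples."""
    log_num = gaussian_log_prob(next_state, toy_mean(state, skill), fixed_std)
    log_probs = [
        gaussian_log_prob(next_state, toy_mean(state, z), fixed_std)
        for z in prior_skills
    ]
    # Log-mean-exp, shifted by the maximum for numerical stability.
    shift = max(log_probs)
    mean_exp = sum(math.exp(lp - shift) for lp in log_probs) / len(log_probs)
    return log_num - (shift + math.log(mean_exp))


rng = random.Random(0)
# 100 samples from a uniform skill prior on [-1, 1]^2 (cf. n_skill_samples).
prior_skills = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
state, skill = [0.0, 0.0], [1.0, 1.0]

# The next state matches the prediction for the executed skill.
reward_hit = dads_style_reward(state, skill, [1.0, 1.0], prior_skills, 0.3)
# The next state is far from that prediction; other skills explain it better.
reward_miss = dads_style_reward(state, skill, [-1.0, -1.0], prior_skills, 0.3)
```

A smaller fixed_std makes the densities sharper, so skills have to lead to more distinct outcomes before other prior samples stop explaining the transition; this is the length scale of the ‘repulsion’ mentioned in the parameter description.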
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation.
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- get_copy() DADSLevel[source]
Get a copy of the level.
No parameters are copied, only the state of the level.
- Returns:
A copy of the level.
- get_fixed_std() Tensor[source]
Get the fixed standard deviation of the skill dynamics model.
Can depend on the total number of steps taken in the environment in case of a schedule.
- get_input_space() Space[source]
Get the input space of the level (the skill space).
- Returns:
The input space (skill space).
- get_reward(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict, action: Tensor, next_mapped_env_obs: Tensor, next_level_state: Dict, terminated: bool, cum_env_reward: Tensor, elapsed_env_steps: Tensor, use_next_level_state: bool = False) Tuple[Tensor, Dict][source]
Get the reward for the given batch of transitions.
Note that everything has a batch dimension. Also note that there is no next_level_input because the level input does not change while the level is in control.
- Parameters:
mapped_env_obs – The mapped environment observation.
level_input – The input to this level.
level_state – The state of the level.
action – The action this level took.
next_mapped_env_obs – The next mapped environment observation.
next_level_state – The next state of the level.
terminated – Whether the episode terminated.
cum_env_reward – The sum of the environment rewards that were received during the transition.
elapsed_env_steps – The number of environment steps that elapsed during the transition.
use_next_level_state – Whether to use the next level state instead of next_mapped_env_obs. This is useful for visualization purposes where the next state is set manually.
- Returns:
The reward and the reward info dict.
- get_tdmpc2_obs_shape(mapped_env_obs_shape: Tuple[int, ...], env_obs_space: Box) int[source]
Get the shape of the observation for the TD-MPC2 algorithm.
- initialize(parent_predictor, keep_params=False, *args, **kwargs) None[source]
Initialize the level.
- Parameters:
keep_params – Whether to keep the parameters of the level (e.g. the policy) when initializing. If False, the parameters are reset.
*args – Arguments for the TianshouLevel.
**kwargs – Keyword arguments for the TianshouLevel.
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process the transition and check whether the level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
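How control_interval and termination_probability of DADSLevel could combine into the returned flag is sketched below for a single environment instance. This is an assumption about the interplay of the two parameters, not the library's implementation; skill_is_done and the step counter are hypothetical.

```python
import random


def skill_is_done(steps_in_control, control_interval, termination_probability, rng):
    """Decide whether to hand control back to the level above."""
    if steps_in_control >= control_interval:
        # The control interval has elapsed.
        return True
    # Random early termination before the interval has elapsed.
    return rng.random() < termination_probability


rng = random.Random(0)
done_after_interval = skill_is_done(3, 3, 0.0, rng)
done_early_never = skill_is_done(1, 3, 0.0, rng)
done_early_always = skill_is_done(1, 3, 1.0, rng)
```

With termination_probability set to 0.0 (the default in the signature), the skill in this sketch always runs for the full control interval.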
- sample_from_skill_prior(batch_size: Size) Tensor[source]
Sample from the skill prior.
- Parameters:
batch_size – The batch size.
- Returns:
The sampled skill vectors in a tensor.
- class layeredrl.levels.SPlaTESLevel(skill_space_dim: int, control_interval: int = 10, n_skill_samples: int = 100, reward_clipping_low: float | Callable = -50.0, reward_clipping_high: float | Callable = 50.0, potential_clipping_low: float | Callable = -50.0, reward_offset: float = 1.0, fixed_std: float | Tensor | Callable | None = 0.1, potential_diff_reward: bool = False, use_policy: bool = False, freeze_after: int | None = None, *args, **kwargs)[source]
Bases: TDMPC2Level
A level that learns skills using Stable Planning with Temporally Extended Skills (SPlaTES).
- __init__(skill_space_dim: int, control_interval: int = 10, n_skill_samples: int = 100, reward_clipping_low: float | Callable = -50.0, reward_clipping_high: float | Callable = 50.0, potential_clipping_low: float | Callable = -50.0, reward_offset: float = 1.0, fixed_std: float | Tensor | Callable | None = 0.1, potential_diff_reward: bool = False, use_policy: bool = False, freeze_after: int | None = None, *args, **kwargs) None[source]
Initialize the level.
- Parameters:
skill_space_dim – The dimension of the skill space (or the number of skills in case of discrete skills).
control_interval – How long to stay in control before control is returned to the level above.
n_skill_samples – The number of skill samples to draw from the skill prior for the ‘denominator’ of the SPlaTES reward.
reward_clipping_low – The value below which to clip the reward.
reward_clipping_high – The value above which to clip the reward.
potential_clipping_low – The value below which to clip the potential. This can improve exploration in the initial phase of learning (while clipping the reward too aggressively will make the agent go up and down the potential).
reward_offset – Offset to add to the reward after standardization. This can be used to make the reward more positive in environments where the agent can terminate the episode early but is not supposed to do so.
fixed_std – If not None, the standard deviation of the skill dynamics model is set to this value when calculating the intrinsic reward. The model itself is not changed. This allows for tuning the length scale of the mutual ‘repulsion’ of the skills. Can also be a callable which maps the number of steps to a float in case of a schedule.
potential_diff_reward – Whether to use the potential difference as reward instead of the potential itself.
use_policy – Whether to use the policy instead of letting TD-MPC2 plan.
freeze_after – After how many steps to freeze the skills. Optional.
**kwargs – Keyword arguments for the TianshouLevel.
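Since fixed_std may be a callable that maps the number of steps to a float, a schedule can be passed instead of a constant. The following sketch shows one possible schedule, a clamped linear interpolation; linear_std_schedule and its arguments are hypothetical, and which step count the level passes to the callable is an assumption here.

```python
def linear_std_schedule(start, end, n_steps):
    """Return a callable mapping the step count to a standard deviation."""

    def schedule(step):
        # Fraction of the schedule that has elapsed, clamped to [0, 1].
        fraction = min(max(step / n_steps, 0.0), 1.0)
        return start + fraction * (end - start)

    return schedule


# Shrink the standard deviation from 1.0 to 0.1 over the first 100k steps.
fixed_std = linear_std_schedule(start=1.0, end=0.1, n_steps=100_000)
std_at_start = fixed_std(0)
std_halfway = fixed_std(50_000)
std_after_end = fixed_std(200_000)
```

Starting with a large value and decreasing it is only one option; a constant float or None remain valid according to the signature.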
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation.
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- get_copy() SPlaTESLevel[source]
Get a copy of the level.
No parameters are copied, only the state of the level.
- Returns:
A copy of the level.
- get_fixed_std() Tensor[source]
Get the fixed standard deviation of the skill dynamics model.
Can depend on the total number of steps taken in the environment in case of a schedule.
- get_input_space() Space[source]
Get the input space of the level (the skill space).
- Returns:
The input space (skill space).
- get_level_state(mapped_env_obs: Tensor, level_input: Tensor, active_instances: Tensor) Dict[source]
Compute the state of the level.
This contains all information relevant for determining the level action except for the environment observation and the level input, which are handled separately.
- Parameters:
mapped_env_obs – The mapped environment observation of the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
active_instances – The instances for which to return the level state.
- Returns:
The level state.
- get_level_state_dims() Dict[str, int][source]
Get the dimensions of the level state.
Implement this in derived classes.
- Returns:
A dictionary with the dimensions of each item in the level state.
- get_reward(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict, action: Tensor, next_mapped_env_obs: Tensor, next_level_state: Dict, terminated: bool, cum_env_reward: Tensor, elapsed_env_steps: Tensor, use_next_level_state: bool = False) Tuple[Tensor, Dict][source]
Get the reward for the given batch of transitions (difference of mutual information estimates).
Note that everything has a batch dimension. Also note that there is no next_level_input because the level input does not change while the level is in control.
- Parameters:
mapped_env_obs – The mapped environment observation.
level_input – The input to this level.
level_state – The state of the level.
action – The action this level took.
next_mapped_env_obs – The next mapped environment observation.
next_level_state – The next state of the level.
terminated – Whether the episode terminated.
cum_env_reward – The sum of the environment rewards that were received during the transition.
elapsed_env_steps – The number of environment steps that elapsed during the transition.
use_next_level_state – Whether to use the next level state instead of next_mapped_env_obs. This is useful for visualization purposes where the next state is set manually.
- Returns:
The reward and the reward info dict.
- get_tdmpc2_obs_shape(mapped_env_obs_shape: Tuple[int, ...], env_obs_space: Box) int[source]
Get the shape of the observation for the TD-MPC2 algorithm.
- initialize(env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor: Predictor, env_obs_map: Callable[[Tensor], Tensor] | None = None, mapped_env_obs_shape: int | None = None, keep_params: bool = False) None[source]
Construct the level.
- Parameters:
env_obs_space – The observation space of the environment.
action_space – The action space of the level. If this is the lowest level, this is the action space of the environment. Otherwise, it is the input space of the level below this one.
n_env_instances – The number of environment instances.
parent_predictor – The predictor of the parent level (or None if there is no parent or the parent does not have a predictor).
env_obs_map – A map that is applied to the environment observation. This can be used to implement information hiding and for moving a trained level from one environment to another with a different observation space. If None, the identity map is used.
mapped_env_obs_shape – Shape of the output of env_obs_map. If None, the shape of the environment observation space is used.
keep_params – Whether to keep the parameters of the level (e.g. the policy) when initializing. If False, the parameters are reset.
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process the transition and check whether the level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
- sample_from_skill_prior(batch_size: Size) Tensor[source]
Sample from the skill prior.
- Parameters:
batch_size – The batch size.
- Returns:
The sampled skill vectors in a tensor.
- class layeredrl.levels.PlannerLevel(partial_planner: Callable[[...], Planner], predictor_factory: Callable[[...], Predictor], initial_guess: Tensor, horizon: int = 10, shift_initialization: bool = True, verbose: bool = False, alternate_with_noise: bool = False, switch_random_prob: float = 0.1, switch_planner_prob: float = 0.1, only_noise: bool = False, resample_on_end: bool = False, resample_random_action_prob: float = 1.0, buffer_size: int = 10000, batch_size: int = 256, model_batch_size: int = 256, warm_up_steps: int = 1000, no_planning_steps: int = 100000, update_interval: int = 1, n_updates: int = 1, param_reset_freeze: int = 2000, log_interval: int = 1000, use_ensemble_disagreement: bool = False, **kwargs)[source]
Bases: Level
Level that does MPC with a provided predictor (for dynamics, reward, and value) and planner.
- __init__(partial_planner: Callable[[...], Planner], predictor_factory: Callable[[...], Predictor], initial_guess: Tensor, horizon: int = 10, shift_initialization: bool = True, verbose: bool = False, alternate_with_noise: bool = False, switch_random_prob: float = 0.1, switch_planner_prob: float = 0.1, only_noise: bool = False, resample_on_end: bool = False, resample_random_action_prob: float = 1.0, buffer_size: int = 10000, batch_size: int = 256, model_batch_size: int = 256, warm_up_steps: int = 1000, no_planning_steps: int = 100000, update_interval: int = 1, n_updates: int = 1, param_reset_freeze: int = 2000, log_interval: int = 1000, use_ensemble_disagreement: bool = False, **kwargs)[source]
Initialize the level.
- Parameters:
partial_planner – A partial planner that expects predictor, action space etc. The Planner is responsible for the optimization of the action sequence.
predictor_factory – A factory that creates a new instance of Predictor. The Predictor is responsible for the dynamics model, reward and value functions (and potentially for mapping the observation to a latent space).
initial_guess – The initial guess for the optimal action. Shape: (action_dim, )
horizon – The horizon for planning.
shift_initialization – Whether to shift the initialization of the action sequence by the number of steps that have been executed in the environment.
verbose – Whether to print additional information during planning.
alternate_with_noise – Whether to alternate between planning and taking random actions.
switch_random_prob – The probability of switching from planning to random actions (per call to get_action).
switch_planner_prob – The probability of switching from random actions back to planning (per call to get_action).
only_noise – Whether to use only noise instead of planning (and noise).
resample_on_end – Whether to resample the action only after the end of an episode.
resample_random_action_prob – The probability of resampling the action when in random mode. By default, the action is resampled every time.
buffer_size – The size of the replay buffer being filled with the transitions of this level.
batch_size – The batch size for learning the predictor (reward and value).
model_batch_size – The batch size for learning the dynamics model.
warm_up_steps – The number of steps without learning (only filling the replay buffer).
no_planning_steps – The number of steps without planning (only random actions). When the lower level has not learned anything yet, planning would be wasteful.
update_interval – The interval at which the predictor is updated (in terms of new transitions on this level and not in the environment).
n_updates – The number of updates per interval.
param_reset_freeze – The number of steps after a parameter reset during which no learning is performed. This gives the lower level(s) some time to recover from the parameter reset before the predictor is updated again.
log_interval – The interval in which to log statistics.
use_ensemble_disagreement – Whether to use ensemble disagreement as reward when planning.
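The interplay of alternate_with_noise, switch_random_prob and switch_planner_prob can be pictured as a two-state Markov chain that is advanced once per call to get_action. The following is a minimal sketch under that assumption, not the library's implementation; the names next_mode and random_mode_fraction and the mode labels "planner" and "random" are hypothetical.

```python
import random


def next_mode(mode, switch_random_prob, switch_planner_prob, rng):
    """Advance a hypothetical two-state chain by one call to get_action."""
    if mode == "planner":
        # Leave planning with probability switch_random_prob.
        return "random" if rng.random() < switch_random_prob else "planner"
    # Leave random mode with probability switch_planner_prob.
    return "planner" if rng.random() < switch_planner_prob else "random"


def random_mode_fraction(switch_random_prob=0.1, switch_planner_prob=0.1,
                         n_calls=100_000, seed=0):
    """Estimate the fraction of calls spent in random mode by simulation."""
    rng = random.Random(seed)
    mode = "planner"  # assumption: the chain starts in planning mode
    n_random = 0
    for _ in range(n_calls):
        mode = next_mode(mode, switch_random_prob, switch_planner_prob, rng)
        n_random += mode == "random"
    return n_random / n_calls
```

Under this model the long-run share of random calls is switch_random_prob / (switch_random_prob + switch_planner_prob), so the defaults of 0.1 each would give a roughly even split, with each mode lasting about ten calls on average.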
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation.
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
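The shift_initialization option of PlannerLevel describes a common warm start in MPC: the previously optimized action sequence is moved forward by the number of executed steps before it is used to initialize the next planning call. The sketch below illustrates the idea with plain Python lists. It assumes that the freed tail of the sequence is refilled with initial_guess, which the description above does not state; shift_plan is a hypothetical helper, not part of the library.

```python
from typing import List, Sequence


def shift_plan(plan: Sequence[Sequence[float]],
               executed_steps: int,
               initial_guess: Sequence[float]) -> List[List[float]]:
    """Shift a plan of shape (horizon, action_dim) forward in time."""
    horizon = len(plan)
    # Clamp the shift so the result always has exactly `horizon` entries.
    k = min(max(executed_steps, 0), horizon)
    # Drop the first k actions, which have already been executed.
    shifted = [list(action) for action in plan[k:]]
    # Assumption: refill the tail with copies of the initial guess.
    shifted += [list(initial_guess) for _ in range(k)]
    return shifted
```

The returned sequence keeps the original horizon, so it can serve as the starting point of the next optimization without any reshaping.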
- get_copy() Level[source]
Get a copy of the level.
No parameters are copied, only the state of the level.
- Returns:
A copy of the level.
- get_input_space() Space[source]
Get the input space of the level.
Input space denotes everything the level needs except for the environment observation, e.g. a skill vector.
- Returns:
The input space (or None if no input is required).
- get_reward(mapped_env_obs: Tensor, level_input: Tensor | None, level_state: Dict, action: Tensor, next_mapped_env_obs: Tensor, next_level_state: Dict, terminated: bool, cum_env_reward: Tensor, elapsed_env_steps: Tensor) Tuple[Tensor, Dict][source]
Get the reward for the given batch of transitions.
Note that everything has a batch dimension. Also note that there is no next_level_input because the level input does not change while the level is in control.
- Parameters:
mapped_env_obs – The mapped environment observation.
level_input – The input to this level.
level_state – The state of the level.
action – The action this level took.
next_mapped_env_obs – The next mapped environment observation.
next_level_state – The next state of the level.
terminated – Whether the episode terminated.
cum_env_reward – The sum of the environment rewards that were received during the transition.
elapsed_env_steps – The number of environment steps that elapsed during the transition.
- Returns:
The reward and the reward info dict.
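A transition on this level can span several environment steps, which is why get_reward receives cum_env_reward and elapsed_env_steps rather than a single environment reward. The sketch below shows how these two quantities relate to the per-step rewards for a batch of instances. It uses nested lists as a stand-in for tensors, and summarize_transitions is a hypothetical helper, not a library function.

```python
from typing import List, Sequence, Tuple


def summarize_transitions(
    per_instance_rewards: Sequence[Sequence[float]],
) -> Tuple[List[float], List[int]]:
    """Reduce per-step environment rewards to per-transition summaries.

    per_instance_rewards[i] holds the environment rewards received in
    instance i while the level's current action was being executed.
    """
    # Sum of environment rewards during each transition.
    cum_env_reward = [float(sum(rewards)) for rewards in per_instance_rewards]
    # Number of environment steps each transition took.
    elapsed_env_steps = [len(rewards) for rewards in per_instance_rewards]
    return cum_env_reward, elapsed_env_steps
```

Both results keep the batch dimension, with one entry per environment instance.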
- initialize(env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor: Predictor, env_obs_map: Callable[[Tensor], Tensor] | None = None, mapped_env_obs_shape: Tuple | None = None, keep_params: bool = False) None[source]
Construct the level.
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process transition and check whether level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
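Once transitions reach the replay buffer, the constructor parameters warm_up_steps, update_interval and n_updates of PlannerLevel determine when the predictor is trained. The following sketch shows one plausible reading of that schedule. The exact counting convention (for example whether the step at which the count equals warm_up_steps already triggers an update) is an assumption, and updates_due is a hypothetical helper.

```python
def updates_due(n_transitions: int,
                warm_up_steps: int = 1000,
                update_interval: int = 1,
                n_updates: int = 1) -> int:
    """Return how many predictor updates to run after the latest transition.

    n_transitions counts transitions on this level, not environment steps.
    """
    if n_transitions < warm_up_steps:
        # Warm-up phase: only fill the replay buffer.
        return 0
    if n_transitions % update_interval != 0:
        # Not at an update boundary yet.
        return 0
    return n_updates
```

With update_interval=2 and n_updates=4, for example, this schedule would run four updates after every second transition once the warm-up is over.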
- set_n_env_instances(n_env_instances: int) None[source]
Set the number of environment instances.
- Parameters:
n_env_instances – The number of environment instances.
- class layeredrl.levels.ConstantLevel(action: Tensor)[source]
Bases: Level
A level that outputs a constant action.
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation (in this case a constant).
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- class layeredrl.levels.TDMPC2Level(tdmpc2_config: Dict[str, Any] | None = None, update_interval: int = 1, n_updates: int = 1, reward_calc_interval: int = 1, warm_up_steps: int = 2000, goal_based: bool = False, goal_dim: int = 0, use_her: bool = False, achieved_goal_range: list | None = None, desired_goal_range: list | None = None, reward_func: Callable | None = None, reward_offset: float = 0.0, **kwargs)[source]
Bases: Level
- __init__(tdmpc2_config: Dict[str, Any] | None = None, update_interval: int = 1, n_updates: int = 1, reward_calc_interval: int = 1, warm_up_steps: int = 2000, goal_based: bool = False, goal_dim: int = 0, use_her: bool = False, achieved_goal_range: list | None = None, desired_goal_range: list | None = None, reward_func: Callable | None = None, reward_offset: float = 0.0, **kwargs)[source]
Level that uses TD-MPC2 to plan and learn a policy.
For an explanation of TD-MPC2 see: https://arxiv.org/abs/2310.16828
- Parameters:
tdmpc2_config – The configuration for the TD-MPC2 algorithm.
update_interval – The interval in which the level/the policy is updated.
n_updates – The number of updates per interval.
reward_calc_interval – The interval in which the reward is calculated. Setting this to something larger than 1 can be useful for performance reasons (if the reward is computed later during sampling).
warm_up_steps – The number of steps to collect with random actions before learning starts.
goal_based – Whether TDMPC2 should treat the goal separately. Requires desired_goal_range.
goal_dim – The dimension of the goal.
use_her – Whether to use hindsight experience replay. Requires goal_based and achieved_goal_range.
achieved_goal_range – The slice of the observation that contains the achieved goal.
desired_goal_range – The slice of the observation that contains the desired goal.
reward_func – A function that computes the reward from the achieved and desired goal.
reward_offset – An offset that is added to the reward.
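The parameters use_her, achieved_goal_range, desired_goal_range, reward_func and reward_offset fit together in a hindsight relabeling step. The sketch below illustrates that step on flat Python lists. It assumes that each range is a [start, stop) pair of indices into the observation and that the relabeled goal comes from a later observation of the same trajectory; neither detail is stated above. The names relabel_transition and sparse_reward are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def sparse_reward(achieved: Sequence[float], desired: Sequence[float]) -> float:
    """Hypothetical reward_func: 0 if the goal is reached, -1 otherwise."""
    dist = sum((a - d) ** 2 for a, d in zip(achieved, desired)) ** 0.5
    return 0.0 if dist < 1e-3 else -1.0


def relabel_transition(
    obs: Sequence[float],
    next_obs: Sequence[float],
    future_obs: Sequence[float],
    achieved_goal_range: Sequence[int],
    desired_goal_range: Sequence[int],
    reward_func: Callable[[Sequence[float], Sequence[float]], float],
    reward_offset: float = 0.0,
) -> Tuple[List[float], List[float], float]:
    """Replace the desired goal by a goal achieved later in the trajectory."""
    a0, a1 = achieved_goal_range
    d0, d1 = desired_goal_range
    # The goal that was actually achieved in a later observation.
    new_goal = list(future_obs[a0:a1])
    # Write the new goal into the desired-goal slice of both observations.
    relabeled_obs = list(obs)
    relabeled_obs[d0:d1] = new_goal
    relabeled_next_obs = list(next_obs)
    relabeled_next_obs[d0:d1] = new_goal
    # Recompute the reward with respect to the new goal.
    reward = reward_func(next_obs[a0:a1], new_goal) + reward_offset
    return relabeled_obs, relabeled_next_obs, reward
```

Because the reward has to be recomputed for every relabeled sample, computing it lazily at sampling time is one reason a setting such as reward_calc_interval can matter for performance.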
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation using TD-MPC2 (so MPC with MPPI + policy).
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- get_input_space() Space[source]
Get the input space of the level.
Input space denotes everything the level needs except for the environment observation, e.g. a skill vector.
- Returns:
The input space (or None if no input is required).
- get_tdmpc2_obs_shape(mapped_env_obs_shape: Tuple[int, ...], env_obs_space: Box) int[source]
Get the shape of the observation for the TD-MPC2 algorithm.
- initialize(env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor: Predictor, env_obs_map: Callable[[Tensor], Tensor] | None = None, mapped_env_obs_shape: int | None = None, keep_params: bool = False) None[source]
Construct the level.
- Parameters:
env_obs_space – The observation space of the environment.
action_space – The action space of the level. If this is the lowest level, this is the action space of the environment. Otherwise, it is the input space of the level below this one.
n_env_instances – The number of environment instances.
parent_predictor – The predictor of the parent level (or None if there is no parent or the parent does not have a predictor).
env_obs_map – A map that is applied to the environment observation. This can be used to implement information hiding and for moving a trained level from one environment to another with a different observation space. If None, the identity map is used.
mapped_env_obs_shape – Shape of the output of env_obs_map. If None, the shape of the environment observation space is used.
keep_params – Whether to keep the parameters of the level (e.g. the policy) when initializing. If False, the parameters are reset.
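An env_obs_map used for information hiding can be as simple as a function that keeps a subset of the observation entries. The sketch below shows such a map on a batch of observations. It uses nested lists as a stand-in for the Tensor that the real callable receives, and make_index_map is a hypothetical helper, not part of the library.

```python
from typing import Callable, List, Sequence


def make_index_map(
    keep_indices: Sequence[int],
) -> Callable[[Sequence[Sequence[float]]], List[List[float]]]:
    """Build a map that keeps only the given entries of each observation."""

    def env_obs_map(batch: Sequence[Sequence[float]]) -> List[List[float]]:
        # Apply the same selection to every row of the batch.
        return [[row[i] for i in keep_indices] for row in batch]

    return env_obs_map
```

In this sketch the mapped observation has len(keep_indices) entries per instance, which is the kind of information mapped_env_obs_shape is meant to carry.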
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process transition and check whether level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.
- class layeredrl.levels.ActionSequenceLevel(action_sequence: Tensor, repeat: int = 1, **kwargs)[source]
Bases: Level
A level that outputs a pre-defined sequence of actions.
- __init__(action_sequence: Tensor, repeat: int = 1, **kwargs)[source]
Initialize the level.
- Parameters:
action_sequence – The action sequence to output.
repeat – How often to repeat each action.
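The effect of repeat can be illustrated by expanding the sequence into the actions that would be emitted call by call. This is a sketch of the described behaviour only: what happens once the sequence is exhausted is not specified above and is left out, and expand_sequence is a hypothetical helper.

```python
from typing import List, Sequence, TypeVar

A = TypeVar("A")


def expand_sequence(action_sequence: Sequence[A], repeat: int = 1) -> List[A]:
    """List the emitted actions when each action is output `repeat` times."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    # Each action appears `repeat` times in a row, in the original order.
    return [action for action in action_sequence for _ in range(repeat)]
```

A sequence of length n with repeat=r therefore covers n * r calls in this reading.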
- get_action(mapped_env_obs: Tensor, level_input: Tensor | None, level_input_info: Dict | None, active_instances: Tensor = tensor([True])) Tensor[source]
Get an action for the given observation (in this case the next action of the pre-defined sequence).
Note that only the action for the active instances is returned.
Call this at the beginning of the implementation of get_action in derived classes.
- Parameters:
mapped_env_obs – The environment observation after the self.env_obs_map has been applied. Note that the observation has a batch dimension (for multiple environment instances).
level_input – The input to this level for the active instances, i.e., the action from the level above.
level_input_info – Additional information about the level input.
active_instances – In which of the environment instances the level is active. mapped_env_obs and level_input correspond to these instances.
- Returns:
The action (also with a batch dimension). An info dict containing additional information about the action.
- get_input_space() Space[source]
Get the input space of the level.
Input space denotes everything the level needs except for the environment observation, e.g. a skill vector.
- Returns:
The input space (or None if no input is required).
- process_transition(mapped_env_obs: Tensor, level_input: Tensor | None, action: Tensor, next_mapped_env_obs: Tensor, terminated: Tensor, truncated: Tensor, active_instances: Tensor) bool[source]
Process transition and check whether level would like to return control to the level above.
This usually involves adding the transition to the replay buffer and possibly preprocessing it.
Note that everything has a batch dimension.
- Parameters:
mapped_env_obs – The mapped environment observation for the active instances.
level_input – The input to this level for the active instances, i.e., the action from the level above.
action – The action that was taken by the level.
next_mapped_env_obs – The next mapped environment observation for the active instances.
terminated – Whether the episode terminated for the active instances.
truncated – Whether the episode was truncated for the active instances.
active_instances – In which of the environment instances the level is active. next_mapped_env_obs and terminated correspond to these instances.
- Returns:
Whether the level is done, i.e. whether it hands control back to the level above.