Predictors API

class layeredrl.predictors.Predictor(model: Model, val_func: Module | None, rew_func: Module | None, encoder: Module, latent_state_dim: int, context_dim: int, device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 10)[source]

Bases: ABC, Module

Abstract base class for prediction modules.

This includes the dynamics and reward model and the value function.

__init__(model: Model, val_func: Module | None, rew_func: Module | None, encoder: Module, latent_state_dim: int, context_dim: int, device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 10)[source]

Initialize the predictor.

Parameters:

model – The dynamics and termination model. Takes in latent state, context, and action.
val_func – Predicts the value given the latent state, the context, and the action. Thus, this could be a Q-function or a value function (that ignores the action). Required for training the predictor.
rew_func – Predicts the reward given the latent state, the context, and the action.
encoder – Maps the observation to the latent state and context.
latent_state_dim – The dimension of the latent state space.
context_dim – The dimension of the context variable (encoding time invariant information that stays constant during an episode).
device – The device to use.
writer – A tensorboard summary writer for logging (optional).
log_interval – The interval in which to log to tensorboard (in calls to learn).

encode(obs: Tensor) → Tuple[Tensor, Tensor][source]

Encode the observation into the latent state and context.

Parameters:

obs – The observation.

Returns:

The latent state and context.
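The sketch below illustrates the (latent state, context) return convention on plain Python lists. The helper `split_encoding` is hypothetical, and the assumption that an encoder emits one flat vector with the latent state first and the context last is made only for illustration; the actual encoder in layeredrl may be structured differently.

```python
from typing import List, Sequence, Tuple


def split_encoding(
    encoded: Sequence[float], latent_state_dim: int, context_dim: int
) -> Tuple[List[float], List[float]]:
    """Split a flat vector into (latent_state, context).

    Hypothetical helper: assumes the latent state comes first and the
    context second, which may not match the real encoder.
    """
    if len(encoded) != latent_state_dim + context_dim:
        raise ValueError("encoded vector has unexpected length")
    latent_state = list(encoded[:latent_state_dim])
    context = list(encoded[latent_state_dim:])
    return latent_state, context


latent_state, context = split_encoding(
    [0.1, 0.2, 0.3, 1.0], latent_state_dim=3, context_dim=1
)
```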

abstractmethod learn(buffer: ReplayBuffer, n_steps: int, batch_size: int, model_batch_size: int, n_total_env_steps: int) → Tuple[Tensor, Dict][source]

Learn from data sampled from the given replay buffer.

Parameters:

buffer – The replay buffer. Partial trajectories will be sampled from this buffer.
n_steps – The number of optimizer steps to take.
batch_size – The batch size, i.e., the number of partial trajectories to sample.
model_batch_size – The batch size for the model training.
n_total_env_steps – The total number of environment steps taken so far. Useful for logging and learning rate schedules.

Returns:

The loss and an info dictionary with the individual loss terms.

abstractmethod loss(batch: Batch) → Tuple[Tensor, Dict][source]

Compute the loss for the given batch.

Parameters:

batch – The batch. The first dimension corresponds to the batch dimension (e.g. environments). For example, batch.state.shape = (batch_size, per_env_size, state_dim).

Returns:

The loss and an info dictionary with the individual loss terms.
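To illustrate the contract of the abstract loss method (a total loss plus a dictionary of individual terms), here is a minimal sketch using only the standard library. `PredictorSketch` and `MeanSquaredSketch` are hypothetical stand-ins, not classes from layeredrl, and the batch is a plain list of floats rather than a Batch object.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class PredictorSketch(ABC):
    """Hypothetical stand-in for the abstract interface (not the real class)."""

    @abstractmethod
    def loss(self, batch: List[float]) -> Tuple[float, Dict[str, float]]:
        """Return the total loss and a dict with the individual loss terms."""


class MeanSquaredSketch(PredictorSketch):
    """Toy subclass whose only loss term is the mean of squared entries."""

    def loss(self, batch: List[float]) -> Tuple[float, Dict[str, float]]:
        mse = sum(x * x for x in batch) / len(batch)
        return mse, {"mse": mse}


total, info = MeanSquaredSketch().loss([1.0, 2.0, 3.0])
```

A subclass that leaves an abstract method unimplemented cannot be instantiated; Python raises a TypeError in that case.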

predict(obs: Tensor, act: Tensor) → Tensor[source]

Predict the next latent state, reward etc. given the current observation and action.

Parameters:

obs – The observation.
act – The action.

Returns:

Mean, weights, and std of the predicted mixture of Gaussians for the next latent state, expected reward, and termination probability.

s_mean shape: (batch_size, n_models, n_modes, latent_state_dim)
weights shape: (batch_size, n_models, n_modes)
s_std shape: (batch_size, n_models, n_modes, latent_state_dim)
reward shape: (batch_size, )
term_prob shape: (batch_size, )
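As a small worked example of how the mixture outputs relate, the sketch below computes the expected next latent state for a single batch element and a single model, i.e. for s_mean of shape (n_modes, latent_state_dim) and weights of shape (n_modes,). The helper `mixture_mean` is hypothetical, and it assumes the weights are already normalized to sum to one.

```python
from typing import List


def mixture_mean(s_mean: List[List[float]], weights: List[float]) -> List[float]:
    """Expected value of a Gaussian mixture.

    s_mean: per-mode means, shape (n_modes, latent_state_dim).
    weights: mixture weights, shape (n_modes,), assumed to sum to 1.
    """
    dim = len(s_mean[0])
    return [
        sum(w * mode[d] for w, mode in zip(weights, s_mean))
        for d in range(dim)
    ]


# Two modes in a two-dimensional latent space.
expected = mixture_mean([[0.0, 2.0], [4.0, 6.0]], weights=[0.25, 0.75])
```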

predict_reward(obs: Tensor, act: Tensor) → Tensor[source]

Predict the reward given the current observation and action.

Parameters:

obs – The observation.
act – The action.

Returns:

The predicted reward.

rollout(obs: Tensor, act: Tensor, deterministic: bool = False) → Tuple[Batch, Tensor][source]

Roll out the model given the current observation and action.

Parameters:

obs – The observation.
act – The action.
deterministic – Whether to roll out deterministically.

Returns:

The rollout data and the contexts for the trajectories.

static sample_partial_trajectories(buffer: ReplayBuffer, batch_size: int, n_steps: int, device: device = device(type='cpu')) → Tuple[ndarray, Tensor, Tensor][source]

Sample partial trajectories from the replay buffer.

Parameters:

buffer – The replay buffer.
batch_size – The batch size, i.e., the number of partial trajectories to sample.
n_steps – The number of steps per partial trajectory.
device – The device to use.

Returns:

A numpy array with the indices (in the buffer) of the sampled partial trajectories. Shape: (n_steps, batch_size)
A numpy array with the length of each partial trajectory (which can be different as some trajectories might have terminated). Shape: (batch_size,)
A numpy array with the validity mask for each partial trajectory. The mask is 1.0 for valid time steps and 0.0 for invalid time steps beyond the end of an episode. Shape: (batch_size, n_steps)
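The relation between trajectory lengths and the validity mask can be sketched as follows, using nested lists in place of arrays. Both helpers are hypothetical and only illustrate the documented semantics (1.0 for valid steps, 0.0 beyond the end of an episode); they are not the sampling code of the library.

```python
from typing import List


def validity_mask(lengths: List[int], n_steps: int) -> List[List[float]]:
    """Build a (batch_size, n_steps) mask from per-trajectory lengths."""
    return [
        [1.0 if t < length else 0.0 for t in range(n_steps)]
        for length in lengths
    ]


def masked_mean(values: List[List[float]], mask: List[List[float]]) -> float:
    """Average per-step values over valid time steps only."""
    total = sum(
        v * m
        for row_v, row_m in zip(values, mask)
        for v, m in zip(row_v, row_m)
    )
    count = sum(m for row in mask for m in row)
    return total / max(count, 1.0)


# The second trajectory ended after one step, so its padded steps are ignored.
mask = validity_mask([3, 1], n_steps=3)
mean_loss = masked_mean([[1.0, 2.0, 3.0], [4.0, 9.0, 9.0]], mask)
```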

value(obs: Tensor, act: Tensor) → Tensor[source]

Predict the value of the given state-action pair.

Parameters:

obs – The observation.
act – The action.

Returns:

The value.

class layeredrl.predictors.PredictorFactory(partial_model: Callable[[...], Model], partial_val_func: Callable[[...], Module | Callable] | None, partial_rew_func: Callable[[...], Module | Callable], partial_encoder: Callable[[...], Module | Callable], partial_predictor: Callable[[...], Predictor])[source]

Bases: object

A factory class creating predictors taking the latent state and context dimension into account.

__call__(mapped_env_obs_shape, action_space, device: device, writer: SummaryWriter) → Predictor[source]

Create a new predictor.

Parameters:

mapped_env_obs_shape – The shape of the mapped environment observation.
action_space – The action space associated with the predictor.
device – The device to use.
writer – A tensorboard summary writer for logging.

Returns:

A new predictor.

__init__(partial_model: Callable[[...], Model], partial_val_func: Callable[[...], Module | Callable] | None, partial_rew_func: Callable[[...], Module | Callable], partial_encoder: Callable[[...], Module | Callable], partial_predictor: Callable[[...], Predictor])[source]

Initialize the predictor factory.

Parameters:

partial_model – A partial model (expecting spaces).
partial_val_func – A partial value function (expecting latent state and context dim).
partial_rew_func – A partial reward function (expecting latent state, context dim, and action dim).
partial_encoder – A partial encoder (expecting mapped env obs shape, latent state dim, and context dim).
partial_predictor – A partial predictor (without model, value function, and map to latent).
latent_state_dim – The dimension of the latent state space.
context_dim – The dimension of the context variable (encoding time invariant information).
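The "partial" arguments follow the usual deferred-construction pattern: hyperparameters are fixed up front and the remaining arguments are supplied once the dimensions are known. The sketch below shows this with functools.partial. `TinyEncoder` and its arguments are hypothetical placeholders and do not reflect the constructors layeredrl actually expects.

```python
from functools import partial


class TinyEncoder:
    """Hypothetical component that only records its constructor arguments."""

    def __init__(
        self, obs_dim: int, latent_state_dim: int, context_dim: int, hidden: int = 64
    ) -> None:
        self.obs_dim = obs_dim
        self.latent_state_dim = latent_state_dim
        self.context_dim = context_dim
        self.hidden = hidden


# Fix the hyperparameters now ...
partial_encoder = partial(TinyEncoder, hidden=32)

# ... and supply the dimensions later, once they are known.
encoder = partial_encoder(obs_dim=10, latent_state_dim=4, context_dim=2)
```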

class layeredrl.predictors.RewardPredictor(model: Model, val_func: Module | None, rew_func: Module | None, encoder: Module, latent_state_dim: int, context_dim: int, gamma: float = 0.99, tau: float = 0.005, k_steps: int = 1, n_steps: int = 1, learn_encoder: bool = True, consistency_loss_weight: float = 1.0, value_loss_weight: float = 1.0, rew_loss_weight: float = 1.0, delta_standardizer_lr: float = 0.001, encoder_lr: Schedule | float = 0.0003, reward_lr: Schedule | float = 0.0003, model_lr: Schedule | float = 0.0003, value_lr: Schedule | float = 0.0003, encoder_warm_up: int = 20, delta_standardizer: bool = True, clip_grad_max_norm: float | None = 20.0, device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 10)[source]

Bases: Predictor

Predictor that learns the encoder via the reward function.

Gradients from fitting the reward function are backpropagated to the encoder.

__init__(model: Model, val_func: Module | None, rew_func: Module | None, encoder: Module, latent_state_dim: int, context_dim: int, gamma: float = 0.99, tau: float = 0.005, k_steps: int = 1, n_steps: int = 1, learn_encoder: bool = True, consistency_loss_weight: float = 1.0, value_loss_weight: float = 1.0, rew_loss_weight: float = 1.0, delta_standardizer_lr: float = 0.001, encoder_lr: Schedule | float = 0.0003, reward_lr: Schedule | float = 0.0003, model_lr: Schedule | float = 0.0003, value_lr: Schedule | float = 0.0003, encoder_warm_up: int = 20, delta_standardizer: bool = True, clip_grad_max_norm: float | None = 20.0, device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 10)[source]

Initialize the predictor.

Parameters:

model – The dynamics, reward and termination model. Takes in latent state and action.
val_func – Predicts the value given the latent state, the context, and the action. Thus, this could be a Q-function or a value function (that ignores the action).
rew_func – Predicts the reward given the latent state, the context, and the action.
encoder – Maps the observation to the latent state and context.
latent_state_dim – The dimension of the latent state space.
context_dim – The dimension of the context variable (encoding time invariant information).
gamma – The discount factor.
tau – The soft update rate for the target networks.
k_steps – The length of sampled trajectory parts.
n_steps – The number of steps in the n-step return target for the value function.
learn_encoder – Whether to learn the map to the latent space.
consistency_loss_weight – The weight of the consistency loss.
value_loss_weight – The weight of the value loss.
rew_loss_weight – The weight of the reward loss.
delta_standardizer_lr – The learning rate for the delta standardizer.
encoder_lr – The learning rate for the encoder (can also be a schedule in total env steps).
reward_lr – The learning rate for the reward function (can also be a schedule in total env steps).
model_lr – The learning rate for the model (can also be a schedule in total env steps).
value_lr – The learning rate for the value function (can also be a schedule in total env steps).
encoder_warm_up – The number of steps during which only the map to the latent space is learned.
clip_grad_max_norm – If not None, the gradients are clipped to the given maximum norm.
device – The device to use.
writer – A tensorboard summary writer for logging (optional).
log_interval – The interval in which to log to tensorboard (in calls to learn).
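The learning-rate arguments accept either a float or a schedule in total env steps. The sketch below assumes that a schedule is simply a callable mapping the total number of environment steps to a learning rate; the actual `Schedule` type is not specified on this page, so the alias, `linear_decay`, and `resolve_lr` are all hypothetical.

```python
from typing import Callable, Union

# Assumed signature: total env steps -> learning rate.
Schedule = Callable[[int], float]


def linear_decay(initial_lr: float, final_lr: float, decay_steps: int) -> Schedule:
    """Linearly interpolate from initial_lr to final_lr over decay_steps."""

    def schedule(n_total_env_steps: int) -> float:
        fraction = min(max(n_total_env_steps / decay_steps, 0.0), 1.0)
        return initial_lr + fraction * (final_lr - initial_lr)

    return schedule


def resolve_lr(lr: Union[Schedule, float], n_total_env_steps: int) -> float:
    """Return the current learning rate for a constant or a schedule."""
    return lr(n_total_env_steps) if callable(lr) else float(lr)


encoder_lr = linear_decay(3e-4, 3e-5, decay_steps=1_000_000)
halfway_lr = resolve_lr(encoder_lr, 500_000)
```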

compute_targets(buffer: ReplayBuffer, batch_size: int, k_steps: int, n_steps: int, gamma: float = 0.99) → Batch[source]

Compute the targets for the given indices.

Note: This bootstraps at the end of an unfinished episode (an episode which is incomplete in the buffer because it has not ended yet). In other words, the end of an unfinished episode is treated like a truncated episode.

All rewards in a k_steps + n_steps trajectory are first added up with discounting and then differences of these values are considered. This lowers the complexity from O(k * n) to O(k + n).

Parameters:

buffer – The replay buffer.
batch_size – The batch size, i.e., the number of partial trajectories to sample.
k_steps – The length of sampled trajectory parts.
n_steps – The number of steps in the n-step return target for the value function.
gamma – The discount factor.

Returns:

A batch containing the states, rewards, bootstrap values etc.
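The complexity remark can be illustrated with a prefix-sum sketch: discounted rewards are accumulated once over the k_steps + n_steps window, and each n-step return is then a difference of two prefix values. The function below is a hypothetical illustration on plain lists, not the library's implementation. It assumes gamma > 0 and ignores terminations and validity masks.

```python
from typing import List


def n_step_targets(
    rewards: List[float],
    values: List[float],
    k_steps: int,
    n_steps: int,
    gamma: float = 0.99,
) -> List[float]:
    """n-step return targets for the first k_steps states of a window.

    rewards[t] is the reward after step t and values[t] the bootstrap value
    of state t; both lists have length k_steps + n_steps.
    """
    assert len(rewards) == len(values) == k_steps + n_steps
    # prefix[t] = sum over i < t of gamma**i * rewards[i], built in O(k + n).
    prefix = [0.0]
    discount = 1.0
    for r in rewards:
        prefix.append(prefix[-1] + discount * r)
        discount *= gamma
    targets = []
    for t in range(k_steps):
        # Difference of prefix sums, rescaled so discounting starts at step t.
        discounted_rewards = (prefix[t + n_steps] - prefix[t]) / gamma**t
        targets.append(discounted_rewards + gamma**n_steps * values[t + n_steps])
    return targets


rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
values = [0.5, 0.4, 0.3, 0.2, 0.1]
targets = n_step_targets(rewards, values, k_steps=3, n_steps=2, gamma=0.9)
```

Dividing by gamma**t can lose precision for long windows and small discount factors, so a production implementation might accumulate in the opposite direction instead.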

get_standardized_log_prob(state: Tensor, context: Tensor, action: Tensor, next_state: Tensor, fixed_std: Tensor) → Tuple[source]

Get log probability (density) of next state in the standardized space, the termination probability, and an info dict.

Note that everything is assumed to have a ‘batch’ dimension, useful for parallelization.

Parameters:

state – The current state.
context – The context, i.e., information that is constant over timesteps.
action – The action.
next_state – The next state.

Returns:

A tuple containing:

The log probability (density) of the next state given the current state and action under the model.
The termination probability given the current state and action under the model.
An info dict with additional information.
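For orientation, the log-density of a diagonal Gaussian with a fixed standard deviation can be computed as in the sketch below. This is only the textbook formula on plain lists. How the library standardizes states, whether a mixture is evaluated, and how fixed_std enters are not documented here, so `diag_gaussian_log_prob` should be read as a hypothetical illustration.

```python
import math
from typing import Sequence


def diag_gaussian_log_prob(
    x: Sequence[float], mean: Sequence[float], std: Sequence[float]
) -> float:
    """Log-density of x under a Gaussian with diagonal covariance."""
    log_prob = 0.0
    for x_i, mean_i, std_i in zip(x, mean, std):
        z = (x_i - mean_i) / std_i
        log_prob += -0.5 * z * z - math.log(std_i) - 0.5 * math.log(2.0 * math.pi)
    return log_prob


# At the mean with unit std in two dimensions, the result is -log(2 * pi).
log_p = diag_gaussian_log_prob([0.0, 1.0], mean=[0.0, 1.0], std=[1.0, 1.0])
```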

learn(buffer: ReplayBuffer, n_steps: int, batch_size: int, model_batch_size: int, n_total_env_steps: int) → Tuple[Tensor, Dict][source]

Sample batches, calculate losses, and update models.

Parameters:

buffer – The replay buffer. Partial trajectories will be sampled from this buffer.
n_steps – The number of optimizer steps to take.
batch_size – The batch size, i.e., the number of partial trajectories to sample (for value and reward functions).
model_batch_size – The batch size for training the model.
n_total_env_steps – The total number of environment steps taken so far. Useful for logging and learning rate schedules.

Returns:

The loss and an info dictionary with the individual loss terms.

loss(batch: Batch, model_batch: Batch) → Tuple[Tensor, Dict][source]

Compute the loss for the given batch.

Parameters:

batch – Batch used for reward and value learning. The first dimension corresponds to the batch dimension (e.g. environments), and the second dimension corresponds to the n + k time steps. For example, batch.state.shape = (batch_size, n + k, state_dim).
model_batch – Batch used for training the model. The first dimension also corresponds to the batch dimension, but there is no second dimension for time steps, only individual transitions are sampled.

Returns:

The loss and an info dictionary with the individual loss terms.
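One plausible way to form the returned pair is a weighted sum of the individual terms, using the weights passed to the constructor. The sketch below assumes exactly that. The function name, the dictionary keys, and the choice to report unweighted terms are hypothetical, and the real loss may contain further terms.

```python
from typing import Dict, Tuple


def combine_losses(
    consistency_loss: float,
    value_loss: float,
    rew_loss: float,
    consistency_loss_weight: float = 1.0,
    value_loss_weight: float = 1.0,
    rew_loss_weight: float = 1.0,
) -> Tuple[float, Dict[str, float]]:
    """Weighted sum of loss terms plus an info dict (assumed combination)."""
    info = {
        "consistency_loss": consistency_loss,
        "value_loss": value_loss,
        "rew_loss": rew_loss,
    }
    total = (
        consistency_loss_weight * consistency_loss
        + value_loss_weight * value_loss
        + rew_loss_weight * rew_loss
    )
    return total, info


total_loss, loss_info = combine_losses(1.0, 2.0, 3.0, value_loss_weight=0.5)
```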

sample_model_batch(buffer: ReplayBuffer, batch_size: int) → Batch[source]

Sample a batch for training the model. Includes only state, next state, context, next context, and validity mask. The states are sampled randomly and do not belong to partial trajectories.

Parameters:

buffer – The replay buffer.
batch_size – The batch size.

Returns:

The batch.

soft_update(target: Module, source: Module, tau: float) → None[source]

update_target_nets() → None[source]
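The two methods above carry no description. Given that tau is documented as the soft update rate for the target networks, a common reading is Polyak averaging, where each target parameter moves a fraction tau towards the corresponding source parameter. The sketch below shows that rule on plain lists of floats. It is an assumption about the behavior, not the library's code, and `soft_update_sketch` is a hypothetical name.

```python
from typing import List


def soft_update_sketch(target: List[float], source: List[float], tau: float) -> None:
    """In-place Polyak averaging: target <- (1 - tau) * target + tau * source."""
    for i, (t, s) in enumerate(zip(target, source)):
        target[i] = (1.0 - tau) * t + tau * s


target_params = [0.0, 1.0]
source_params = [1.0, 1.0]
soft_update_sketch(target_params, source_params, tau=0.5)
```

With tau = 1.0 the target becomes a copy of the source, and with tau = 0.0 it stays unchanged; small values such as the default 0.005 make the target track the source slowly.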

class layeredrl.predictors.StaticPredictor(model: Model, val_func: Module | None, rew_func: Module | None, encoder: Module, latent_state_dim: int, context_dim: int, device: device = device(type='cpu'), writer: SummaryWriter | None = None, log_interval: int = 10)[source]

Bases: Predictor

A static predictor without any training functionality.

learn(buffer: ReplayBuffer, n_steps: int, batch_size: int) → None[source]

Learn from the given batch.

Parameters:

buffer – The replay buffer. Partial trajectories will be sampled from this buffer.
n_steps – The number of optimizer steps to take.
batch_size – The batch size, i.e., the number of partial trajectories to sample.

Returns:

The loss after the updates.

loss(batch: Batch) → Tensor[source]

Compute the loss for the given batch.

Parameters:

batch – The batch. The first dimension corresponds to the batch dimension (e.g. environments). For example, batch.state.shape = (batch_size, per_env_size, state_dim).

Returns:

The loss.

layeredrl.predictors.get_default_predictor_factory(env: Env, sb_start_duration: float) → PredictorFactory[source]

Get a default predictor factory.

The predictor factory creates a RewardPredictor object. The method assumes that the environment is goal-based and interprets the desired goal as the context and the achieved goal as the state for the planner level.

Parameters:

env – The environment for which to create the predictor factory.
sb_start_duration – Duration (in env steps) for symmetry breaking start.

Returns:

The predictor factory.
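The goal-based interpretation can be pictured with the dictionary observations used by goal-conditioned environments, which carry "achieved_goal" and "desired_goal" entries. The helper below is a hypothetical sketch of that mapping on plain lists; the way the default factory actually extracts state and context is not shown on this page.

```python
from typing import Dict, List, Tuple


def split_goal_observation(
    obs: Dict[str, List[float]]
) -> Tuple[List[float], List[float]]:
    """Map a goal-based observation to (state, context).

    Assumes the achieved goal plays the role of the planner-level state and
    the desired goal, constant during an episode, the role of the context.
    """
    state = list(obs["achieved_goal"])
    context = list(obs["desired_goal"])
    return state, context


obs = {
    "observation": [0.0, 0.1, 0.2, 0.3],
    "achieved_goal": [0.0, 0.1],
    "desired_goal": [1.0, 1.0],
}
state, context = split_goal_observation(obs)
```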