Levels
Levels are the building blocks of hierarchies in LayeredRL. Each level implements a particular control strategy, such as an RL or planning algorithm. To this end, it
Takes an environment observation and level input (from the level above) and produces a level output
Processes each (semi-MDP) transition, e.g. by calculating a reward and storing the transition in a replay buffer
Learns from this experience, e.g. by sampling a batch from the replay buffer and performing gradient descent on a loss
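The three responsibilities above can be illustrated with a toy, self-contained sketch. It does not use the LayeredRL API: `ToyLevel`, its one-parameter dynamics model, and `true_env_step` are hypothetical names introduced only for illustration.

```python
import random
from collections import deque


class ToyLevel:
    """Hypothetical stand-in for a level; NOT the LayeredRL `Level` API."""

    def __init__(self, buffer_size=1000, lr=0.01, seed=0):
        self.buffer = deque(maxlen=buffer_size)  # replay buffer
        self.w = 1.0  # learnable parameter of the model next_obs = obs + w * action
        self.lr = lr
        self.rng = random.Random(seed)

    def get_action(self, obs, level_input):
        # 1) Observation + level input (here: a target) -> level output.
        #    Invert the learned model and add a little exploration noise.
        return (level_input - obs) / self.w + self.rng.gauss(0.0, 0.1)

    def process_transition(self, obs, level_input, action, next_obs):
        # 2) Compute a reward and store the transition in the replay buffer.
        reward = -abs(level_input - next_obs)
        self.buffer.append((obs, action, next_obs, reward))
        return reward

    def learn(self, batch_size=8):
        # 3) Sample a batch and take one gradient step on the squared
        #    prediction error of the dynamics model.
        k = min(batch_size, len(self.buffer))
        batch = self.rng.sample(list(self.buffer), k)
        grad = sum(2 * (o + self.w * a - n) * a for o, a, n, _ in batch) / k
        self.w -= self.lr * grad


def true_env_step(obs, action):
    return obs + 0.5 * action  # toy environment, unknown to the level


level = ToyLevel()
env_rng = random.Random(1)
for _ in range(300):
    obs, target = env_rng.uniform(-1.0, 1.0), 1.0
    action = level.get_action(obs, target)
    next_obs = true_env_step(obs, action)
    level.process_transition(obs, target, action, next_obs)
    level.learn()

print(round(level.w, 2))  # learned model parameter
```

In this sketch the learned parameter `w` moves from its initial value towards the true coefficient of the toy environment as more transitions are collected.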
Available Level Types
Available level types:
| Supported levels | Type |
|---|---|
| TDMPC2Level | Model-based RL |
| TianshouLevel | Model-free RL |
| DADSLevel | Skill (TD-MPC2 version) |
| SPlaTESLevel | Skill |
| PlannerLevel | Planner |
TD-MPC2 Level
Model-based RL using a vectorized version of the TD-MPC2 algorithm.
from layeredrl.levels import TDMPC2Level
from layeredrl.tdmpc2 import get_default_tdmpc2_config
tdmpc2_config = get_default_tdmpc2_config()
tdmpc2_config["discount_factor"] = [0.95]
level = TDMPC2Level(tdmpc2_config=tdmpc2_config)
Setting goal_based=True will switch to a goal-based version of TD-MPC2 which treats the goal separately. This requires
specifying which parts of the observation contain the desired and achieved goal. Using HER requires specifying a reward
function. See TDMPC2Level for details.
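To see why hindsight relabeling needs both the goal locations in the observation and a reward function, consider the following generic sketch. It is not the `TDMPC2Level` interface: the observation layout, `ACHIEVED_GOAL`, `DESIRED_GOAL`, `sparse_reward`, and `relabel` are all hypothetical.

```python
import math

# Hypothetical layout of a flat observation:
# [state (2 values), achieved goal (2 values), desired goal (2 values)]
ACHIEVED_GOAL = slice(2, 4)
DESIRED_GOAL = slice(4, 6)


def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    """Return 0.0 if the goal is reached and -1.0 otherwise."""
    return 0.0 if math.dist(achieved_goal, desired_goal) <= threshold else -1.0


def relabel(obs, next_obs, new_goal):
    """HER-style relabeling: swap in a new desired goal and recompute the reward."""
    obs, next_obs = list(obs), list(next_obs)
    obs[DESIRED_GOAL] = new_goal
    next_obs[DESIRED_GOAL] = new_goal
    reward = sparse_reward(next_obs[ACHIEVED_GOAL], new_goal)
    return obs, next_obs, reward


obs = [0.0, 0.0, 0.1, 0.1, 1.0, 1.0]
next_obs = [0.1, 0.0, 0.2, 0.1, 1.0, 1.0]
original_reward = sparse_reward(next_obs[ACHIEVED_GOAL], next_obs[DESIRED_GOAL])

# Pretend the goal that was actually achieved had been the desired one.
_, relabeled_next_obs, relabeled_reward = relabel(obs, next_obs, next_obs[ACHIEVED_GOAL])
print(original_reward, relabeled_reward)
```

The stored reward belongs to the original goal, so after the desired goal is replaced the reward has to be recomputed, which is only possible if the reward function is available.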
Tianshou Level
Model-free RL using the Tianshou library. Hydra’s instantiate function is used to instantiate the actor, critics, etc., as specified in a config dictionary.
from layeredrl.levels import TianshouLevel
tianshou_config = {
    "n_critics": 2,
    "actor": {
        "_target_": "tianshou.utils.net.continuous.ActorProb",
        ...
    },
    "critic": {
        "_target_": "layeredrl.nets.Critic",
        ...
    },
    "optims": {
        ...
    },
    "policy": {
        "_target_": "tianshou.policy.SACPolicy",
        ...
    },
    "policy_dynamic_args": {
        "actor": "__actor__",
        "actor_optim": "__actor_optim__",
        "critic1": "__critic_0__",
        "critic1_optim": "__critic_optim_0__",
        "critic2": "__critic_1__",
        "critic2_optim": "__critic_optim_1__",
        "action_space": "__action_space__",
    },
}
level = TianshouLevel(tianshou_config=tianshou_config, buffer_size=10000)
Features:
Access to various Tianshou algorithms (SAC, PPO, TD3, etc.)
Efficient model-free learning
Note that the level was only tested with SAC and DQN so far. See TianshouLevel for details.
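The `_target_` convention used in the config can be illustrated without Hydra or Tianshou installed. The sketch below is a simplified stand-in for `hydra.utils.instantiate` (no recursion, no partials), and the `resolve` helper reflects an assumption about how placeholders such as `"__actor__"` are replaced by already constructed objects. It is not taken from the LayeredRL source, and the standard-library targets are chosen only so that the example runs anywhere.

```python
import importlib


def instantiate(config, **overrides):
    """Simplified stand-in for hydra.utils.instantiate."""
    kwargs = {k: v for k, v in config.items() if k != "_target_"}
    kwargs.update(overrides)
    # Split the dotted path into module and attribute, then call the target.
    module_name, _, attr = config["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**kwargs)


def resolve(dynamic_args, objects):
    """Assumed behavior: replace "__name__" placeholders by constructed objects."""
    return {key: objects[value.strip("_")] for key, value in dynamic_args.items()}


# Toy config: standard-library classes play the roles of actor and policy.
config = {
    "actor": {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4},
    "policy": {"_target_": "types.SimpleNamespace", "name": "toy_policy"},
    "policy_dynamic_args": {"actor": "__actor__"},
}

objects = {"actor": instantiate(config["actor"])}
policy = instantiate(
    config["policy"], **resolve(config["policy_dynamic_args"], objects)
)
print(policy)
```

Static arguments live next to `_target_`, while arguments that only exist at runtime (the actor, optimizers, the action space) are injected afterwards, which is presumably why they are listed separately under `policy_dynamic_args`.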
DADS Level
Diversity-driven skill discovery with the DADS algorithm. Note that this implementation
is based on TD-MPC2 while the original implementation used SAC. For details see DADSLevel.
from layeredrl.levels import DADSLevel
dads_level = DADSLevel(
    skill_space_dim=...,  # dimension of skill vector
)
Features:
Learns diverse skills by maximizing the mutual information between one-step transitions and the skill vector
Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills
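As a rough illustration of the mutual-information objective, the sketch below computes the intrinsic reward from the original DADS formulation, log q(s'|s,z) minus the log of the average of q(s'|s,z_i) over skills z_i drawn from the prior. The one-dimensional Gaussian skill-dynamics model and all function names are hypothetical, and the reward used inside `DADSLevel` may differ in detail.

```python
import math
import random


def log_prob(next_state, state, skill, sigma=0.1):
    """Log density of a toy skill-dynamics model: next_state ~ N(state + skill, sigma^2)."""
    diff = next_state - (state + skill)
    return -0.5 * (diff / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def dads_reward(state, skill, next_state, prior_skills):
    """log q(s'|s,z) - log((1/L) * sum_i q(s'|s,z_i)) with z_i from the skill prior."""
    log_q = log_prob(next_state, state, skill)
    log_qs = [log_prob(next_state, state, z) for z in prior_skills]
    m = max(log_qs)  # log-sum-exp trick for numerical stability
    log_marginal = m + math.log(sum(math.exp(lq - m) for lq in log_qs) / len(log_qs))
    return log_q - log_marginal


rng = random.Random(0)
prior_skills = [rng.uniform(-1.0, 1.0) for _ in range(100)]

# The transition matches what the skill predicts: high reward.
predictable = dads_reward(state=0.0, skill=0.5, next_state=0.5, prior_skills=prior_skills)
# The transition would be better explained by other skills: low reward.
unpredictable = dads_reward(state=0.0, skill=0.5, next_state=-0.5, prior_skills=prior_skills)
print(predictable, unpredictable)
```

The reward is high when the transition is well predicted by the active skill but poorly predicted by other skills, which encourages skills that are both predictable and distinguishable from one another.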
SPlaTES Level
Skill discovery with the SPlaTES algorithm. The discovered skills are
temporally extended,
predictable, and
diverse.
Whether the skills are useful for the task depends on the encoder of the predictor of the parent level, which defines the space in which the skills are trained to be predictable and diverse. It can either be fixed manually with domain knowledge (see examples/splates_hierarchy.py) or learned from a dense reward (see TODO).
from layeredrl.levels import SPlaTESLevel
splates_level = SPlaTESLevel(
    skill_space_dim=...,  # dimension of skill vector
)
Features:
Temporally extended, predictable and diverse skills
Skills enable long-horizon planning
Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills
For details see SPlaTESLevel.
Planner Level
The planner level learns a world model in a predictor and uses it to select actions by running a planner. Note that the predictor models the transitions the planner level sees. If the planner level is used as a higher level (not at the bottom of the hierarchy), these may be semi-MDP transitions, i.e., transitions that last several environment time steps. At the moment, LayeredRL implements CEM and iCEM as zeroth-order optimization methods for planning.
The example code below uses a default predictor, which is based on RewardPredictor
and the ProbabilisticEnsemble model and assumes that the environment is goal-based.
The encoder treats the desired goal as the context and the achieved goal as the state. The predictor can,
however, be customized to suit different environments and to tune hyperparameters.
from functools import partial

import torch

from layeredrl.levels import PlannerLevel
from layeredrl.predictors import get_default_predictor_factory
from layeredrl.planners import CEMPlanner

# `env`, `momentum`, `clip`, and `horizon` must be defined by the caller.
predictor_factory = get_default_predictor_factory(env)
cem_params = {
    "n_samples": 256,
    "n_iterations": 6,
    "elite_ratio": 0.05,
    "momentum": momentum,
    "clip": clip,
    "return_mode": "mean",
}
planner_factory = partial(
    CEMPlanner,
    cem_params=cem_params,
)
planner_level = PlannerLevel(
    partial_planner=planner_factory,
    predictor_factory=predictor_factory,
    initial_guess=torch.zeros(...),
    horizon=horizon,
)
Features:
Model-based planning using a world model
Works with learned value/reward functions
Can alternate between planned actions and noise to cover the action space well while the world model is being learned
Good for high-level strategic planning
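For readers unfamiliar with the cross-entropy method (CEM), here is a minimal, generic sketch on a toy problem with a known model. It is not LayeredRL's `CEMPlanner`: the argument names mirror the keys in `cem_params`, but their exact semantics in the library (in particular `momentum`) are assumptions, and `clip` and `return_mode` are left out.

```python
import random
import statistics


def cem_plan(cost_fn, horizon, n_samples=256, n_iterations=6,
             elite_ratio=0.05, momentum=0.1, seed=0):
    """Minimal cross-entropy method over sequences of scalar actions."""
    rng = random.Random(seed)
    mean = [0.0] * horizon  # initial guess
    std = [1.0] * horizon
    n_elites = max(2, int(n_samples * elite_ratio))
    for _ in range(n_iterations):
        # Sample candidate action sequences from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(n_samples)]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        elites = sorted(samples, key=cost_fn)[:n_elites]
        for t in range(horizon):
            column = [e[t] for e in elites]
            # Assumed meaning of momentum: keep part of the old distribution.
            mean[t] = momentum * mean[t] + (1 - momentum) * statistics.fmean(column)
            std[t] = momentum * std[t] + (1 - momentum) * statistics.pstdev(column)
    return mean


def cost(actions, start=0.0, goal=1.0):
    # Toy point-mass model: the state is the running sum of the actions.
    state, total = start, 0.0
    for a in actions:
        state += a
        total += (state - goal) ** 2 + 0.01 * a ** 2
    return total


plan = cem_plan(cost, horizon=3)
print([round(a, 2) for a in plan], round(cost(plan), 3))
```

CEM only needs to evaluate the cost of sampled action sequences, not its gradient, which is what makes it a zeroth-order method and lets it work with arbitrary learned models.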
Custom Levels
You can create custom levels by inheriting from the Level base class and implementing
the required abstract methods. This incomplete example demonstrates what needs to be implemented for a class
which generates actions based on some form of policy.
from typing import Dict, Optional, Tuple

import torch
from gymnasium.spaces import Space, Box

from layeredrl.levels import Level


class MyCustomLevel(Level):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Initialize your level-specific parameters
        self.policy = None  # Will be created in initialize()

    def initialize(
        self,
        env_obs_space: Space,
        action_space: Space,
        n_env_instances: int,
        parent_predictor,
        env_obs_map=None,
        mapped_env_obs_shape=None,
        keep_params=False,
    ):
        """Initialize the level after the hierarchy is constructed."""
        super().initialize(
            env_obs_space,
            action_space,
            n_env_instances,
            parent_predictor,
            env_obs_map,
            mapped_env_obs_shape,
            keep_params,
        )
        if not keep_params:
            # Create your policy/networks here
            self.policy = ...  # Initialize based on action_space, etc.

    def get_input_space(self) -> Optional[Space]:
        """Define what input this level expects from the level above.

        Returns None if no input is needed (e.g., for top-level).
        For middle levels, return a Box or Discrete space.
        """
        return None  # or Box(...) for levels expecting input, e.g. goals or skill vectors

    def get_action(
        self,
        mapped_env_obs: torch.Tensor,
        level_input: Optional[torch.Tensor],
        level_input_info: Optional[Dict],
        active_instances: torch.Tensor,
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute action for active environment instances.

        Args:
            mapped_env_obs: Environment observation (batch_size, ...)
            level_input: Input from level above (batch_size, input_dim)
            level_input_info: Additional info about level input
            active_instances: Boolean mask of active instances

        Returns:
            action: Action tensor (num_active, action_dim)
            action_info: Dict with additional action information
        """
        # Get level state if needed
        level_state = self.get_level_state(
            mapped_env_obs, level_input, active_instances
        )
        # Compute action using your policy
        action = self.policy(mapped_env_obs[active_instances], level_state, ...)
        return action, {}

    def process_transition(
        self,
        mapped_env_obs: torch.Tensor,
        level_input: Optional[torch.Tensor],
        action: torch.Tensor,
        next_mapped_env_obs: torch.Tensor,
        terminated: torch.Tensor,
        truncated: torch.Tensor,
        active_instances: torch.Tensor,
    ) -> torch.Tensor:
        """Process completed transitions.

        Store transition in replay buffer, compute rewards, etc.

        Returns:
            done: Boolean tensor indicating which instances want to return control to the
                level above.
        """
        # Compute reward
        level_state = self.get_level_state(
            mapped_env_obs, level_input, active_instances
        )
        next_level_state = self.get_level_state(
            next_mapped_env_obs, level_input, active_instances
        )
        reward, _ = self.get_reward(
            mapped_env_obs,
            level_input,
            level_state,
            action,
            next_mapped_env_obs,
            next_level_state,
            terminated,
            self.cum_reward[active_instances],
            self.elapsed_env_steps[active_instances],
        )
        # Call parent to update counters and reset cum_reward
        super().process_transition(
            mapped_env_obs,
            level_input,
            action,
            next_mapped_env_obs,
            terminated,
            truncated,
            active_instances,
        )
        # Store transition in your buffer
        # self.buffer.add(...)
        # Return False to keep control, True to return to level above
        return torch.zeros(
            active_instances.shape, dtype=torch.bool, device=self.device
        )

    def learn(self):
        """Update the policy from collected experience."""
        # Sample from buffer and update policy
        # batch = self.buffer.sample(batch_size)
        # loss = ...
        # ...
        pass
Key methods to implement:
get_input_space(): Define the input this level expects
get_action(): Compute actions from observations
process_transition(): Handle completed transitions
learn(): Update the policy (optional, but necessary for trainable levels)
Optional methods you may want to override:
get_level_state(): Add recurrent state or other context
get_reward(): Define custom rewards (defaults to environment reward)
If you want the custom level to be compatible with vector environments, make sure to respect the active_instances
tensor which indicates for which of the environments the level is currently active.
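The idea behind the active_instances mask can be shown with plain Python lists. This is only a sketch with hypothetical helper names (`act_on_active`, `scatter`); with torch tensors the same selection is written as `mapped_env_obs[active_instances]`, as in the example above.

```python
def act_on_active(obs_batch, active_instances, policy):
    """Compute actions only for the instances where the level is active."""
    return [policy(obs) for obs, active in zip(obs_batch, active_instances) if active]


def scatter(values, active_instances, fill=None):
    """Write per-active-instance values back into a full-batch list."""
    it = iter(values)
    return [next(it) if active else fill for active in active_instances]


obs_batch = [0.0, 1.0, 2.0, 3.0]  # one observation per environment instance
active_instances = [True, False, True, False]

actions = act_on_active(obs_batch, active_instances, policy=lambda o: o + 10.0)
full_batch_actions = scatter(actions, active_instances)
print(actions, full_batch_actions)
```

Results computed for the active subset have one entry per active instance, so they must be scattered back with the same mask before being combined with full-batch quantities.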
API Reference
For detailed API documentation, see Levels API.