Levels ====== Levels are the building blocks of hierarchies in LayeredRL. Each level implements a particular control strategy like an RL or planning algorithm. To this end, it * Takes an environment observation and level input (from the level above) and produces a level output * Processes each (semi-MDP) transition, e.g. by calculating a reward and storing the transition in a replay buffer * Learns from this experience, e.g. by sampling a batch from the replay buffer and performing gradient descent on a loss Available Level Types --------------------- Available level types: +------------------------------------------------------------------+------------------------+ | Supported levels | Type | +==================================================================+========================+ | :class:`TD-MPC2 ` | Model-based RL | +------------------------------------------------------------------+------------------------+ | :class:`Tianshou ` | Model-free RL | +------------------------------------------------------------------+------------------------+ | :class:`DADS ` | Skill (TD-MPC2 version)| +------------------------------------------------------------------+------------------------+ | :class:`SPlaTES ` | Skill | +------------------------------------------------------------------+------------------------+ | :class:`Planner with (i)CEM ` | Planner | +------------------------------------------------------------------+------------------------+ TD-MPC2 Level ^^^^^^^^^^^^^ Model-based RL using a vectorized version of the `TD-MPC2 algorithm `__. .. code-block:: python from layeredrl.levels import TDMPC2Level from layeredrl.tdmpc2 import get_default_tdmpc2_config tdmpc2_config = get_default_tdmpc2_config() tdmpc2_config["discount_factor"] = [0.95] level = TDMPC2Level(tdmpc2_config=tdmpc2_config) Setting ``goal_based=True`` will switch to a goal-based version of TD-MPC2 which treats the goal separately. This requires specifying which parts of the observation contain the desired and achieved goal. Using HER requires specifying a reward function. See :class:`~layeredrl.levels.TDMPC2Level` for details. Tianshou Level ^^^^^^^^^^^^^^ Model-free RL using the `Tianshou library `__. Hydra's instantiate function is used to instantiate actor, critics etc. as specified in a config dictionary. .. code-block:: python from layeredrl.levels import TianshouLevel tianshou_config = { "n_critics": 2, "actor": { "_target_": "tianshou.utils.net.continuous.ActorProb", ... }, "critic": { "_target_": "layeredrl.nets.Critic", ... }, "optims": { ... }, "policy": { "_target_": "tianshou.policy.SACPolicy", ... }, "policy_dynamic_args": { "actor": "__actor__", "actor_optim": "__actor_optim__", "critic1": "__critic_0__", "critic1_optim": "__critic_optim_0__", "critic2": "__critic_1__", "critic2_optim": "__critic_optim_1__", "action_space": "__action_space__", }, } level = TianshouLevel(tianshou_config=tianshou_config, buffer_size=10000) **Features:** * Access to various Tianshou algorithms (SAC, PPO, TD3, etc.) * Efficient model-free learning Note that the level was only tested with SAC and DQN so far. See :class:`~layeredrl.levels.TianshouLevel` for details. DADS Level ^^^^^^^^^^ Diversity-driven skill discovery with the `DADS algorithm `__. Note that this implementation is based on TD-MPC2 while the original implementation used SAC. For details see :class:`~layeredrl.levels.DADSLevel`. .. code-block:: python from layeredrl.levels import DADSLevel dads_level = DADSLevel( skill_space_dim=..., # dimension of skill vector ) **Features:** * Learns diverse skills by maximizing mutual information between one-step transitions and skill vector * Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills SPlaTES Level ^^^^^^^^^^^^^ Skill discovery with the `SPlaTES algorithm `__. The discovered skills are * temporally extended, * predictable, and * diverse. Whether the skills are useful for the task depends on the encoder of the predictor of the parent level, which defines the space in which the skills are trained to be predictable and diverse. It can either be fixed manually with domain knowledge (see `examples/splates_hierarchy.py`) or learned from a dense reward (see TODO). .. code-block:: python from layeredrl.levels import SPlaTESLevel splates_level = SPlaTESLevel( skill_space_dim=..., # dimension of skill vector ) **Features:** * Temporally extended, predictable and diverse skills * Skills enable long-horizon planning * Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills For details see :class:`~layeredrl.levels.SPlaTESLevel`. Planner Level ^^^^^^^^^^^^^ Learns a world model in a predictor, which enables it to pick good actions via running a planner. Note that the predictor models the transitions the planner level sees. These might be Semi-MDP transitions, i.e., they last several environment time steps, if the planner level is used as a higher level (not at the bottom of the hierarchy). At the moment, LayeredRL implements CEM and iCEM as zeroth-order optimization methods for planning. The example code below uses a default predictor which is based on :class:`~layeredrl.predictors.RewardPredictor` and the :class:`~layeredrl.models.ProbabilisticEnsemble` model and assumes that the environment is goal-based. The encoder treats the desired goal as the context and the achieved goal as the state. The predictor can be customized to adapt to different environments and to tune hyperparameters, however. .. code-block:: python from layeredrl.levels import PlannerLevel from layeredrl.predictors import get_default_predictor_factory from layeredrl.planners import CEMPlanner predictor_factory = get_default_predictor_factory(env) cem_params = { "n_samples": 256, "n_iterations": 6, "elite_ratio": 0.05, "momentum": momentum, "clip": clip, "return_mode": "mean", } planner_factory = partial( CEMPlanner, cem_params=cem_params, ) planner_level = PlannerLevel( partial_planner=planner_factory, predictor_factory=predictor_factory, initial_guess=torch.zeros(...), horizon=horizon, ) **Features:** * Model-based planning using a world model * Works with learned value/reward functions * Can alternate with noise for good coverage of action space during world model learning * Good for high-level strategic planning Custom Levels ------------- You can create custom levels by inheriting from the :class:`~layeredrl.levels.Level` base class and implementing the required abstract methods. This incomplete example demonstrates what needs to be implemented for a class which generates actions based on some form of policy. .. code-block:: python from typing import Dict, Optional, Tuple import torch from gymnasium.spaces import Space, Box from layeredrl.levels import Level class MyCustomLevel(Level): def __init__(self, **kwargs): super().__init__(**kwargs) # Initialize your level-specific parameters self.policy = None # Will be created in initialize() def initialize( self, env_obs_space: Space, action_space: Space, n_env_instances: int, parent_predictor, env_obs_map=None, mapped_env_obs_shape=None, keep_params=False, ): """Initialize the level after the hierarchy is constructed.""" super().initialize( env_obs_space, action_space, n_env_instances, parent_predictor, env_obs_map, mapped_env_obs_shape, keep_params, ) if not keep_params: # Create your policy/networks here self.policy = ... # Initialize based on action_space, etc. def get_input_space(self) -> Space: """Define what input this level expects from the level above. Returns None if no input is needed (e.g., for top-level). For middle levels, return a Box or Discrete space. """ return None # or Box(...) for levels expecting input, e.g. goals or skill vectors def get_action( self, mapped_env_obs: torch.Tensor, level_input: Optional[torch.Tensor], level_input_info: Optional[Dict], active_instances: torch.Tensor, ) -> Tuple[torch.Tensor, Dict]: """Compute action for active environment instances. Args: mapped_env_obs: Environment observation (batch_size, ...) level_input: Input from level above (batch_size, input_dim) level_input_info: Additional info about level input active_instances: Boolean mask of active instances Returns: action: Action tensor (num_active, action_dim) action_info: Dict with additional action information """ # Get level state if needed level_state = self.get_level_state( mapped_env_obs, level_input, active_instances ) # Compute action using your policy action = self.policy(mapped_env_obs[active_instances], level_state, ...) return action, {} def process_transition( self, mapped_env_obs: torch.Tensor, level_input: Optional[torch.Tensor], action: torch.Tensor, next_mapped_env_obs: torch.Tensor, terminated: torch.Tensor, truncated: torch.Tensor, active_instances: torch.Tensor, ) -> torch.Tensor: """Process completed transitions. Store transition in replay buffer, compute rewards, etc. Returns: done: Boolean tensor indicating which instances want to return control to the level above. """ # Compute reward level_state = self.get_level_state( mapped_env_obs, level_input, active_instances ) next_level_state = self.get_level_state( next_mapped_env_obs, level_input, active_instances ) reward, _ = self.get_reward( mapped_env_obs, level_input, level_state, action, next_mapped_env_obs, next_level_state, terminated, self.cum_reward[active_instances], self.elapsed_env_steps[active_instances], ) # Call parent to update counters and reset cum_reward super().process_transition( mapped_env_obs, level_input, action, next_mapped_env_obs, terminated, truncated, active_instances, ) # Store transition in your buffer # self.buffer.add(...) # Return False to keep control, True to return to level above return torch.zeros( active_instances.shape, dtype=torch.bool, device=self.device ) def learn(self): """Update the policy from collected experience.""" # Sample from buffer and update policy # batch = self.buffer.sample(batch_size) # loss = ... # ... pass **Key methods to implement:** * :meth:`~layeredrl.levels.Level.get_input_space`: Define the input this level expects * :meth:`~layeredrl.levels.Level.get_action`: Compute actions from observations * :meth:`~layeredrl.levels.Level.process_transition`: Handle completed transitions * :meth:`~layeredrl.levels.Level.learn`: Update the policy (optional, but necessary for trainable levels) **Optional methods you may want to override:** * :meth:`~layeredrl.levels.Level.get_level_state`: Add recurrent state or other context * :meth:`~layeredrl.levels.Level.get_reward`: Define custom rewards (defaults to environment reward) * :meth:`~layeredrl.levels.Level.save`/:meth:`~layeredrl.levels.Level.load`: Save and load level parameters If you want the custom level to be compatible with vector environments, make sure to respect the ``active_instances`` tensor which indicates for which of the environments the level is currently active. API Reference ------------- For detailed API documentation, see :doc:`../api/levels`.