Levels

Levels are the building blocks of hierarchies in LayeredRL. Each level implements a particular control strategy like an RL or planning algorithm. To this end, it

Takes an environment observation and level input (from the level above) and produces a level output
Processes each (semi-MDP) transition, e.g. by calculating a reward and storing the transition in a replay buffer
Learns from this experience, e.g. by sampling a batch from the replay buffer and performing gradient descent on a loss

Available Level Types

Available level types:

Supported levels	Type
`TD-MPC2`	Model-based RL
`Tianshou`	Model-free RL
`DADS`	Skill (TD-MPC2 version)
`SPlaTES`	Skill
`Planner with (i)CEM`	Planner

TD-MPC2 Level

Model-based RL using a vectorized version of the TD-MPC2 algorithm.

from layeredrl.levels import TDMPC2Level
from layeredrl.tdmpc2 import get_default_tdmpc2_config

tdmpc2_config = get_default_tdmpc2_config()
tdmpc2_config["discount_factor"] = [0.95]
level = TDMPC2Level(tdmpc2_config=tdmpc2_config)

Setting goal_based=True will switch to a goal-based version of TD-MPC2 which treats the goal separately. This requires specifying which parts of the observation contain the desired and achieved goal. Using HER requires specifying a reward function. See TDMPC2Level for details.

Tianshou Level

Model-free RL using the Tianshou library. Hydra’s instantiate function is used to instantiate actor, critics etc. as specified in a config dictionary.

from layeredrl.levels import TianshouLevel

tianshou_config = {
    "n_critics": 2,
    "actor": {
        "_target_": "tianshou.utils.net.continuous.ActorProb",
        ...
    },
    "critic": {
        "_target_": "layeredrl.nets.Critic",
        ...
    },
    "optims": {
        ...
    },
    "policy": {
        "_target_": "tianshou.policy.SACPolicy",
        ...
    },
    "policy_dynamic_args": {
        "actor": "__actor__",
        "actor_optim": "__actor_optim__",
        "critic1": "__critic_0__",
        "critic1_optim": "__critic_optim_0__",
        "critic2": "__critic_1__",
        "critic2_optim": "__critic_optim_1__",
        "action_space": "__action_space__",
    },
}

level = TianshouLevel(tianshou_config=tianshou_config, buffer_size=10000)

Features:

Access to various Tianshou algorithms (SAC, PPO, TD3, etc.)
Efficient model-free learning

Note that the level was only tested with SAC and DQN so far. See TianshouLevel for details.

DADS Level

Diversity-driven skill discovery with the DADS algorithm. Note that this implementation is based on TD-MPC2 while the original implementation used SAC. For details see DADSLevel.

from layeredrl.levels import DADSLevel

dads_level = DADSLevel(
    skill_space_dim=...,  # dimension of skill vector
)

Features:

Learns diverse skills by maximizing mutual information between one-step transitions and skill vector
Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills

SPlaTES Level

Skill discovery with the SPlaTES algorithm. The discovered skills are

temporally extended,
predictable, and
diverse.

Whether the skills are useful for the task depends on the encoder of the predictor of the parent level, which defines the space in which the skills are trained to be predictable and diverse. It can either be fixed manually with domain knowledge (see examples/splates_hierarchy.py) or learned from a dense reward (see TODO).

from layeredrl.levels import SPlaTESLevel

splates_level = SPlaTESLevel(
    skill_space_dim=...,  # dimension of skill vector
)

Features:

Temporally extended, predictable and diverse skills
Skills enable long-horizon planning
Requires parent level to contain a predictor with an encoder that defines the space in which to diversify the skills

For details see SPlaTESLevel.

Planner Level

Learns a world model in a predictor, which enables it to pick good actions via running a planner. Note that the predictor models the transitions the planner level sees. These might be Semi-MDP transitions, i.e., they last several environment time steps, if the planner level is used as a higher level (not at the bottom of the hierarchy). At the moment, LayeredRL implements CEM and iCEM as zeroth-order optimization methods for planning.

The example code below uses a default predictor which is based on RewardPredictor and the ProbabilisticEnsemble model and assumes that the environment is goal-based. The encoder treats the desired goal as the context and the achieved goal as the state. The predictor can be customized to adapt to different environments and to tune hyperparameters, however.

from layeredrl.levels import PlannerLevel
from layeredrl.predictors import get_default_predictor_factory
from layeredrl.planners import CEMPlanner

predictor_factory = get_default_predictor_factory(env)
cem_params = {
    "n_samples": 256,
    "n_iterations": 6,
    "elite_ratio": 0.05,
    "momentum": momentum,
    "clip": clip,
    "return_mode": "mean",
}
planner_factory = partial(
    CEMPlanner,
    cem_params=cem_params,
)
planner_level = PlannerLevel(
    partial_planner=planner_factory,
    predictor_factory=predictor_factory,
    initial_guess=torch.zeros(...),
    horizon=horizon,
)

Features:

Model-based planning using a world model
Works with learned value/reward functions
Can alternate with noise for good coverage of action space during world model learning
Good for high-level strategic planning

Custom Levels

You can create custom levels by inheriting from the Level base class and implementing the required abstract methods. This incomplete example demonstrates what needs to be implemented for a class which generates actions based on some form of policy.

from typing import Dict, Optional, Tuple
import torch
from gymnasium.spaces import Space, Box
from layeredrl.levels import Level

class MyCustomLevel(Level):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Initialize your level-specific parameters
        self.policy = None  # Will be created in initialize()

    def initialize(
        self,
        env_obs_space: Space,
        action_space: Space,
        n_env_instances: int,
        parent_predictor,
        env_obs_map=None,
        mapped_env_obs_shape=None,
        keep_params=False,
    ):
        """Initialize the level after the hierarchy is constructed."""
        super().initialize(
            env_obs_space,
            action_space,
            n_env_instances,
            parent_predictor,
            env_obs_map,
            mapped_env_obs_shape,
            keep_params,
        )
        if not keep_params:
            # Create your policy/networks here
            self.policy = ...  # Initialize based on action_space, etc.

    def get_input_space(self) -> Space:
        """Define what input this level expects from the level above.

        Returns None if no input is needed (e.g., for top-level).
        For middle levels, return a Box or Discrete space.
        """
        return None  # or Box(...) for levels expecting input, e.g. goals or skill vectors

    def get_action(
        self,
        mapped_env_obs: torch.Tensor,
        level_input: Optional[torch.Tensor],
        level_input_info: Optional[Dict],
        active_instances: torch.Tensor,
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute action for active environment instances.

        Args:
            mapped_env_obs: Environment observation (batch_size, ...)
            level_input: Input from level above (batch_size, input_dim)
            level_input_info: Additional info about level input
            active_instances: Boolean mask of active instances

        Returns:
            action: Action tensor (num_active, action_dim)
            action_info: Dict with additional action information
        """
        # Get level state if needed
        level_state = self.get_level_state(
            mapped_env_obs, level_input, active_instances
        )
        # Compute action using your policy
        action = self.policy(mapped_env_obs[active_instances], level_state, ...)
        return action, {}

    def process_transition(
        self,
        mapped_env_obs: torch.Tensor,
        level_input: Optional[torch.Tensor],
        action: torch.Tensor,
        next_mapped_env_obs: torch.Tensor,
        terminated: torch.Tensor,
        truncated: torch.Tensor,
        active_instances: torch.Tensor,
    ) -> torch.Tensor:
        """Process completed transitions.

        Store transition in replay buffer, compute rewards, etc.

        Returns:
            done: Boolean tensor indicating which instances want to return control to the
                level above.
        """
        # Compute reward
        level_state = self.get_level_state(
            mapped_env_obs, level_input, active_instances
        )
        next_level_state = self.get_level_state(
            next_mapped_env_obs, level_input, active_instances
        )
        reward, _ = self.get_reward(
            mapped_env_obs,
            level_input,
            level_state,
            action,
            next_mapped_env_obs,
            next_level_state,
            terminated,
            self.cum_reward[active_instances],
            self.elapsed_env_steps[active_instances],
        )

        # Call parent to update counters and reset cum_reward
        super().process_transition(
            mapped_env_obs,
            level_input,
            action,
            next_mapped_env_obs,
            terminated,
            truncated,
            active_instances,
        )

        # Store transition in your buffer
        # self.buffer.add(...)

        # Return False to keep control, True to return to level above
        return torch.zeros(
            active_instances.shape, dtype=torch.bool, device=self.device
        )

    def learn(self):
        """Update the policy from collected experience."""
        # Sample from buffer and update policy
        # batch = self.buffer.sample(batch_size)
        # loss = ...
        # ...
        pass

Key methods to implement:

get_input_space(): Define the input this level expects
get_action(): Compute actions from observations
process_transition(): Handle completed transitions
learn(): Update the policy (optional, but necessary for trainable levels)

Optional methods you may want to override:

get_level_state(): Add recurrent state or other context
get_reward(): Define custom rewards (defaults to environment reward)
save()/load(): Save and load level parameters

If you want the custom level to be compatible with vector environments, make sure to respect the active_instances tensor which indicates for which of the environments the level is currently active.

API Reference

For detailed API documentation, see Levels API.