Hierarchies

The Hierarchy class is the core orchestrator in LayeredRL that manages multiple levels and coordinates their interactions. It is defined by a list of levels \([\text{level}_0, ..., \text{level}_{n - 1}]\) where the highest level comes first and the lowest last. The output of level \(i\) is the input of level \(i+1\) (for example a skill vector or subgoal).

For simplicity, assume that there is only one environment instance for now. The hierarchy then has to keep track of which level is active with an integer \(i \in [0,\ldots, n-1]\). Initially, the highest, most abstract level is active, i.e., \(i=0\). When the hierarchy is queried for a primitive action, a forward pass descends the hierarchy starting with the active level \(i\). For this level, \(\text{level}_i.\)get_action() is executed resulting in some (abstract) action \(a_{i}\). If there is a level below the active one (\(i < n - 1\)), then \(i\) is incremented, \(i = i+1\), and the procedure starts again with executing \(\text{level}_i.\)get_action(), this time with \(a_{i-1}\) as an input (in addition to the environment observation). \(a_{i-1}\) has to be saved until control returns to level \(i-1\). This process continues until \(i=n-1\), i.e., the lowest level of the hierarchy is reached, which produces a primitive action which can be passed to the environment.

After a primitive action was passed to the environment and it transitioned to the next time step, a backward pass is executed that starts on the lowest level \(i=n-1\). This level now processes the transition, for example by adding it to a replay buffer. It can then decide whether to return control to the higher level or not, which determines whether the active level moves up one level or stays the same. When a level decides not to return control or the highest level is reached, the backward pass is complete.

The diagram below illustrates control flow in the hierarchy with the example of a two-level hierarchy. The higher level might pick a subgoal or skill vector initially, which is then pursued by the lower level until some condition is met, e.g. achievement of the subgoal or a timeout. Then control is returned (in the backward pass) to the higher level which picks a new abstract action. Hence, the higher a level, the lower the frequency with which it produces actions. The lowest level has to generate a primitive action for the environment every timestep, however.

Since the backward pass reaches higher levels less frequently, the transitions they observe stretch over longer time intervals. Formally, they interact with a Semi-MDP consisting of the environment and the part of the hierarchy which lies below the level (see diagram below). The resulting temporal abstraction can facilitate long-term credit assignment.


LayeredRL hierarchy structure

In general, there can be several environment instances in a vectorized environment, i.e., all actions and observations have an additional dimension. Hence, the bookkeeping in the hierarchy operates on vectors as well, so instead of a single active instance \(i\), the hierarchy keeps track of a vector of active instances \((i_0, \ldots, i_{m-1})\) where \(m\) refers to the number of environment instances.

Creating a Hierarchy

Basic Structure

from layeredrl.hierarchies import Hierarchy

hierarchy = Hierarchy(
    levels=[level0, level1, level2],
    env=env,
)

The levels are ordered from highest (most abstract) to lowest (most concrete).

Observation Maps

In some cases, it can make sense to manipulate environment observations before sending them to levels, for example to hide irrelevant details from higher levels. This can be achieved by specifying environment observation maps when creating the hierarchy:

from layeredrl.hierarchies import Hierarchy

def hide_something(obs):
    return obs[..., 2:]

hierarchy = Hierarchy(
    levels=[level0, level1],
    env=env,
    env_obs_maps=[hide_something, None],
    mapped_env_obs_shapes=[-2, None],
)

Setting the map for a level to None keeps the environment observation unchanged on this level. A negative entry in mapped_env_obs_shapes indicates that this many dimensions are missing from the mapped environment observation on this level compared to the full environment observation.

Data Collection and Training

The most convenient way to train a hierarchy is with a Collector object:

hierarchy.train()
collector = Collector(hierarchy=hierarchy, env=env)
collector.reset()
stats = collector.collect(n_steps=..., learn=True)

However, it is straightforward to set up a manual training loop:

hierarchy.train()
obs, info = env.reset(seed=seed)

for _ in range(n_vec_steps):
    # Get primitive action from hierarchy
    with torch.no_grad():
        action = hierarchy.get_action(obs)
    action = action.cpu().numpy()

    # Step environment
    obs_next, reward, terminated, truncated, info = env.step(action)

    # Handle resets
    if "_final_observation" in info:
        obs_next_transition = obs_next.copy()
        for i, env_final_obs in enumerate(info["final_observation"]):
            if env_final_obs is not None:
                obs_next_transition[i] = env_final_obs
    else:
        obs_next_transition = obs_next

    # Process transition
    hierarchy.process_transition(
        obs_next=obs_next_transition,
        rew=reward,
        terminated=terminated,
        truncated=truncated,
    )

    # Learn
    hierarchy.learn()

API Reference

For detailed API documentation, see Hierarchies API.