SIDM in Practice: A Distill-Style Guide from Paper to Code

A reader-friendly, implementation-grounded introduction to the SIDM framework from the original paper to the public code.

TL;DR

SIDM (Structural Information-based Decision Making) is proposed in the paper “Hierarchical Decision Making Based on Structural Information Principles” (paper link).

The paper presents a unified abstraction idea; the code implements this in separate tracks:

SISA (state abstraction) built on RAD/CURL + SAC (RAD, CURL).
SISL (skill learning) built on a ReSkill-style hierarchical pipeline (ReSkill reference).
SIRD (role discovery; not covered in depth in this post).

Core takeaway:

Paper: one conceptual SI framework.
Code: baseline-integrated engineering variants.

1) What the original paper proposes

The SIDM paper (arXiv HTML) proposes using Structural Information (SI) / structural entropy as a general principle to:

extract compact, meaningful abstraction structures,
model transition structure between abstractions,
improve hierarchical decision learning.

In plain terms, SIDM tries to answer:

“Can we make RL easier by learning better structure first, then learning control on top of that structure?”

1.1 Method-section description (paper view, simplified)

To keep this blog implementation-focused, we summarize the paper’s method section at the workflow level and treat the structural-entropy math blocks as a graph clustering method.

From that lens, the method section can be read as:

1. Build a graph from trajectories
   - nodes represent states (or abstract units),
   - edges represent transition relationships (with empirical weights/probabilities).
2. Run graph clustering / hierarchy extraction
   - use the paper’s SI machinery to partition/merge nodes,
   - form higher-level abstract states (communities) and their transition structure.
3. Use abstract structure to define higher-level decisions
   - in state abstraction settings: regularize representation learning with abstract transition/action/reward structure,
   - in skill settings: use clustered transition structure to support skill-level decision variables and context.
4. Train policy/policies on top of abstracted structure
   - low-level control still optimizes RL objectives,
   - high-level abstractions guide what latent features, contexts, or options the controller uses.
5. Iterate
   - new data updates graph structure estimates,
   - policy learning and abstraction learning co-evolve.

This “graph clustering -> abstraction -> policy optimization” view is the most practical bridge between paper method and repository implementation.

1.2 Key theory & formulas (paper ideas, intuitive version)

Below are the main theory ingredients you need to read the paper quickly, with simple interpretations.

(a) RL objective (control goal)

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T}\gamma^t r_t\right] $$

Meaning:

maximize expected discounted return,
all SIDM abstractions are in service of improving this base control objective.
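
To make the objective concrete, here is a minimal sketch of a Monte Carlo estimate of $J(\pi)$ from sampled reward sequences. The function names `discounted_return` and `estimate_objective` are illustrative and do not come from the SIDM repository.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one trajectory, with t starting at 0."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(reward_trajectories, gamma=0.99):
    """Monte Carlo estimate of J(pi): mean discounted return over rollouts."""
    returns = [discounted_return(tr, gamma) for tr in reward_trajectories]
    return sum(returns) / len(returns)


# Two toy rollouts: returns are 1.125 and 0.5 with gamma = 0.5.
rollouts = [[1.0, 0.0, 0.5], [0.0, 0.0, 2.0]]
print(estimate_objective(rollouts, gamma=0.5))  # 0.8125
```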

(b) Empirical transition graph from trajectories

Let $N_{ij}$ be how many times we observe transition $i \to j$ in data.

$$ \hat{P}(j \mid i) = \frac{N_{ij}}{\sum_k N_{ik}} $$

Meaning:

build a directed graph from trajectory counts,
normalize outgoing counts to get transition probabilities.

This is the practical bridge from raw trajectories to graph-structured abstraction.
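
The count-then-normalize step can be sketched in a few lines. This is an illustration under the assumption that states are already discrete, hashable labels; the helper names are hypothetical and not taken from the repository.

```python
from collections import Counter, defaultdict


def transition_counts(trajectories):
    """N_ij: how often state i is immediately followed by state j."""
    counts = Counter()
    for states in trajectories:
        for i, j in zip(states, states[1:]):
            counts[(i, j)] += 1
    return counts


def normalize_outgoing(counts):
    """P_hat(j | i) = N_ij / sum_k N_ik, stored per directed edge (i, j)."""
    row_totals = defaultdict(float)
    for (i, _), n in counts.items():
        row_totals[i] += n
    return {(i, j): n / row_totals[i] for (i, j), n in counts.items()}


trajs = [["s0", "s1", "s2"], ["s0", "s1", "s1"], ["s0", "s2"]]
counts = transition_counts(trajs)
p_hat = normalize_outgoing(counts)
# From s0: two transitions to s1 and one to s2, so the weights are 2/3 and 1/3.
```

States with no observed outgoing transition simply have no row here; how terminal states are handled in practice is left open.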

(c) Graph clustering / abstraction mapping

Define an assignment $z_i \in \{1,\dots,K\}$ that maps original state node $i$ to abstract cluster/community $z_i$.

$$ \phi(i) = z_i $$

Meaning:

$\phi$ is the abstraction map,
SIDM methods differ mainly in how they estimate/refine this map from graph structure.
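
One way to see how a candidate map $\phi$ can be scored from graph structure is the standard two-dimensional structural entropy of an undirected weighted graph under a partition. The sketch below assumes that textbook form; whether the paper or repository uses exactly this variant (directed graphs, deeper encoding trees) is not established here, and the function name is illustrative.

```python
import math
from collections import defaultdict


def structural_entropy_2d(edges, partition):
    """Two-dimensional structural entropy of an undirected weighted graph.

    edges: {(u, v): weight}, each undirected edge listed once.
    partition: {node: module_id}, playing the role of phi.
    """
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    vol_g = sum(degree.values())

    module_vol = defaultdict(float)  # sum of degrees inside each module
    for node, d in degree.items():
        module_vol[partition[node]] += d

    cut = defaultdict(float)  # weight of edges leaving each module
    for (u, v), w in edges.items():
        if partition[u] != partition[v]:
            cut[partition[u]] += w
            cut[partition[v]] += w

    entropy = 0.0
    for node, d in degree.items():
        entropy -= (d / vol_g) * math.log2(d / module_vol[partition[node]])
    for module, vol in module_vol.items():
        entropy -= (cut[module] / vol_g) * math.log2(vol / vol_g)
    return entropy


# Two triangles joined by a single bridge edge (2, 3).
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0, (2, 3): 1.0}
good = structural_entropy_2d(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
bad = structural_entropy_2d(edges, {0: "a", 3: "a", 1: "b", 4: "b", 2: "c", 5: "c"})
```

Here the partition that matches the two triangles yields a lower entropy than the one that mixes them, which is the signal a clustering procedure would minimize.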

(d) Aggregate transition between abstract states

For abstract groups $a,b$, aggregate transition mass from members:

$$ \hat{P}_{ab} \propto \sum_{i:\phi(i)=a}\sum_{j:\phi(j)=b} \hat{P}(j\mid i) $$

Meaning:

collapse a large fine-grained graph into a smaller abstract graph,
this gives a higher-level dynamics view used for skills/contexts/losses.
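
The aggregation can be sketched as follows. The formula only fixes $\hat{P}_{ab}$ up to proportionality, so this example resolves it by normalizing each source cluster's outgoing mass to 1; that choice and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_transitions(p_hat, phi):
    """Collapse state-level P_hat(j | i) into cluster-level P_hat_ab.

    p_hat: {(i, j): probability}; phi: {state: cluster}.
    """
    mass = defaultdict(float)
    for (i, j), p in p_hat.items():
        mass[(phi[i], phi[j])] += p  # sum over members of a and b
    row_totals = defaultdict(float)
    for (a, _), m in mass.items():
        row_totals[a] += m
    # Assumed normalization: outgoing mass of each source cluster sums to 1.
    return {(a, b): m / row_totals[a] for (a, b), m in mass.items()}


p_hat = {("s0", "s1"): 0.5, ("s0", "s2"): 0.5, ("s1", "s2"): 1.0, ("s2", "s0"): 1.0}
phi = {"s0": "A", "s1": "A", "s2": "B"}
p_abs = aggregate_transitions(p_hat, phi)
# Cluster A holds s0 and s1: raw mass is 0.5 for A->A and 1.5 for A->B,
# so after normalization A->A is 0.25 and A->B is 0.75.
```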

(e) Objective decomposition intuition

Paper-level intuition can be summarized as:

$$ \text{Total objective} \approx \text{RL control loss} + \lambda \cdot \text{structure/abstraction regularization} $$

Meaning:

keep solving RL,
but bias representation/policy learning with graph-structure signals.
In code terms: this appears as extra SI losses (SISA) or SI-conditioned context channels (SISL).

(f) Why this helps (intuitive)

If two states have similar roles in transition structure, graph clustering maps them to similar abstractions. Then policy learning sees a simpler, more stable decision space:

less sensitivity to noisy local variation,
better reuse across similar situations,
easier hierarchical control.

2) One framework, two concrete code paths

Repository: SIDM codebase.

In practice, the implementation splits by baseline/task ecosystem:

SISA: SI as a representation-shaping branch inside a RAD/CURL-style SAC training loop.
SISL: SI as a context/goal abstraction module injected into a ReSkill-style hierarchy (SkillVAE + prior + PPO).

3) Method pipeline figure (high-level)

                +---------------------------------------------+
                |         SIDM Conceptual Pipeline            |
                +---------------------------------------------+
                    data/trajectories/states/actions
                                |
                                v
                  +------------------------------+
                  |   [SI] Structure Extraction  |
                  |   (graphs, partitions, trees)|
                  +------------------------------+
                                |
                                v
                  +------------------------------+
                  | [SI] Abstract Representation |
                  |   (goal/context/relations)   |
                  +------------------------------+
                                |
                                v
                  +------------------------------+
                  |  Baseline RL Backbone        |
                  |  (SAC or Hierarchical PPO)   |
                  +------------------------------+
                                |
                                v
                            policy update

[SI] marks components proposed/added by SIDM-style integration.

4) SISA track (RAD/CURL + SI) in depth

4.1 What SISA is essentially doing

Essence: SISA is model-free SAC control + SI-driven latent shaping. It does not replace the control algorithm; it regularizes the representation with structure-aware losses.

4.2 SISA pipeline figure

obs -> encoder -> actor/critic (SAC) ---------------------> action
        |
        +--> [SI] pretrain loss (inverse/contrastive/smoothness)
        +--> [SI] finetune loss (clustering KL)
        +--> [SI] abstract loss (transition/action/reward graphs)

4.3 SISA phase schedule

base SI-pretrain updates run frequently,
periodic SI updates run less often, using the finetune loss early in training and the abstract loss later.

This is a staged multi-objective curriculum over shared encoder parameters.
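
A schedule of this shape can be expressed as a small step-indexed function. The sketch below is hypothetical: the parameter names `pretrain_every`, `si_every`, and `switch_step`, and their default values, are placeholders chosen for illustration and are not read from the repository.

```python
def si_phases(step, pretrain_every=1, si_every=100, switch_step=50_000):
    """Return which SI losses would run at a given training step.

    All three parameters are assumed placeholders, not repository settings.
    """
    phases = []
    if step % pretrain_every == 0:
        phases.append("pretrain")  # frequent base update
    if step % si_every == 0:
        # Periodic update: finetune early, abstract later.
        phases.append("finetune" if step < switch_step else "abstract")
    return phases


print(si_phases(100))     # ['pretrain', 'finetune']
print(si_phases(101))     # ['pretrain']
print(si_phases(50_000))  # ['pretrain', 'abstract']
```

Because every phase updates the same encoder, the schedule decides which gradient signals the shared parameters see at each step.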

4.4 Graph building in SISA (with concrete mini-example)

Each abstract_sisa call builds partition-level graphs from the current minibatch pairings.

Suppose a minibatch yields 4 transition pairs after partition mapping:

Pair1: P0 -> P1, action=0.2, reward=1.0
Pair2: P0 -> P2, action=0.4, reward=0.0
Pair3: P1 -> P2, action=0.1, reward=0.5
Pair4: P0 -> P1, action=0.3, reward=-0.2

Then:

Transition graph counts
- P0->P1: 2, P0->P2: 1, P1->P2: 1
- source-normalized: from P0, weights become 2/3 (P0->P1) and 1/3 (P0->P2); from P1, the single edge P1->P2 gets weight 1.
Action graph weights
- edge weights are derived from the per-pair action values (the exact projection is implementation-specific).
Reward graph weights
- only positive rewards are retained (1.0 on P0->P1, 0.5 on P1->P2); zero and negative rewards (0.0, -0.2) are ignored.

These graphs are rebuilt every call (batch-local, online estimates).
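
The mini-example above can be reproduced with a short sketch. It is not the repository's `abstract_sisa` code: the function name is hypothetical, and averaging the action values per edge is an assumed stand-in for the implementation-specific projection.

```python
from collections import defaultdict

# (source partition, target partition, action, reward) for the four pairs.
pairs = [
    ("P0", "P1", 0.2, 1.0),
    ("P0", "P2", 0.4, 0.0),
    ("P1", "P2", 0.1, 0.5),
    ("P0", "P1", 0.3, -0.2),
]


def build_partition_graphs(pairs):
    """Build batch-local transition, action, and reward graphs."""
    counts = defaultdict(int)
    action_sum = defaultdict(float)
    reward_sum = defaultdict(float)
    for src, dst, action, reward in pairs:
        edge = (src, dst)
        counts[edge] += 1
        action_sum[edge] += action
        if reward > 0:  # keep positive rewards only
            reward_sum[edge] += reward
    out_total = defaultdict(int)
    for (src, _), n in counts.items():
        out_total[src] += n
    transition = {e: n / out_total[e[0]] for e, n in counts.items()}
    # Assumption: mean action per edge stands in for the real projection.
    action = {e: action_sum[e] / counts[e] for e in counts}
    return dict(counts), transition, action, dict(reward_sum)


counts, transition, action, reward = build_partition_graphs(pairs)
# transition[("P0", "P1")] is 2/3 and transition[("P0", "P2")] is 1/3.
# reward keeps only ("P0", "P1"): 1.0 and ("P1", "P2"): 0.5.
```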

5) SISL track (ReSkill-style + SI) in depth

5.1 What SISL is essentially doing

Essence: SISL is hierarchical skill RL with SI-augmented context features. It mostly keeps the ReSkill control architecture, but changes what information the modules consume.

5.2 SISL pipeline figure

Stage A: demos ----------------------------------------------+
                                                              |
Stage B: Skill training                                       v
obs/actions -> SkillVAE + skill prior <- [SI] goal/context (from graphs)
                                                              |
Stage C: Hierarchical RL                                      v
high-level policy ----> latent skill ----> decoder action ----+--> env
                           ^                    |
                           |                    +--> residual policy refinement
                           +------ [SI] goal/context conditioning

[SI] marks SIDM-added abstraction path.

5.3 Graph building in SISL (with concrete mini-example)

SISL graph construction has two layers:

Undirected similarity graph over flattened observation vectors.
Directed community transition graph over adjacent timesteps.

Mini-example:

Trajectory communities across time: [C0, C1, C1, C2]
Adjacent transitions counted:
- C0->C1 (+1)
- C1->C1 (+1)
- C1->C2 (+1)
With many trajectories, counts aggregate and become weighted directed edges.
If the resulting graph is not strongly connected (it splits into more than one strongly connected component, SCC), small bridging edges are added to enforce strong connectivity.

Output is a structural community representation used as goal_state/context feature, not a planner rollout model.
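
The directed layer of this construction can be sketched as below. The counting step follows the mini-example directly; the repair step is an assumption, since the post does not say how the repository picks its bridging edges. Here a small-weight cycle over all communities is added, which is one simple way to guarantee strong connectivity. All function names are illustrative.

```python
from collections import defaultdict


def community_transition_counts(community_trajs):
    """Count directed transitions between communities at adjacent timesteps."""
    counts = defaultdict(float)
    for seq in community_trajs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1.0
    return dict(counts)


def _reachable(start, edges):
    """Set of nodes reachable from start by following directed edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(nodes, edges):
    """True if one start node reaches, and is reached from, every node."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    forward = _reachable(start, edges)
    backward = _reachable(start, [(b, a) for a, b in edges])
    return nodes <= forward and nodes <= backward


def add_bridging_edges(nodes, counts, eps=1e-3):
    """Assumed repair rule: add an eps-weight cycle through all communities."""
    bridged = dict(counts)
    order = sorted(nodes)
    for a, b in zip(order, order[1:] + order[:1]):
        bridged.setdefault((a, b), eps)  # never overwrite observed counts
    return bridged


trajs = [["C0", "C1", "C1", "C2"]]
counts = community_transition_counts(trajs)
nodes = {c for seq in trajs for c in seq}
bridged = add_bridging_edges(nodes, counts)
```

In this toy case the only edge added is C2->C0 with weight `eps`, because C0->C1 and C1->C2 were already observed.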

6) Is SIDM code model-based RL?

Short answer: not in the standard planning/imagined-rollout sense.

Why people ask this:

SIDM builds explicit structural graphs from transitions, which resembles “world understanding”.

Why it is usually still considered model-free + model-informed:

no explicit MPC/tree-search/planning over learned dynamics,
no imagined trajectory rollouts inside the policy improvement loop,
RL backbones still update from real sampled transitions/rollouts.

So a better label is:

structure-regularized model-free RL or
model-informed representation learning for RL.

7) Reproducible specification (appendix)

A. SISA reproducible spec

Inputs:

replay transitions (obs, action, reward, next_obs, done).

Modules:

SAC actor/critic/encoder,
SI branch with pretrain/finetune/abstract losses.

Per update:

sample replay minibatch,
update SAC critic (and actor/alpha on schedule),
run SI branch by phase schedule,
in abstract phase, rebuild partition-level transition/action/reward graphs from minibatch pair mapping,
backprop combined loss to encoder/SI head.

Inference:

actor path for action selection; SI affects policy through trained representation.

B. SISL reproducible spec

Inputs:

demo trajectories + online env rollouts.

Modules:

SkillVAE, skill prior (flow), high-level PPO, residual PPO,
SI modules (undirected abstraction + directed structural model).

Stage B (skill training):

compute SI graph abstractions from observation batches,
derive goal_state,
train SkillVAE/prior on [obs, goal_state]-conditioned inputs.

Stage C (hierarchical RL):

periodically recompute SI goal_state,
condition high-level and residual decisions on SI context,
execute environment steps and PPO updates.

8) Concise conclusions: what each method is really doing

SISA (one-line view):
“SAC with SI-based latent-space shaping and graph-regularized feature engineering.”
SISL (one-line view):
“ReSkill-style hierarchy with SI-derived context features injected into skill learning and control.”
Framework-level view:
“SIDM is less a single runtime algorithm and more a transferable structural-abstraction principle applied to different RL backbones.”

References

SIDM paper: https://arxiv.org/html/2404.09760v2
SIDM code: https://github.com/SELGroup/SIDM
RAD: https://github.com/MishaLaskin/rad
CURL: https://github.com/MishaLaskin/curl
ReSkill reference: https://github.com/krishanrana/reskill
