Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling

1Carnegie Mellon University; 2FAIR, Meta; 3New York University

Abstract

Reasoning from sequences of raw sensory data is a ubiquitous problem across fields ranging from medical devices to robotics. These problems often involve using long sequences of raw sensor data (e.g. magnetometers, piezoresistors) to predict sequences of desirable physical quantities (e.g. force, inertial measurements). While classical approaches are powerful for locally-linear prediction problems, they often fall short when using real-world sensors. These sensors are typically non-linear, are affected by extraneous variables (e.g. vibration), and exhibit data-dependent drift. For many problems, the prediction task is exacerbated by small labeled datasets, since obtaining ground-truth labels requires expensive equipment. In this work, we present Hierarchical State-Space Models (HiSS), a conceptually simple new technique for continuous sequential prediction. HiSS stacks structured state-space models on top of each other to create a temporal hierarchy. Across six real-world sensor datasets, from tactile-based state prediction to accelerometer-based inertial measurement, HiSS outperforms prior sequence models such as causal transformers, LSTMs, S4, and Mamba by at least 23% on MSE. Our experiments further indicate that HiSS scales efficiently to smaller datasets and is compatible with existing data-filtering techniques.

Video

HiSS: Hierarchical State Space Models

Model Architecture

HiSS Model Architecture

(Left) Flat SSM directly maps a sensor sequence to an output sequence.

(Right) HiSS divides an input sequence into chunks, which a low-level SSM processes into chunk features. A high-level SSM then maps the resulting sequence of chunk features to an output sequence.
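
As a rough illustration of the chunking described above, here is a minimal PyTorch-style sketch. The low_ssm and high_ssm modules are placeholders for any structured state-space block (e.g. S4 or Mamba), and the chunk size, feature dimensions, and use of the last chunk state as the chunk feature are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HiSS(nn.Module):
    """Hierarchical sketch: a low-level sequence model summarizes fixed-size
    chunks of the raw sensor stream, and a high-level sequence model maps the
    resulting chunk features to the output sequence."""

    def __init__(self, low_ssm: nn.Module, high_ssm: nn.Module,
                 chunk_size: int, high_dim: int, out_dim: int):
        super().__init__()
        self.low_ssm = low_ssm      # placeholder: any causal (B, T, D) -> (B, T, D') module
        self.high_ssm = high_ssm    # placeholder: any causal (B, T, D') -> (B, T, high_dim) module
        self.chunk_size = chunk_size
        self.head = nn.Linear(high_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, sensor_dim); assumes T is a multiple of chunk_size.
        B, T, D = x.shape
        k = self.chunk_size
        chunks = x.reshape(B * (T // k), k, D)            # treat each chunk as a short sequence
        chunk_feats = self.low_ssm(chunks)[:, -1]         # last low-level output = chunk feature
        chunk_feats = chunk_feats.reshape(B, T // k, -1)  # (B, T/k, low_dim)
        high_out = self.high_ssm(chunk_feats)             # (B, T/k, high_dim)
        return self.head(high_out)                        # (B, T/k, out_dim)
```

For a quick smoke test, any batch-first sequence module that returns a (B, T, D) tensor can stand in for low_ssm and high_ssm, e.g. a GRU wrapped to drop its hidden-state output or an off-the-shelf S4/Mamba block.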

Experiments and Results

HiSS Model Results

Comparison of MSE prediction losses for flat and HiSS models on CSP-Bench. Reported numbers are averaged over 5 seeds for the best-performing models. MW: Marker Writing, IS: Intrinsic Slip, R: RoNIN, V: VECtor, JC: Joystick Control, TC: TotalCapture.

CSP Datasets

Marker Writing Dataset

Contains 1000 trajectories, each lasting 15-30 seconds, recorded as a robot moves a grasped marker to 8-12 randomly chosen locations within a 10 cm × 10 cm workspace to make linear strikes on paper.

Prediction task: predict the strike velocity (δx/δt, δy/δt), given the ReSkin tactile signals.
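
For illustration, velocity targets of this form can be obtained by finite-differencing the logged marker positions. A minimal sketch, assuming a fixed sampling interval and synthetic position data (the names, sampling rate, and padding choice are not from the dataset):

```python
import numpy as np

def finite_difference_velocity(positions: np.ndarray, dt: float) -> np.ndarray:
    """Approximate per-step velocity targets (dx/dt, dy/dt) from a (T, 2)
    array of marker positions sampled at a fixed interval dt."""
    vel = np.diff(positions, axis=0) / dt    # forward differences, shape (T - 1, 2)
    return np.vstack([vel, vel[-1:]])        # repeat the last step so the output length stays T

# Hypothetical usage: a 20-second trajectory logged at 100 Hz.
positions = np.cumsum(np.random.randn(2000, 2) * 1e-3, axis=0)
velocity_targets = finite_difference_velocity(positions, dt=0.01)   # (2000, 2)
```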

Intrinsic Slip Dataset

Contains 1100 trajectories, each lasting 25-30 seconds, recorded as a robot grasps and slips along different boxes clamped to a table. We use 10 distinct boxes and 4 sets of skins, with 25 trajectories per box-skin pair.

Prediction task: predict the end-effector's translational and angular velocity (δx/δt, δy/δt, δθ/δt), given the ReSkin tactile signals.
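
The orientation component δθ/δt needs slightly more care than the translational terms, because raw angle differences can wrap around ±π. A hedged sketch of one common way to handle this (the wrapping convention, sampling interval, and names are assumptions, not dataset details):

```python
import numpy as np

def angular_velocity(theta: np.ndarray, dt: float) -> np.ndarray:
    """Finite-difference angular velocity from a (T,) array of end-effector
    yaw angles in radians, using the shortest signed rotation per step."""
    dtheta = np.diff(theta)
    dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi   # map each difference into [-pi, pi)
    omega = dtheta / dt
    return np.append(omega, omega[-1])                # pad to length T
```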

Joystick Control Dataset

Contains 1000 trajectories, each lasting 25-40 seconds, recorded as an Allegro Hand fitted with Xela sensors, mounted on a Franka arm, interacts with an Extreme3D Pro joystick.

Prediction task: predict the joystick's X, Y, and Z-twist states, given the Xela tactile signals.

RoNIN Dataset[1]

RoNIN - Smartphone IMU data from 100 human subjects, with ground-truth 3D trajectories collected under natural human motions.

Prediction task: predict the X and Y axis velocities, given the IMU acceleration and gyroscope readings.

RoNIN a006_2 sequence

VECtor Dataset[2]

VECtor - An indoor multi-sensor SLAM dataset collected with handheld platforms covering versatile motion types.

Prediction task: predict the X, Y, and Z axis velocities, given the IMU acceleration and gyroscope readings.

VECtor Desk Normal sequence

TotalCapture Dataset[3]

TotalCapture - A 3D human pose estimation dataset with multi-view video, IMU readings, and Vicon ground-truth labels.

Prediction task: predict the X, Y, and Z axis velocities of 21 joints, given acceleration and orientation readings from the 12 IMUs.

TotalCapture S1 acting1 sequence