TouchAnything

The First Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video

Jianyi Zhou1, Ziteng Gao1, Feiyang Hong1, Zirui Liu1, Guannan Zhang1, Weisheng Dai1,
Ruichen Zhen2, Chuqiao Lyu2, Haotian Wu2, Yinian Mao2, Xushi Wang1, Yuxiang Jiang1, Shuo Yang1✉

1Harbin Institute of Technology, Shenzhen    2Meituan Academy of Robotics
Corresponding author: shuoyang@hit.edu.cn

Overview

EgoTouch is the first large-scale multi-view tactile dataset for egocentric hand-object interaction, featuring 302 diverse manipulation tasks across 4,530 episodes in both indoor and outdoor environments. The dataset provides synchronized multi-view video (egocentric + dual wrist cameras), bimanual 3D hand pose (42 joints), and dense continuous pressure maps from wearable tactile sensors. Building on this data, the TouchAnything framework combines global scene context with wrist-level observations of hand-object contact, enabling comprehensive modeling of tactile feedback under realistic occlusion and viewpoint variation.

Abstract

Vision has become the dominant sensing modality for hand-object interaction, yet it remains insufficient for dexterous manipulation as it cannot directly capture contact forces, pressure distributions, or subtle contact state changes. While tactile sensing provides these critical cues, high-quality tactile hardware is difficult to deploy at scale, motivating growing interest in vision-to-touch prediction, which aims to infer tactile feedback from visual observations. However, progress in this area is fundamentally constrained by the lack of large-scale, diverse, and high-fidelity datasets. Existing datasets are often limited in scale, restricted to single-view capture or narrow interaction scenarios, or lack dense tactile ground truth. In particular, egocentric single-view observations suffer from severe occlusion of hand-object contact regions, making accurate tactile inference highly challenging. To address these limitations, we introduce EgoTouch, a large-scale multi-view tactile dataset for egocentric hand-object interaction. EgoTouch comprises 302 diverse manipulation tasks spanning 4,530 episodes across both indoor and outdoor environments, with synchronized multi-view observations (egocentric and dual wrist-mounted cameras), bimanual 3D hand pose annotations, and dense continuous pressure maps captured by wearable tactile sensors. This comprehensive setup enables learning cross-view consistent representations of contact dynamics under realistic occlusion and interaction variability. Building upon EgoTouch, we further propose TouchAnything, a unified architecture for multi-view tactile prediction. Effectively leveraging multi-view observations is non-trivial due to view inconsistency and partial observability; TouchAnything addresses both with a shared DINOv2 encoder, learnable view embeddings, cross-view attention, and a view dropout strategy, enabling flexible inference under varying view availability. Extensive experiments demonstrate that complementary wrist views are crucial for resolving occluded contact regions and enabling reliable tactile inference, particularly under severe occlusion. These results highlight the importance of multi-view perception for tactile reasoning in real-world interaction scenarios. We will publicly release the dataset, code, and benchmark to facilitate future research bridging visual perception and physical interaction in embodied systems.

Key Features

Multi-View Capture

First dataset combining multi-view synchronized video (egocentric + dual wrist cameras) with real tactile pressure data

Dense Tactile Sensing

Real continuous pressure distributions from wearable sensors, capturing fine-grained contact dynamics

Bimanual Interaction

Two-handed manipulation with 42-joint (21 per hand) 3D hand pose annotations, enabling analysis of coordinated hand-object interaction

Synchronized Modalities

Precise frame-level synchronization across video, pose, and pressure, enabling accurate temporal modeling of contact events
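As a concrete illustration, a minimal sketch of timestamp-based alignment is shown below: each video frame is matched to the nearest pose and pressure samples, and frames without a match inside the tolerance are dropped. The stream layout, function names, and the 20 ms tolerance are illustrative assumptions, not the dataset's actual synchronization pipeline.

```python
# Minimal sketch of frame-level synchronization across modalities (assumed
# format: each stream is a sorted, non-empty list of timestamps in seconds).
import bisect

def nearest_index(timestamps, t):
    """Index of the timestamp in a sorted list that is closest to t."""
    i = bisect.bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    return min(candidates, key=lambda j: abs(timestamps[j] - t))

def synchronize(video_ts, pose_ts, pressure_ts, tol=0.02):
    """For each video frame, pick the nearest pose and pressure samples.

    Returns (video_idx, pose_idx, pressure_idx) triplets; frames whose best
    match is more than `tol` seconds away are dropped, so every retained
    triplet is aligned to within the tolerance.
    """
    triplets = []
    for k, t in enumerate(video_ts):
        i = nearest_index(pose_ts, t)
        j = nearest_index(pressure_ts, t)
        if abs(pose_ts[i] - t) <= tol and abs(pressure_ts[j] - t) <= tol:
            triplets.append((k, i, j))
    return triplets
```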

Dataset Statistics

302 Manipulation Tasks
4.5K Episodes
3 Camera Views
42 Hand Joints

Dataset Comparison

EgoTouch is the first dataset to jointly provide multi-view video, bimanual hand pose, and real dense pressure data across diverse scenes.

| Dataset | In-the-wild | Hand Pose | Contact | Wrist Views | Hands | Objects | Frames |
|---|---|---|---|---|---|---|---|
| GRAB | | MoCap | Analytical | – | Biman. | 51 | 1.6M |
| ContactDB | | – | Thermal | – | Biman. | 50 | 375k |
| ARCTIC | | MoCap | Analytical | – | Biman. | 11 | 2.1M |
| OakInk | | MoCap | Analytical | – | Single | 100 | 230k |
| DexYCB | | Est. | – | – | Single | 20 | 582k |
| ActionSense | | Glove | Pressure | – | Biman. | 21 | 521k |
| HOI4D | | Est. | – | – | Single | 800 | 2.4M |
| EgoPressure | | Est. | Pressure | – | Single | 31 | 4.3M |
| EgoDex | | Est. | Analytical | – | Biman. | 500 | 90M |
| OpenTouch | | Glove | Pressure | – | Single | 800 | ~500k |
| EgoTouch (Ours) | ✓ | Glove | Pressure | ✓ | Biman. | 1000 | 2M |

Main Contributions

Data Collection Setup

Multi-sensor data collection system with head-mounted egocentric camera, dual wrist cameras, and pressure-sensing gloves.

Our data collection system integrates:

  • A head-mounted camera capturing the egocentric view with global scene context
  • Dual wrist-mounted cameras capturing close-range views of hand-object contact
  • Pressure-sensing gloves recording dense continuous pressure maps from both hands

Example Multi-Modal Data

Example showing synchronized multi-view video, hand pose, and pressure maps revealing contact information invisible to vision alone.
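For concreteness, a hypothetical layout of one synchronized sample is sketched below; the field names, image sizes, and pressure-map resolution are illustrative assumptions rather than the released schema.

```python
# Hypothetical per-frame sample layout (all names and shapes are assumptions).
import numpy as np

sample = {
    "ego_rgb":         np.zeros((480, 640, 3), dtype=np.uint8),   # egocentric view
    "wrist_left_rgb":  np.zeros((480, 640, 3), dtype=np.uint8),   # left wrist camera
    "wrist_right_rgb": np.zeros((480, 640, 3), dtype=np.uint8),   # right wrist camera
    "hand_pose_3d":    np.zeros((42, 3), dtype=np.float32),       # 21 joints per hand
    "pressure_left":   np.zeros((32, 32), dtype=np.float32),      # dense pressure map
    "pressure_right":  np.zeros((32, 32), dtype=np.float32),
    "timestamp":       0.0,                                       # shared clock, seconds
}
```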

Dataset Analysis

Comprehensive statistics showing diverse pressure patterns, balanced task coverage, and rich contact interactions.

Method Architecture

Multi-view tactile prediction architecture with shared vision encoding, cross-view attention, and pose-aware fusion.

We propose a multi-view tactile prediction model built on:

  • A shared DINOv2 encoder applied to every view
  • Learnable view embeddings that identify each camera viewpoint
  • Cross-view attention that fuses complementary observations across views
  • Pose-aware fusion that injects bimanual 3D hand pose cues
  • A view dropout strategy during training for robustness to missing views

This design allows the model to work with any subset of available views at inference time, from ego-only to full multi-view, making it practical for real-world deployment.
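The sketch below is a minimal PyTorch rendering of this design, assuming a toy convolutional encoder in place of DINOv2 and illustrative dimensions throughout; it shows the shared per-view encoder, learnable view embeddings, cross-view attention, and view dropout (pose-aware fusion is omitted for brevity). It is not the released implementation.

```python
# Minimal sketch of the multi-view tactile predictor described above.
import torch
import torch.nn as nn

class MultiViewTactilePredictor(nn.Module):
    def __init__(self, num_views=3, dim=256, pressure_hw=(32, 32)):
        super().__init__()
        # Shared encoder applied to every view (DINOv2 in the paper; toy CNN here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=4, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learnable embedding telling the fusion stage which view a token came from.
        self.view_embed = nn.Embedding(num_views, dim)
        # Cross-view attention fusing tokens from all available views.
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, pressure_hw[0] * pressure_hw[1])
        self.pressure_hw = pressure_hw

    def forward(self, views, view_ids, p_drop=0.3):
        # views: (B, V, 3, H, W); view_ids: (V,) indices into the view embedding.
        B, V = views.shape[:2]
        tokens = self.encoder(views.flatten(0, 1)).view(B, V, -1)
        tokens = tokens + self.view_embed(view_ids)[None]
        if self.training:
            # View dropout: randomly zero out whole views so the model learns
            # to predict from any subset (ego-only up to full multi-view).
            keep = (torch.rand(B, V, 1, device=views.device) > p_drop).float()
            tokens = tokens * keep
        fused = self.fuse(tokens).mean(dim=1)
        return self.head(fused).view(B, *self.pressure_hw)
```

Because views are encoded independently and fused by attention, a subset of views can be used at inference simply by passing fewer entries in `views` and `view_ids`.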

Model Inference Results

Real-world tactile prediction results on diverse manipulation tasks. Our model accurately predicts contact regions and pressure distributions from multi-view video input.

Grasping Thermos

Multi-view tactile prediction during bimanual thermos manipulation

Handling Hair Dryer

Contact prediction on complex-shaped household appliances

Grasping Beverage

Accurate pressure estimation during bottle manipulation

Picking Up Mouse

Precise contact localization on small objects

Bouncing Ping-Pong Ball

Temporal contact prediction during dynamic interaction

USB Insertion

Fine manipulation with dynamic contact changes

These results demonstrate our model's ability to predict realistic tactile feedback across diverse manipulation scenarios, from static grasping to dynamic interactions.

Data Scaling Analysis

Performance improves consistently with more training data, demonstrating the model's ability to leverage large-scale datasets.

Qualitative Results

Tactile prediction results showing accurate contact region and pressure intensity prediction across diverse manipulation tasks.

Applications

Core Applications

Robotic Manipulation

See contact. Predict force. Enable reliable manipulation under occlusion.

Interaction Understanding

Move beyond vision. Understand contact dynamics, grasp quality, and manipulation skills.

Broader Impact

AR/VR Haptics

Bring touch into virtual worlds through vision-driven haptic feedback.

Assistive Prosthetics

Support vision-based sensory feedback for more intuitive prosthetic control.

Author Contributions

This work represents a collaborative effort across multiple domains, from hardware setup to model development and data collection.

Jianyi Zhou

  • Paper writing and manuscript preparation
  • Model architecture design and training
  • Project website development
  • Data collection participation

Ziteng Gao

  • Data collection pipeline development
  • Paper figure creation (partial)
  • Experimental equipment management
  • Data collection participation

Feiyang Hong

  • Codebase organization and documentation
  • Paper figure optimization
  • Data collection participation

Zirui Liu

  • Paper figure design and illustration
  • Data collection participation

Guannan Zhang

  • Project website video editing
  • Data collection participation

Weisheng Dai

  • Experimental equipment management
  • Data collection participation

Ruichen Zhen

  • Equipment procurement and acquisition
  • Data collection hardware configuration

Chuqiao Lyu

  • Data collection system setup and deployment

Haotian Wu

  • Data collection hardware setup support

Yinian Mao

  • High-level guidance and project support

Xushi Wang

  • HaMeR inference framework development
  • Data collection participation

Yuxiang Jiang

  • Data collection participation

Shuo Yang (Corresponding Author)

  • Research ideation and project conceptualization
  • Overall project supervision and research direction
  • Model design and architecture consultation
  • Technical guidance on system development and methodology
  • Strategic planning of research scope and objectives
  • Manuscript revision and high-level feedback