The First Dataset and Framework for Bimanual Tactile Estimation from Egocentric Video
1Harbin Institute of Technology, Shenzhen
2Meituan Academy of Robotics
✉Corresponding author: shuoyang@hit.edu.cn
EgoTouch is the first large-scale multi-view tactile dataset for egocentric hand-object interaction, featuring 302 diverse manipulation tasks across 4,530 episodes in both indoor and outdoor environments. The dataset provides synchronized multi-view video (egocentric + dual wrist cameras), bimanual 3D hand pose (42 joints), and dense continuous pressure maps from wearable tactile sensors. By combining global scene context with wrist-level observations of hand-object contact, our accompanying TouchAnything model enables comprehensive modeling of tactile feedback under realistic occlusion and viewpoint variation.
Vision has become the dominant sensing modality for hand-object interaction, yet it remains insufficient for dexterous manipulation as it cannot directly capture contact forces, pressure distributions, or subtle contact state changes. While tactile sensing provides these critical cues, high-quality tactile hardware is difficult to deploy at scale, motivating growing interest in vision-to-touch prediction, which aims to infer tactile feedback from visual observations. However, progress in this area is fundamentally constrained by the lack of large-scale, diverse, and high-fidelity datasets. Existing datasets are often limited in scale, restricted to single-view capture or narrow interaction scenarios, or lack dense tactile ground truth. In particular, egocentric single-view observations suffer from severe occlusion of hand-object contact regions, making accurate tactile inference highly challenging. To address these limitations, we introduce EgoTouch, a large-scale multi-view tactile dataset for egocentric hand-object interaction. EgoTouch comprises 302 diverse manipulation tasks spanning 4,530 episodes across both indoor and outdoor environments, with synchronized multi-view observations (egocentric and dual wrist-mounted cameras), bimanual 3D hand pose annotations, and dense continuous pressure maps captured by wearable tactile sensors. This comprehensive setup enables learning cross-view consistent representations of contact dynamics under realistic occlusion and interaction variability. Building upon EgoTouch, we further propose TouchAnything, a unified architecture for multi-view tactile prediction. The model integrates a shared DINOv2 encoder, learnable view embeddings, cross-view attention, and a view dropout strategy, enabling flexible inference under varying view availability. Effectively leveraging multi-view observations is non-trivial due to view inconsistency and partial observability, which our design explicitly addresses. Extensive experiments demonstrate that incorporating complementary wrist views is crucial for resolving occluded contact regions and enabling reliable tactile inference, particularly under severe occlusions. These results highlight the importance of multi-view perception for tactile reasoning in real-world interaction scenarios. We will publicly release the dataset, code, and benchmark to facilitate future research bridging visual perception and physical interaction in embodied systems.
First dataset combining multi-view synchronized video (egocentric + dual wrist cameras) with real tactile pressure data
Real continuous pressure distributions from wearable sensors, capturing fine-grained contact dynamics
Bimanual manipulation with 42-joint 3D hand pose annotations, enabling analysis of coordinated hand-object interaction
Precise frame-level synchronization across video, pose, and pressure, enabling accurate temporal modeling of contact events
EgoTouch is the first dataset to jointly provide multi-view video, bimanual hand pose, and real dense pressure data across diverse scenes.
| Dataset | In-the-wild | Hand Pose | Contact | Wrist Views | Hands | Objects | Frames |
|---|---|---|---|---|---|---|---|
| GRAB | | MoCap | Analytical | | Biman. | 51 | 1.6M |
| ContactDB | | | Thermal | | Biman. | 50 | 375k |
| ARCTIC | | MoCap | Analytical | | Biman. | 11 | 2.1M |
| OakInk | | MoCap | Analytical | | Single | 100 | 230k |
| DexYCB | | Est. | | | Single | 20 | 582k |
| ActionSense | | Glove | Pressure | | Biman. | 21 | 521k |
| HOI4D | | Est. | | | Single | 800 | 2.4M |
| EgoPressure | | Est. | Pressure | | Single | 31 | 4.3M |
| EgoDex | | Est. | Analytical | | Biman. | 500 | 90M |
| OpenTouch | | Glove | Pressure | | Single | 800 | ~500k |
| EgoTouch (Ours) | ✓ | Glove | Pressure | ✓ | Biman. | 1000 | 2M |
Multi-sensor data collection system with head-mounted egocentric camera, dual wrist cameras, and pressure-sensing gloves.
Our data collection system integrates:
- A head-mounted camera capturing the egocentric view of the scene
- Dual wrist-mounted cameras providing close-up views of hand-object contact
- Pressure-sensing gloves recording dense, continuous pressure maps from both hands
- Frame-level synchronization across all video, hand pose, and pressure streams
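As a concrete illustration of the frame-level synchronization, the sketch below aligns the wrist-camera and pressure streams to the egocentric frame clock by nearest-timestamp matching. The capture rates, tolerance, and the `align_to_reference` helper are hypothetical and not part of the released tooling.

```python
# Minimal sketch (not the official tooling): aligning timestamped streams
# (egocentric video, wrist video, pressure maps) to a common frame clock
# by nearest-timestamp matching. Stream rates below are illustrative only.
import numpy as np

def align_to_reference(ref_ts: np.ndarray, stream_ts: np.ndarray, tol: float = 0.02):
    """For each reference timestamp, return the index of the nearest sample
    in `stream_ts`, or -1 if no sample lies within `tol` seconds."""
    idx = np.searchsorted(stream_ts, ref_ts)          # candidate insertion points
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    nearest = np.where(ref_ts - left <= right - ref_ts, idx - 1, idx)
    ok = np.abs(stream_ts[nearest] - ref_ts) <= tol
    return np.where(ok, nearest, -1)

# Illustrative clocks: 30 Hz egocentric video, 60 Hz wrist video, 100 Hz pressure.
ego_ts = np.arange(0, 10, 1 / 30)
wrist_ts = np.arange(0, 10, 1 / 60) + 0.003          # small constant offset
press_ts = np.arange(0, 10, 1 / 100) + 0.001

wrist_idx = align_to_reference(ego_ts, wrist_ts)
press_idx = align_to_reference(ego_ts, press_ts)
print(f"{np.mean(wrist_idx >= 0):.0%} of ego frames matched to a wrist frame")
```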
Example showing synchronized multi-view video, hand pose, and pressure maps revealing contact information invisible to vision alone.
Comprehensive statistics showing diverse pressure patterns, balanced task coverage, and rich contact interactions.
Multi-view tactile prediction architecture with shared vision encoding, cross-view attention, and pose-aware fusion.
We propose a multi-view tactile prediction model based on:
- A shared DINOv2 encoder that extracts visual features from every available view
- Learnable view embeddings that identify which camera each feature token comes from
- Cross-view attention that fuses complementary information across the egocentric and wrist views
- A view dropout strategy that randomly masks views during training, so the model tolerates missing cameras at inference (see the sketch after this section)
This design allows the model to work with any subset of available views at inference time, from ego-only to full multi-view, making it practical for real-world deployment.
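The sketch below is a minimal PyTorch rendition of these ideas, assuming a toy convolutional encoder in place of the DINOv2 backbone and a placeholder 32×32 two-channel pressure output; all layer sizes, names, and shapes are illustrative only, not the released implementation.

```python
# Minimal sketch of the multi-view design: shared per-view encoder, learnable
# view embeddings, cross-view attention, and view dropout. The tiny encoder and
# the pressure-map head are placeholders for the real DINOv2-based model.
import torch
import torch.nn as nn

class MultiViewTactilePredictor(nn.Module):
    def __init__(self, num_views=3, dim=256, out_hw=(32, 32)):
        super().__init__()
        # Shared encoder applied to every view (stand-in for a DINOv2 backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),   # 128x128 -> 8x8 tokens
            nn.Flatten(2),                                   # (B, dim, 64)
        )
        # One learnable embedding per camera (ego, left wrist, right wrist).
        self.view_embed = nn.Parameter(torch.zeros(num_views, 1, dim))
        # Cross-view attention: tokens from all views attend to each other.
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2
        )
        # Decode fused tokens into a dense pressure map per hand (2 channels).
        self.head = nn.Linear(dim, 2 * out_hw[0] * out_hw[1])
        self.out_hw = out_hw

    def forward(self, views, view_ids, drop_prob=0.0):
        """views: list of (B, 3, H, W) images; view_ids: camera index of each."""
        tokens = []
        for img, vid in zip(views, view_ids):
            # View dropout: randomly discard an entire view during training.
            if self.training and torch.rand(()) < drop_prob:
                continue
            t = self.encoder(img).transpose(1, 2)            # (B, 64, dim)
            tokens.append(t + self.view_embed[vid])          # tag with view identity
        assert tokens, "at least one view must remain"
        x = self.fuse(torch.cat(tokens, dim=1)).mean(dim=1)  # cross-view fusion + pooling
        return self.head(x).view(-1, 2, *self.out_hw)        # (B, 2, 32, 32)

# Usage: ego + two wrist views; at test time any subset of views can be passed.
model = MultiViewTactilePredictor()
imgs = [torch.randn(1, 3, 128, 128) for _ in range(3)]
pressure = model(imgs, view_ids=[0, 1, 2])
print(pressure.shape)  # torch.Size([1, 2, 32, 32])
```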
Real-world tactile prediction results on diverse manipulation tasks. Our model accurately predicts contact regions and pressure distributions from multi-view video input.
Grasping Thermos
Multi-view tactile prediction during bimanual thermos manipulation
Handling Hair Dryer
Contact prediction on complex-shaped household appliances
Grasping Beverage
Accurate pressure estimation during bottle manipulation
Picking Up Mouse
Precise contact localization on small objects
Bouncing Ping-Pong Ball
Temporal contact prediction during dynamic interaction
USB Insertion
Fine manipulation with dynamic contact changes
These results demonstrate our model's ability to predict realistic tactile feedback across diverse manipulation scenarios, from static grasping to dynamic interactions.
Performance improves consistently as more training data is used, indicating that the model continues to benefit from larger-scale datasets.
Tactile prediction results showing accurate contact region and pressure intensity prediction across diverse manipulation tasks.
See contact. Predict force. Enable reliable manipulation under occlusion.
Move beyond vision. Understand contact dynamics, grasp quality, and manipulation skills.
Bring touch into virtual worlds through vision-driven haptic feedback.
Support vision-based sensory feedback for more intuitive prosthetic control.
This work represents a collaborative effort across multiple domains, from hardware setup to model development and data collection.