AINA🪞 | Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations
Irmak Guzey1,2, Haozhi Qi2, Julen Urain2, Changhao Wang2, Jessica Yin2, Krishna Bodduluri2, Mike Maroje Lambeta2, Lerrel Pinto1, Akshara Rai2, Jitendra Malik2, Tingfan Wu2, Akash Sharma2, Homanga Bharadhwaj2
1 New York University   2 Meta
Official repository for Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations. The project website can be found at aina-robot.github.io.
This repository includes code for preprocessing in-the-wild Aria demonstrations, in-scene demonstrations, domain alignment between them, and training point-based policies. We provide a Quick Start section that demonstrates how to process Aria Gen2 demonstrations to obtain 3D human demonstrations with aligned object and hand points.
Feel free to reach out to [email protected] with any questions regarding this repository.
- Installation
- Quick Start
- Calibration
- Data Collection
- Data Processing
- Training Point-Based Policies
- Citation
git clone --recurse-submodules https://github.com/facebookresearch/AINA.git
conda env create -f conda_env.yaml
conda activate aina
pip install -e .
Follow the instructions in the Aria 2 Client-SDK documentation to verify your Client SDK installation.
AINA uses FoundationStereo, CoTracker, and GroundedSAM to extract and track object-specific 3D points, and Hamer to estimate hand poses in the in-scene demonstration.
For the following submodules, cd into each submodule's root directory and run the commands below:
Co-Tracker
cd submodules/co-tracker
pip install -e .
Grounded-SAM-2
cd submodules/Grounded-SAM-2
cd checkpoints
bash download_ckpts.sh
cd ../gdino_checkpoints
bash download_ckpts.sh
Hamer (Needed for Processing In-Scene Demonstration)
cd submodules/hamer
pip install -e .[all] --no-build-isolation
cd third-party
git clone https://github.com/ViTAE-Transformer/ViTPose.git
cd ..
pip install -v -e third-party/ViTPose
Make sure to download the checkpoints for CoTracker and FoundationStereo to their corresponding folders (a download sketch follows this list):
- FoundationStereo checkpoints under submodules/FoundationStereo/pretrained_models/23-51-11
- CoTracker2 checkpoints to submodules/co-tracker/checkpoints/cotracker2.pth
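The CoTracker2 checkpoint can typically be fetched with wget; the URL below is the one listed in the CoTracker repository at the time of writing, so verify it in submodules/co-tracker before relying on it. FoundationStereo distributes its 23-51-11 model through the download links in its own repository; place those files under submodules/FoundationStereo/pretrained_models/23-51-11.
mkdir -p submodules/co-tracker/checkpoints
# URL taken from the CoTracker README; double-check it there if the download fails.
wget -O submodules/co-tracker/checkpoints/cotracker2.pth \
    https://huggingface.co/facebook/cotracker/resolve/main/cotracker2.pth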
We provide one Aria demonstration and one in-scene demonstration to showcase preprocessing. These demonstrations are hosted on an OSF project.
Install osfclient (pip install osfclient) and run:
bash download_data.sh
This will download Aria and human demonstrations under data/osfstorage.
ROS2 Installation
AINA uses ROS2 for controlling the Ability hand and reading from the Realsense cameras on an Ubuntu 22.04 workstation. Please follow the instructions in the ROS2 Humble installation guide to install ROS2 Humble. If you are not using the Ability hand and can implement your own camera drivers, you do not need ROS2.
Note: For the Calibration and Human Data Collection sections below, we assume that the ROS2 driver for each Realsense camera is running in a separate process. We initialize these drivers as follows:
For right camera:
ros2 launch realsense2_camera rs_launch.py camera_namespace:=realsense camera_name:=right_camera serial_no:='"934222072381"' pointcloud.enable:=true align_depth.enable:=true
For left camera:
ros2 launch realsense2_camera rs_launch.py camera_namespace:=realsense camera_name:=left_camera serial_no:='"925622070557"' pointcloud.enable:=true align_depth.enable:=true
Also, update the REALSENSE_CAMERA_IDS, LEFT_INTRINSICS, and RIGHT_INTRINSICS constants in aina/utils/constants.py to match your setup.
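For reference, these constants might look roughly like the sketch below. The dictionary/array layout and the intrinsics values are illustrative assumptions, so mirror whatever structure aina/utils/constants.py actually uses and plug in your own values (e.g. as reported by rs-enumerate-devices -c).
import numpy as np

# Illustrative values only -- the real constants live in aina/utils/constants.py
# and their exact structure may differ.
REALSENSE_CAMERA_IDS = {
    "left_camera": "925622070557",   # serial numbers from the launch commands above
    "right_camera": "934222072381",
}
# Placeholder 3x3 pinhole intrinsics; replace with your cameras' calibrated values.
LEFT_INTRINSICS = np.array(
    [[615.0, 0.0, 320.0],
     [0.0, 615.0, 240.0],
     [0.0, 0.0, 1.0]]
)
RIGHT_INTRINSICS = np.array(
    [[615.0, 0.0, 320.0],
     [0.0, 615.0, 240.0],
     [0.0, 0.0, 1.0]]
)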
Robot Driver Installation
We control the Kinova arm with the Kortex API and the Ability hand with its Python API. Install these drivers if you'd like to reproduce the robot deployment as well.
Run bash download_data.sh to download example demonstrations.
To process an Aria demonstration and extract 3D object points aligned with hand points, run the following script. It will:
- Extract metric stereo depth from the two front SLAM cameras of the downloaded Aria demonstration using FoundationStereo.
- Segment and track objects in 2D with language prompts using GroundedSAM2 and CoTracker.
- Project those tracks into 3D.
- Visualize object points and hand detections in 2D and 3D using Rerun.
import os
from aina.preprocessing.aria.depth_extractor import VRSDepthExtractor
from aina.preprocessing.aria.object_tracker import ObjectTracker
from aina.preprocessing.aria.vrs_demo import VRSDemo, VRSProcessorConfig
from aina.utils.file_ops import get_repo_root
if __name__ == "__main__":
    vrs_demo = VRSDemo(
        os.path.join(get_repo_root(), "data/osfstorage/aria_data", "trimmed_stewing.vrs"),
        VRSProcessorConfig(),
    )
    depth_extractor = VRSDepthExtractor(vrs_demo)
    object_tracker = ObjectTracker(
        vrs_demo, depth_extractor, text_prompt=["bowl", "toaster oven"]
    )
    points_2d, points_3d = object_tracker.get_demo_points(visualize=True)
Alternatively, run python preprocess_aria_demo.py.
Your expected output should be a Rerun visualizer that looks like the following:
AINA assumes access to a calibrated environment. Here we provide code for hand-eye calibration of an environment with two Realsense cameras and an ArUco marker mount that can be attached to the end of a robot arm. Print an ArUco marker of size 0.055 m from the 4x4_50 dictionary with ID 0, attach it to this mount, and run:
python hand_eye_calibration.py
Then move the robot arm, with the marker mount attached, to different poses using a joystick. Press Enter each time to capture an image of the environment. Collect approximately 30 poses per camera (each camera must observe the ArUco marker in at least 30 poses for accuracy), then press Ctrl+C to terminate.
The script will save the calibration data and compute the 2D pixel reprojection error. We typically expect this error to be below 5 pixels per ArUco marker corner.
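For orientation, the core computation in this kind of fixed-camera (eye-to-hand) setup is the standard OpenCV hand-eye solve. The sketch below illustrates that step under those assumptions; it is not necessarily the exact code in hand_eye_calibration.py.
import cv2
import numpy as np

def solve_camera_to_base(base2gripper_poses, target2cam_poses):
    """Both arguments are lists of 4x4 homogeneous transforms, one pair per capture.

    base2gripper_poses: inverses of the end-effector poses (base frame expressed in gripper frame).
    target2cam_poses: ArUco marker poses in the camera frame (e.g. from solvePnP).
    """
    R_b2g = [T[:3, :3] for T in base2gripper_poses]
    t_b2g = [T[:3, 3] for T in base2gripper_poses]
    R_t2c = [T[:3, :3] for T in target2cam_poses]
    t_t2c = [T[:3, 3] for T in target2cam_poses]
    # Feeding inverted gripper poses turns OpenCV's eye-in-hand solver into the
    # eye-to-hand case, so the returned transform is camera-to-base.
    R, t = cv2.calibrateHandEye(R_b2g, t_b2g, R_t2c, t_t2c, method=cv2.CALIB_HAND_EYE_TSAI)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t.squeeze()
    return T  # analogous to the LEFT_TO_BASE / RIGHT_TO_BASE constants below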
This code can be used with any mount and any robot, but if you are using a different robot you will need to edit the WRIST_TO_EEF constant in aina/utils/constants.py to reflect your setup.
This script will print camera-to-base transforms for all cameras. Make sure to update these constants at aina/utils/constants.py accordingly:
LEFT_TO_BASE = np.array( # NOTE: This should be updated!!!
[
[-0.72, -0.56, 0.42, -0.01],
[-0.69, 0.51, -0.51, 0.65],
[0.07, -0.65, -0.75, 0.5],
[0.0, 0.0, 0.0, 1.0],
]
)
RIGHT_TO_BASE = np.array( # NOTE: This should be updated!!!
[
[0.96, 0.16, -0.22, 0.45],
[0.27, -0.71, 0.65, -0.19],
[-0.05, -0.69, -0.73, 0.51],
[0.0, 0.0, 0.0, 1.0],
]
)
Follow the instructions in the Aria 2 Recording documentation to record a demonstration using the Companion App. After you record a demonstration and download it to your workstation, you will have a .vrs file. This repo contains code for preprocessing that .vrs recording.
AINA uses a single in-scene human demonstration as an anchor to ground the Aria 2 demonstrations to the same environment as the robot.
In one terminal run:
python start_camera_servers.py
In another terminal run:
python collect_human_demonstration.py --task_name <task-name> --demo_num <demonstration-number>
Ctrl+C terminates the recording and saves the demonstration to ./human_data/{task_name}/demo_{demo_num}; edit the script if you'd like to change the recording location.
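For example, to record the first demonstration of a task you might call stewing:
python collect_human_demonstration.py --task_name stewing --demo_num 0
This would save the recording to ./human_data/stewing/demo_0.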
python preprocess_aria_demo.py
This script will dump points-3d.npy under the Aria demo root (data/osfstorage/aria_data). This NumPy array holds the object points with respect to the world frame of the Aria glasses.
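To sanity-check the output, the dumped array can be loaded with NumPy; the shape comment below is an assumption about the layout, not a guarantee.
import numpy as np

points_3d = np.load("data/osfstorage/aria_data/points-3d.npy")
# Expected to be a float array of 3D object points, e.g. (num_frames, num_points, 3);
# verify the exact layout for your demonstration.
print(points_3d.shape, points_3d.dtype)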
python preprocess_in_scene_demo.py
This script will dump object-poses-in-base.npy and hand-poses-in-base.npy to the in-scene demo root (data/osfstorage/human_data). These arrays hold the object points and hand keypoints with respect to the base of the Kinova arm. The script requires around 15 GB of GPU RAM; if you don't have that and hit CUDA allocation errors, we provide the dumped .npy files from this script so you can proceed to the next step.
NOTE: If you run into an error like:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
downgrade NumPy to 1.26.4 and retry.
python align_aria_to_in_scene.py
This script will dump object-poses-in-base.npy and hand-poses-in-base.npy to the Aria demo root (data/osfstorage/aria_data).
These points are now expressed in the base frame of the Kinova arm.
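Conceptually, this alignment amounts to estimating a rigid transform between corresponding object points observed in the Aria world frame and in the robot base frame, using the in-scene demonstration as the anchor. The sketch below shows the standard Kabsch/SVD solution to that sub-problem; it illustrates the idea and is not necessarily the exact procedure used in align_aria_to_in_scene.py.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares rigid transform mapping src (N,3) onto dst (N,3) via Kabsch/SVD."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = dst_c - R @ src_c
    return T

# Hypothetical usage: map Aria world-frame object points into the Kinova base frame
# using corresponding object points from the in-scene anchor demonstration.
# aria_to_base = rigid_transform(aria_object_points, in_scene_object_points)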
python train.py root_dir=.
This script starts training the Vector Neurons-based Point-Policy architecture described in the paper, using both the Aria and the in-scene demonstrations. Model weights and logs will be saved under {root_dir}/aina-trainings/. To edit hyperparameters, refer to cfgs/train.yaml.
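Since root_dir=. is a key=value override, other entries in cfgs/train.yaml can most likely be overridden from the command line in the same way. The keys below are hypothetical placeholders, so check cfgs/train.yaml for the real names:
python train.py root_dir=. batch_size=64 lr=1e-4  # hypothetical keys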
If you find the code in this repo helpful, please consider citing our paper:
@misc{guzey2025aina,
title={Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations},
author={Irmak Guzey and Haozhi Qi and Julen Urain and Changhao Wang and Jessica Yin and Krishna Bodduluri and Mike Lambeta and Lerrel Pinto and Akshara Rai and Jitendra Malik and Tingfan Wu and Akash Sharma and Homanga Bharadhwaj},
year={2025},
eprint={2511.16661},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.16661},
}
AINA is CC-BY-NC licensed, as found in the LICENSE file.
AINA uses several open-source libraries and packages, such as PyTorch, NumPy, SAM, Co-Tracker, Hamer, FoundationStereo, OpenCV, the Aria SDK, and ROS for training/evaluation, and Rerun for visualization. We are grateful to the open-source community for this!

