2026.02.25 - #63 - Visual SLAM roadmap, GEAR-SONIC, AMD+Meta, Universal Beta Splatting, Reconstruct Anything, Qwen3.5, TinyClaw, Chroma, PointCloudCrafter, Splat Feature Solver #66

@changh95

📘 Study Roadmap : Visual-SLAM (Beginner → Master)

Created: 2026-02-22


Level 1: Beginner

Programming

  • C++: Pointers, OOP
  • Python
  • Bash/Linux: Basic terminal usage

Mathematics

  • Basic Probability & Statistics: Gaussian distribution, Bayes' theorem
  • Basic Linear Algebra: Vectors & Matrices, Determinant, Dot & Cross product, Rank, Inverse matrix, Transpose matrix, SVD, Eigenvalues/Eigenvectors
  • Logarithm & Exponential
  • Basic Calculus: Differentiation, Taylor expansion

Projective Geometry

  • Pinhole camera model → Image projection
  • Camera calibration: Intrinsic/Extrinsic parameters, Lens distortion
  • Rigid body motion: Euler/Quaternion/Rotation Matrix, Projective space & Vanishing point, Homogeneous transformation
  • Epipolar geometry → Essential & Fundamental matrix
  • Triangulation
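
As a minimal sketch of the pinhole model and rigid-body transform listed above, the following projects a world point into pixel coordinates. The function name `project_point` and the intrinsic values are illustrative assumptions, and lens distortion is ignored.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixels with an ideal (distortion-free) pinhole camera."""
    # Rigid-body transform into the camera frame: X_c = R * X_w + t
    x_c = [
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if x_c[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scale and shift by the intrinsics
    u = fx * x_c[0] / x_c[2] + cx
    v = fy * x_c[1] / x_c[2] + cy
    return u, v


# Assumed example values: identity pose, 500 px focal length, 640x480 image centre
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point([0.5, -0.25, 2.0], identity, [0.0, 0.0, 0.0],
                     fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)
```

Triangulation inverts this mapping: given two or more such projections of the same point from known poses, it recovers the 3D position.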

Camera Device

  • Lens, Sensor, Resolution/ISO/Aperture

Image Data

  • Colour image, Resolution, Grayscale image
  • Thresholding, Gaussian blur
  • Corner detector: Harris corner
  • Edge detector: Sobel & Canny Edge
  • Stereovision, RGB-D, Disparity, Depth

Level 2: Getting Familiar with SLAM

Programming

  • C++: OOP, Modern C++, Data structures & Algorithms, Compilers, CMake/Makefile/Ninja, Design patterns, OpenCV C++
  • C
  • Git/GitHub
  • OpenCV (opencv-python)
  • Python: Deep learning, Graph plots, System scripts
  • Bash/Linux: ssh, CLI text editor/Vim/tmux
  • Concurrency: SIMD-SSE/AVX/Neon, OpenMP, CUDA
  • Mobile: Android (Java/Kotlin), iOS (Objective-C/Swift)
  • Maths library: Eigen, Ceres-solver/GTSAM/g2o
  • C++/Python interop: PyBind11, nanobind
  • Docker
  • C#: COLMAP, Unity AR, Microsoft HoloLens
  • CI/CD: GitHub Actions, Apache Airflow
  • ROS/ROS2
  • Simulation: Gazebo, Isaac Sim

Image Processing

  • Keypoints → Detector/Descriptor
    • SIFT, FAST, ORB, AKAZE
    • Deep features: R2D2, SuperPoint
  • Image pyramid, oFAST, rBRIEF

Local Feature Matching

  • Brute-Force, FLANN, Kd-Tree
  • LSH, Multi-probe LSH, HBST
  • SuperGlue
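
To make the brute-force option concrete, here is a small sketch of matching binary descriptors (such as ORB's) by Hamming distance with a mutual nearest-neighbour check. The 8-bit descriptors and the names `hamming` and `match_brute_force` are assumptions for illustration; real ORB descriptors are 256 bits.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_brute_force(desc_a, desc_b, max_distance=64):
    """Return (index_a, index_b, distance) for mutual nearest neighbours."""
    def nearest(query, candidates):
        dists = [hamming(query, c) for c in candidates]
        j = min(range(len(candidates)), key=dists.__getitem__)
        return j, dists[j]

    matches = []
    for i, d in enumerate(desc_a):
        j, dist = nearest(d, desc_b)
        i_back, _ = nearest(desc_b[j], desc_a)
        # Cross-check: keep the pair only if each is the other's best match
        if i_back == i and dist <= max_distance:
            matches.append((i, j, dist))
    return matches


# Toy 8-bit descriptors (assumed values)
desc_a = [0b11110000, 0b00001111, 0b10101010]
desc_b = [0b00001110, 0b11110001, 0b01010101]
matches = match_brute_force(desc_a, desc_b, max_distance=2)
print(matches)
```

Brute force costs O(N·M) comparisons. FLANN, Kd-trees and LSH trade exactness for speed on larger descriptor sets.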

Global Feature Matching

  • Bag of Visual Words, NetVLAD
  • Deep image retrieval, Hierarchical localization

Feature Tracking

  • Optical flow, KLT Tracker

Multiple View Geometry

  • 2D-2D correspondence: Essential/Fundamental, Homography
  • 2D-3D correspondence: P3P, PnP, SVD
  • 3D-3D correspondence: ICP

Outlier Rejection

  • RANSAC, PROSAC, M-Estimator, MAXCON, Convex relaxation
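
A minimal RANSAC sketch, fitting a 2D line instead of a camera model to keep it short. The threshold, iteration count and data are assumed values. The same hypothesise-and-verify loop applies when the model is an essential matrix or a PnP pose.

```python
import random


def ransac_line(points, threshold=0.1, iterations=200, seed=0):
    """Fit y = m*x + c while ignoring outliers; returns ((m, c), inliers)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # 1. Hypothesise a model from a minimal sample (two points for a line)
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical sample cannot be written as y = m*x + c
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # 2. Verify: collect the points that agree with the hypothesis
        inliers = [(x, y) for x, y in points if abs(y - (m * x + c)) <= threshold]
        # 3. Keep the hypothesis with the largest consensus set
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (m, c), inliers
    return best_model, best_inliers


# Ten points on y = 2x + 1 plus three gross outliers (assumed data)
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(2.0, 30.0), (5.0, -12.0), (7.0, 40.0)]
model, inliers = ransac_line(points)
print(model, len(inliers))
```

In practice the consensus set is usually refined with a least-squares or M-estimator fit after the loop.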

Least Squares Optimisation

  • Reprojection error, Bundle adjustment
  • Non-linear optimisation, Lie algebra
  • Lie groups: SO(3), SE(3)
  • Gauss-Newton, Levenberg-Marquardt
  • Pose graph optimization
  • Schur complement / Sparsity
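
The Gauss-Newton iteration can be sketched on a one-parameter toy problem, fitting `b` in `y = exp(b*x)`. This is an illustrative assumption, not a bundle-adjustment implementation; in BA the scalars `jtj` and `jtr` become the sparse normal-equation matrix and vector.

```python
import math


def gauss_newton_exponent(xs, ys, b0=0.0, iterations=10):
    """Estimate b in y = exp(b * x) by Gauss-Newton on the squared residuals."""
    b = b0
    for _ in range(iterations):
        jtj = 0.0  # J^T J (a scalar here, a matrix in general)
        jtr = 0.0  # J^T r
        for x, y in zip(xs, ys):
            prediction = math.exp(b * x)
            residual = y - prediction
            jacobian = x * prediction  # d(prediction)/db
            jtj += jacobian * jacobian
            jtr += jacobian * residual
        if jtj == 0.0:
            break
        # Solve the normal equation (J^T J) * delta = J^T r and update
        b += jtr / jtj
    return b


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]  # noise-free data with true b = 0.5
b_est = gauss_newton_exponent(xs, ys)
print(b_est)
```

Levenberg-Marquardt differs by adding a damping term to `jtj` before solving, which shortens the step when the linearisation is poor.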

Motion Model

  • Proprioceptive sensor: IMU, Wheel
  • Odometry (pose)

Observation Model

  • Exteroceptive sensor: Camera, LiDAR
  • Landmark (Map)
  • Joint optimisation, MLE & MAP

Factor Graph Optimisation

Mapping

  • Point cloud, Occupancy grid mapping, TSDF, Surfel, Voxel map
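
For occupancy grid mapping, the per-cell update is commonly written in log-odds form. The sketch below assumes a uniform prior of 0.5 and an inverse sensor model with `p_hit=0.7` and `p_miss=0.4`; both values are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def update_cell(log_odds, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian update of one cell; with a 0.5 prior the prior term vanishes."""
    return log_odds + logit(p_hit if hit else p_miss)


def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 - 1.0 / (1.0 + math.exp(log_odds))


cell = 0.0  # log-odds 0 corresponds to probability 0.5 (unknown)
for hit in [True, True, False, True]:
    cell = update_cell(cell, hit)
p_occupied = probability(cell)
print(round(p_occupied, 3))
```

Working in log-odds turns the Bayes update into an addition, which keeps the map update cheap and numerically stable.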

Sensors

  • Camera device: Wide/telecentric lens, Lens MTF, CCD/CMOS, Rolling/Global shutter, Exposure/ISO, Stereovision, RGB-D, Structured light, Active IR/ToF
  • LiDAR → Visual-LiDAR fusion
  • IMU → VIO
  • RADAR → Sensor fusion, Extended Kalman filter
  • Sonar
  • Multi-sensor calibration: Camera-IMU, Camera-LiDAR

Evaluation

  • Metrics: ATE (Absolute Trajectory Error), RPE (Relative Pose Error)
  • Datasets: KITTI, TUM RGB-D, EuRoC
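
A simplified sketch of the ATE computation: the RMSE of translational differences between estimated and ground-truth positions. It assumes the two trajectories are already time-associated and expressed in the same frame. Benchmark tools typically first align them with a rigid or similarity transform, which is omitted here.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between paired 3D positions."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Assumed toy trajectories, positions in metres
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.0, 0.4)]
ate = ate_rmse(estimated, ground_truth)
print(ate)
```

RPE applies the same idea to relative motions over a fixed interval, which makes it sensitive to local drift where ATE measures global consistency.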

Next Levels

Monocular SLAM · VIO/VINS · Stereo SLAM · Visual-LiDAR Fusion · RGB-D SLAM · Collaborative SLAM · Deep SLAM/Localization


Level 3: Monocular Visual-SLAM

Key Concepts

  • VO vs SLAM — VO is local (no loop closure), SLAM includes global map + loop closure
  • Scale ambiguity — Fundamental limitation of monocular SLAM; absolute scale is unrecoverable from images alone
  • Covisibility graph — Shared map point visibility between keyframes; core data structure in ORB-SLAM
  • Visual Place Recognition (VPR) — Recognising previously visited places for loop closure
  • Self-supervised depth — Learning monocular depth without ground truth (Monodepth2, Godard 2019)

Feature-based SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| Visual Odometry | Nister 2004 | Fundamental matrix, Triangulation, VO (local-only, no loop closure) |
| PTAM | Klein & Murray 2007 | FAST feature, Tracking, Frontend/Backend separation, Parallel threads, Keyframe, Mapping, Bundle adjustment, Manual initialisation |
| Visual SLAM: Why Filter? | Strasdat 2012 | Bundle adjustment, Scale-aware BA, Motion-only BA |
| ORB-SLAM | Mur-Artal 2015 | ORB keypoint, Automatic initialisation (Homography vs Fundamental selection), Tracking thread, Sliding-window BA, Local mapping, Large-scale, Loop closure, Bag of visual words, Global optimisation, Covisibility graph, Map point management (culling, merging) |
| Pop-up SLAM | Yang 2016 | Line/Plane features |
| PL-SLAM | Pumarola 2017 | Point/Line features |
| ORB-SLAM2 | Mur-Artal 2017 | → Stereo SLAM, → RGB-D SLAM |
| CubeSLAM | Yang 2019 | Monocular 3D cuboid detection + SLAM, 9-DoF object representation |
| OpenVSLAM | Sumikura 2019 | |
| Stella-VSLAM (fork) | 2021 | OpenVSLAM successor, license reboot |
| UcoSLAM | Munoz-Salinas 2019 | Fiducial markers |
| DeepFusion | Laidlow 2019 | |
| ORB-SLAM3 | Campos 2020 | Monocular + Stereo + VIO, Multi-map, IMU integration |
| DXSLAM | Li 2020 | Deep features for SLAM |
| PyCuVSLAM | NVIDIA 2026 | Python + CUDA GPU-accelerated VSLAM toolkit (cuVSLAM wrapper) |

Direct SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| DTAM | Newcombe 2011 | Dense mapping, Keyframe mapping, GPGPU |
| LSD-SLAM | Engel 2014 | Photometric error minimisation, High gradient pixels/edges, Large scale, Loop closure, Pose graph optimisation |
| DSO | Engel 2016 | Photometric bundle adjustment, Sliding window BA, No loop closure/global optimisation |
| LDSO | Gao 2018 | DSO + Loop closure (BoW-based), addresses DSO's main weakness |
| CNN-SLAM | Tateno 2017 | Depth from LSD-SLAM + deep depth, Semantic label |
| DVSO | Yang 2018 | Deep single image depth estimation, StackNet |
| Basalt | Usenko 2020 | Non-linear recovery (→ primarily VIO, see Level 6) |
| D3VO | Yang 2020 | Deep single image depth estimation, Deep pose, Deep aleatoric uncertainty |

Hybrid (Feature + Direct)

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| SVO | Forster 2014 | FAST feature detection, Direct-based feature tracking, Bundle adjustment |
| SVO2 | Forster 2017 | Multi-camera/Fisheye, Probabilistic depth estimation, Direct method convergence, Sparse method |
| Stereo DSO | Wang 2017 | → Stereo SLAM |
| VI-DSO | Gao 2018 | → VIO/VINS |

Learning-based SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| DROID-SLAM | Teed 2021 | Differentiable BA, dense optical flow, end-to-end learned |
| TartanVO | Wang 2021 | Generalizable visual odometry |
| DPV-SLAM / DPVO | Teed 2023 | DROID-SLAM lightweight, patch-based visual odometry |
| MAC-VO | Qu 2024 | Learning-based VO, metric-aware |
| VoT | Yugay 2025 | Visual Odometry with Transformers |

Foundation Model SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| DUSt3R | Wang 2024 | Pointmap regression from image pairs, no calibration needed |
| MASt3R | Leroy 2024 | DUSt3R + local feature matching |
| MASt3R-SLAM | Murai 2024 | Real-time dense SLAM from MASt3R |
| VGGT | Wang (Meta) 2025 | Feed-forward inference of poses, depths, pointmaps, tracks from N views (CVPR 2025 Best Paper) |
| VGGT-SLAM | 2025 | VGGT as frontend for real-time SLAM |
| VGGT-SLAM 2.0 | 2026 | Improved VGGT-SLAM |
| VGGT-Geo | 2026 | Probabilistic geometric fusion of VGGT priors for dense indoor SLAM |
| IGGT | Li 2026 | VGGT + VLM — language-grounded 3D geometry |
| AMB3R | Wang 2025 | MASt3R frontend + Transformer backend for SfM/SLAM |
| MASt3R-Fusion | WHU 2025 | MASt3R-SLAM + IMU + GNSS fusion |

SfM Tools

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| InstantSfM | 2025 | GPU-accelerated SfM pipeline, 40× faster than COLMAP |

Neural Representation SLAM

NeRF-based

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| iMAP | Sucar 2021 | First NeRF-SLAM, single MLP, real-time tracking/mapping |
| BARF | Lin 2021 | Bundle-Adjusting NeRF, coarse-to-fine positional encoding, joint pose+NeRF opt (not full SLAM — pose+NeRF co-optimization) |
| NICE-SLAM | Zhu & Peng 2022 | Hierarchical feature grid (coarse/mid/fine), scalable |
| Co-SLAM | Wang 2023 | Hash grid (Instant-NGP) + coordinate encoding, 5-10× faster than NICE-SLAM |
| ESLAM | Johari 2023 | Tri-plane representation, O(N²) vs O(N³) memory |
| Point-SLAM | Sandström 2023 | Neural point cloud based |
| NeRF-SLAM | Rosinol 2023 | NeRF + classical SLAM pipeline |
| NICER-SLAM | Zhu 2024 | RGB-only NeRF-SLAM (no depth sensor), monocular depth integration |
| vMAP | Kong 2023 | Object-level NeRF-SLAM, per-object neural fields |
| GO-SLAM | Zhang 2023 | Global optimization + NeRF-SLAM, loop closure + global BA |

3DGS-based

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| SplaTAM | Keetha 2024 | First 3DGS-SLAM, RGB-D, silhouette-guided densification |
| MonoGS | Matsuki 2024 | Monocular 3DGS-SLAM, depth network + triangulation fusion |
| GS-ICP SLAM | Yu 2024 | Gaussian-to-Gaussian ICP (Mahalanobis distance), geometric tracking |
| Photo-SLAM | Huang 2024 | Explicit geometry + implicit appearance (MLP color), anti-aliasing |
| RTG-SLAM | 2024 | Real-time focus, adaptive Gaussian budget, Jetson Orin 25 FPS |
| EGG-Fusion | ZJU 2025 | Gaussian surfel fusion, information-filter-based, real-time 24 FPS |
| Online-Mono-3DGS (MODP) | 2025 | ORB-SLAM3 tracking + Hierarchical Gaussian Management |
| ActiveSplat | Li 2025 | Active mapping with 3DGS + Voronoi-based path planning |
| Open-S3SLAM | 2026 | Open-set semantic 3DGS SLAM for smartphones (ICRA 2026) |
| LEGS | 2025 | Language Embedded Gaussian Splats, real-time language-queryable 3D |

Semantic / Language-Grounded SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| ConceptFusion | Jatavallabhula (MIT) 2023 | CLIP features fused into 3D map, open-vocabulary language queries |
| LERF | Kerr 2023 | Language Embedded Radiance Fields, DINO multi-scale, NeRF + CLIP |
| OpenScene | Peng (ETH) 2023 | Language features back-projected to 3D point clouds |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM spatial relations |
| SpatialLLM | Mao 2025 | Point cloud → LLM, structured indoor modeling as Python scripts |

Also see: LEGS, Open-S3SLAM (3DGS-based section above); Open-YOLO 3D (Level 5 Object Detection)


Level 4: RGB-D Visual-SLAM

RGB-D Camera Devices

  • Intel RealSense D series
  • Microsoft Kinect v1/v2
  • Azure Kinect DK
  • Occipital Structure Core
  • Orbbec Astra

GPGPU Programming

  • CUDA, OpenGL GLSL

Systems

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| ICP | Besl & McKay 1992 | |
| DTAM | Newcombe 2011 | |
| KinectFusion | Newcombe 2011 | GPGPU, Tracking (project depth → 3D, surface normal, coarse-to-fine ICP), Mapping (volumetric integration, TSDF), Robust to small scene changes, Cannot model deformation, Map growth cubic, Room-size only |
| Double Window Optimisation | Strasdat 2011 | |
| Kintinuous | Whelan 2012 | Volume shift, Geometric, Photometric, DBoW+SURF, Optimisation, Loop closure |
| RGBD-SLAM-V2 | Endres 2013 | Tracking (colour image, visual features, depth image, point cloud, transformation), Mapping (OctoMap 2013) |
| SLAM++ | Salas-Moreno 2013 | Object-oriented SLAM |
| DVO | Kerl 2013 | Keyframe, Depth, Direct method, Optimisation, Loop closure |
| RTAB-Map | Labbé 2014 | Loop closure, Map merge, Multi-session memory management |
| MRS-Map | Stuckler 2014 | |
| ElasticFusion | Whelan 2015 | Active: frame-to-model tracking (photometric + geometric), joint optimisation, fused surfel-based model reconstruction · Inactive: local loop closure (model-to-model local surface, submodel separation), global loop closure (randomised fern encoding, non-rigid space deformation) |
| DynamicFusion | Newcombe 2015 | 6D motion field, Deformable scene |
| ORB-SLAM2 | Mur-Artal 2016 | Bundle adjustment, Sparse reconstruction |
| BundleFusion | Dai 2016 | Local-to-global optimisation, Sparse RGB feature, Coarse global pose estimation, Fine pose refinement (geometric + photometric) |
| SemanticFusion | McCormac 2016 | Deep Learning CNN, Deep Semantic SLAM |
| InfiniTAM v3 | Prisacariu 2017 | Tracking (scene raycast, depth image, RGB image), Relocalisation (random ferns), Mapping (TSDF reconstruction, voxel hashing, surfel reconstruction) |
| Fusion++ | McCormac & Clark 2018 | Deep Learning CNN, Mask-RCNN instance segmentation, Object-level SLAM, No prior, Object-level TSDF reconstruction |
| PointFusion / DenseFusion | Xu 2018 / Wang 2019 | RGB-D object pose estimation, Tracking, Relocalisation, Loop closure detection |
| BAD SLAM | Schops 2019 | Direct bundle adjustment, Deep Semantic SLAM |
| RTAB-Map v2 | Labbé 2019 | RGB-D/LiDAR, Light-source detection (2016) |
| MoreFusion | Wada & Sucar 2020 | DL instance segmentation, Object-level volumetric fusion, Volumetric pose prediction, 3D scene reconstruction, Collision-based refinement, Semantic SLAM, Object pose estimation, CAD object fitting |
| NodeSLAM | Sucar & Wada 2020 | Occupancy VAE, Object-level SLAM (→ also in Level 5 Latent Representation) |
| Kimera / 3D Dynamic Scene Graph | Rosinol 2020 | Kimera-VIO, Kimera-Mesher, Kimera-PGMO, Kimera-Semantics, Kimera-DSG |
| DSP-SLAM | Wang (UCL) 2021 | DeepSDF shape prior + ORB-SLAM2, object-level dense reconstruction (mono/stereo/LiDAR) |

Level 5: Applying Deep Learning

Level 5 is organized into four pillars:
A. Frontend — learned perception components replacing hand-crafted modules
B. Backend — learned/certifiable optimization replacing classical solvers
C. Systems — end-to-end deep VO/SLAM pipelines
D. Scene Understanding — semantic, language, and relational reasoning on SLAM maps


A. Deep Frontend — Perception

Feature Detection & Matching

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| NetVLAD | Arandjelovic 2016 | VLAD, place recognition |
| SuperPoint | DeTone 2017 | Homographic Adaptation, Self-supervised, VGG encoder + detector/descriptor heads |
| HardNet | Mishchuk 2017 | Learned local descriptor |
| R2D2 | Revaud 2019 | Repeatable + Reliable detector/descriptor, explicit repeatability/reliability maps |
| KeyNet | Barroso-Laguna 2019 | Learned keypoint detector |
| HF-Net | Sarlin 2019 | Global feature, Local feature, Visual localization |
| SuperGlue | Sarlin 2020 | Self/Cross-attention GNN, Sinkhorn optimal assignment, dustbin for outliers |
| DISK | Tyszkiewicz 2020 | Policy gradient (RL) training, match success/failure as reward |
| Patch NetVLAD | Hausler 2021 | Multi-scale patch-level VLAD |
| LoFTR | Sun 2021 | Detector-free, Transformer coarse-to-fine dense matching |
| LightGlue | Lindenberger 2023 | Adaptive depth/width, 5-10× faster than SuperGlue |
| XFeat | Potje 2024 | 0.3M params, 1400 FPS (RTX 4090), 64-dim descriptor, embedded-friendly |
| RoMA | Edstedt 2024 | DINOv2 foundation feature + coarse-to-fine dense matching |
| DeDoDe | Edstedt 2024 | Joint detect-and-describe in one stage |
| RoMA V2 | Edstedt 2026 | Improved RoMA |

Depth Estimation

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| MonoDepth | Godard 2016 | Left-Right photometric consistency, self-supervised |
| MiDaS | Ranftl 2020 | Multi-dataset mixing, scale-and-shift invariant loss, relative depth |
| DPT | Ranftl 2021 | Dense Prediction Transformer (ViT backbone), global context |
| ZoeDepth | Bhat 2023 | Zero-shot metric depth, Metric Bins Module |
| Metric3D | Yin 2023 | Camera intrinsic-conditioned metric depth, Canonical Camera Space |
| Depth Anything | Yang 2024 | 62M images, foundation model for monocular depth |
| Depth Anything V2 | Yang 2024 | Improved with synthetic data, better edge preservation |
| Marigold | Ke 2024 | Stable Diffusion for depth, fine detail, uncertainty via sampling |
| Align3r | Melou 2025 | Video temporal consistency, DUSt3R-based, CVPR 2025 Highlight |
| Masked Depth Modeling (LingBot-Depth) | 2026 | Fixes RGB-D failures on glass/mirrors/metal |

Optical Flow & Scene Flow

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| FlowNet | Dosovitskiy 2015 | First end-to-end deep optical flow (FlowNetS "Simple" / FlowNetC "Corr" variants) |
| FlowNet 2.0 | Ilg 2017 | Stacked networks, classical-level accuracy |
| PWC-Net | Sun 2018 | Pyramid-Warping-Cost volume, coarse-to-fine, 8.4M params |
| FlowNet3D | Liu 2019 | Point cloud scene flow, PointNet++ based |
| RAFT | Teed 2020 | All-Pairs Correlation + iterative ConvGRU update, ECCV Best Paper |
| RAFT-3D | Teed 2021 | Scene flow (3D motion) from RAFT |
| FlowFormer | Huang 2022 | Transformer on cost volume tokens, global context |
| SEA-RAFT | 2024 | Efficient RAFT variant for real-time |

Camera Pose Regression & Relocalization

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| PoseNet | Kendall 2015 | CNN-based 6-DoF pose regression (APR), GoogLeNet backbone |
| DSAC | Brachmann 2017 | Differentiable RANSAC, Scene Coordinate Regression (SCR) |
| DSAC++ | Brachmann 2018 | Self-supervision, RGB-D support |
| CNN Pose Regression Limitations | Sattler 2019 | Pose regression ≈ image retrieval performance |
| LM-Reloc | von Stumberg 2020 | Deep direct relocalization |
| DSAC* | Brachmann 2021 | Improved learning stability |
| ACE | Brachmann 2023 | Accelerated Coordinate Encoding, 5-min training per scene |
| ACE Zero | Brachmann 2024 | Zero-shot SCR, no pre-built 3D map needed |
| ACE-G | Brachmann 2024 | Generalizable SCR via cross-attention, new scenes without fine-tuning |
| ACE-SLAM | Tang 2024 | Neural implicit real-time SLAM, network weights = map |
| hloc | Sarlin 2019+ | Hierarchical Localization: coarse (NetVLAD) → fine (SuperGlue) pipeline |

Object Detection & Segmentation for SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| YOLO (v1→v11) | Redmon 2016→2024 | Real-time object detection, Ultralytics ecosystem |
| DETR | Carion 2020 | Transformer detection, anchor-free, no NMS |
| RT-DETR | Lv (Baidu) 2023 | Real-time DETR, YOLO-speed + Transformer quality |
| SAM | Kirillov 2023 | Segment Anything, prompt-based, Foundation Model |
| SAM 2 | Meta 2024 | Video segmentation, Memory Attention, temporal consistency |
| Grounding DINO | Liu 2023 | Text-prompted detection → SAM pipeline (Grounded SAM) |
| Open-YOLO 3D | Benseddik 2025 | 2D open-vocab detection → 3D instance seg, 16× faster |

B. Deep Backend — Optimization

Differentiable Bundle Adjustment

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| BA-Net | Tang 2019 | FPN + differentiable LM layer, end-to-end SfM (ICLR) |
| DROID-SLAM | Teed 2021 | Dense optical flow + differentiable dense BA, all-pixels reprojection |
| DPVO | Teed 2023 | Patch-based DROID-SLAM, 30+ FPS real-time |
| Theseus | Pineda (Meta) 2022 | Differentiable nonlinear optimization library (PyTorch) |
| Lietorch | Teed 2021 | Lie group operations for PyTorch (SE(3)/SO(3)) |

Certifiably Optimal Algorithms

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| SE-Sync | Rosen 2019 | Certifiable pose graph optimization via SDP + Riemannian opt |
| TEASER++ | Yang 2020 | Point cloud registration, 90%+ outlier robust, TLS + Max Clique (T-RO/RSS 2020) |
| GNC | Yang 2020 | Graduated Non-Convexity, continuation from convex → robust cost |
| QUASAR | Yang 2019 | Certifiable rotation search (Wahba problem with outliers), SDP relaxation + robust cost |

Gaussian Belief Propagation & Graph Processors

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| FutureMapping 1 | Davison 2018 | Computational structure of Spatial AI, GBP for SLAM |
| FutureMapping 2 | Davison & Ortiz 2019 | GBP as core Spatial AI primitive, visual intro to GBP |
| BA on Graph Processor | Ortiz 2020 | Bundle Adjustment on Graphcore IPU, tile-based parallelism |
| DANCeRS | 2023 | GBP-based distributed consensus in robot swarms |

C. End-to-End Deep VO / SLAM Systems

Self-supervised & Learned VO

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| DeepVO | Wang 2017 | Supervised learning |
| SfM-Learner | Zhou 2017 | Unsupervised, deep depth + deep pose |
| DeMoN | Ummenhofer 2017 | Depth + Motion from two frames, encoder-decoder |
| UndeepVO | Li 2018 | Stereo self-supervised, absolute scale recovery |
| DeepTAM | Zhou 2018 | Deep tracking and mapping, cost volume based |
| DeepV2D | Teed 2018 | Iterative depth from video, differentiable geometry layers |
| Depth from Video in the Wild | Gordon 2019 | Unconstrained video depth, learned camera intrinsics |
| Neural Ray Surfaces | Vasiljevic 2020 | Learned ray surface model, non-pinhole cameras |
| GradSLAM | Murthy 2020 | Differentiable SLAM framework (PyTorch, supports multiple SLAM backends) |
| DeepSLAM | Wang 2020 | TrackingNet, MappingNet, LoopNet |
| MonoRec | Wimbauer 2021 | Self-supervised monocular 3D reconstruction, moving objects |
| TANDEM | Koestler 2021 | Real-time tracking + dense mapping via MVS depth, DSO-based |
| DROID-SLAM | Teed 2021 | Dense BA + correlation, SOTA on TartanAir/EuRoC (→ see Differentiable BA) |
| DPVO | Teed 2023 | Patch-based lightweight DROID (→ see Differentiable BA) |

Latent Representation SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| CodeSLAM | Bloesch 2018 | Depth as 128-dim latent code, photometric BA on codes + poses |
| SceneCode | Zhi 2019 | Depth + semantic in single latent code, cross-modal constraints |
| DeepFactors | Czarnowski 2020 | Probabilistic depth codes + factor graph, GPU 30+ FPS |
| NodeSLAM | Sucar 2020 | Object-level DeepSDF codes, occupancy VAE per object |
| CodeMapping | Shao 2021 | Sparse SLAM + learned dense mapping, hybrid approach |

Neural Rendering (reference)

NeRF/3DGS-based SLAM systems → see Level 3: Neural Representation SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| NeRF | Mildenhall 2020 | Neural Radiance Fields, novel view synthesis (foundational) |
| DIFIX3D+ | 2026 | Single-step diffusion for 3D reconstruction artifact removal (post-processing) |

D. Scene Understanding

Benchmarks & Foundations

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| EFM3D | Straub (Meta) 2024 | Egocentric Foundation Model 3D benchmark, depth/surface/semantic from ego-video |

3D Scene Graph

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| Hydra | Hughes (MIT SPARK) 2022 | Real-time hierarchical Scene Graph (mesh→objects→places→rooms→buildings) |
| Hydra-Multi | Hughes 2023 | Distributed multi-robot 3D Scene Graph |
| Clio | Maggio (MIT SPARK) 2024 | Open-set task-driven Scene Graph, CLIP embeddings per node |
| Khronos | Schmid (MIT SPARK) 2024 | Spatio-temporal Scene Graph, dynamic object history tracking |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM relations (→ also in Level 3 Semantic) |

Level 6: VIO / VINS

Key Concepts

  • Tightly-coupled vs Loosely-coupled — Joint vs separate optimization of visual and inertial measurements
  • Filter-based vs Optimization-based — EKF approaches vs nonlinear optimization (BA)
  • IMU preintegration — On-manifold IMU integration between keyframes (Forster 2015)
  • IMU noise model — Bias, random walk, Allan variance
  • Observability — Yaw and global position are unobservable in VIO
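
To see why the IMU bias terms matter, the sketch below dead-reckons a stationary 1D accelerometer with a constant bias. The bias value, time step and function name are assumptions. The drift grows roughly as ½·b·t², which is one reason visual measurements are fused to bound it.

```python
def integrate_stationary_imu(accel_bias, duration_s, dt=0.01):
    """Dead-reckon a stationary 1D IMU whose accelerometer has a constant bias."""
    velocity = 0.0
    position = 0.0
    steps = int(round(duration_s / dt))
    for _ in range(steps):
        measured_accel = 0.0 + accel_bias  # the true acceleration is zero
        velocity += measured_accel * dt    # first integration: velocity
        position += velocity * dt          # second integration: position
    return position


# Assumed bias of 0.05 m/s^2 integrated for 10 s
drift = integrate_stationary_imu(accel_bias=0.05, duration_s=10.0)
print(f"position drift after 10 s: {drift:.2f} m")
```

Real IMU preintegration works on SO(3) with gyroscope measurements and noise propagation; this scalar example only shows the error growth.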

Foundations

| Resource | Author/Year | Key Concepts |
| --- | --- | --- |
| 📖 Introduction to Inertial Navigation | Woodman 2007 | IMU fundamentals, coordinate frames, error sources — essential prerequisite |
| IMU Preintegration on Manifold | Forster 2015 | On-manifold preintegration, bias correction without re-integration |
| Quaternion kinematics for error-state KF | Sola 2017 | Quaternion math, error-state formulation |

Filter-based

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| MSCKF | Mourikis 2007 | Multi-State Constraint KF, efficient VIO without landmarks in state |
| ROVIO | Bloesch 2015 | Robocentric VIO, direct photometric tracking + EKF |
| OpenVINS | Geneva 2020 | Open-source MSCKF, modular, extensible |

Optimization-based

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| OKVIS | Leutenegger 2015 | Keyframe-based, tightly-coupled, sliding window optimization |
| VINS-Mono | Qin 2018 | Tightly-coupled, relocalization, loop closure, pose graph optimization |
| VINS-Fusion | Qin 2019 | Stereo + GPS fusion extension |
| MAPLAB | Schneider 2018 | Multi-session visual-inertial mapping framework |
| Kimera-VIO | Rosinol 2020 | Fast VIO frontend for Kimera pipeline, structureless vision factors |
| Basalt | Usenko 2020 | Non-linear recovery, visual-inertial odometry + mapping |
| ORB-SLAM3 | Campos 2020 | VIO mode, multi-map, IMU initialization |
| DM-VIO | von Stumberg 2022 | Deep monocular VIO, delayed marginalization |
| OKVIS2 | Leutenegger 2022 | Multi-session, improved marginalization |
| AirVO | Xu 2023 | Point-line VIO, illumination-robust |
| OKVIS2-X | Boche & Leutenegger 2025 | Multi-sensor SLAM (Visual+Inertial+Depth+LiDAR+GNSS), dense volumetric occupancy maps, submapping for large-scale (9km+), EuRoC/Hilti22 SOTA |

Level 7: World Models & Spatial AI

World Models

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| GAIA-1 | Wayve 2023 | Driving World Model, action-conditioned future scene generation |
| Sora / DiT | OpenAI 2024 | Diffusion Transformer, spacetime patches, emergent 3D understanding |
| NVIDIA Cosmos | NVIDIA 2026 | World Foundation Model platform for Physical AI, synthetic data for AV/robots |
| World Labs / Marble | Fei-Fei Li 2026 | 3D world generation from images/video/text ($1B funding) |
| WorldVLA | Alibaba 2025 | Autoregressive action world model, learns physics for action generation |
| SceneDINO | 2025 | Feed-forward unsupervised semantic scene completion |

Generative 3D

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| DreamFusion | Poole 2023 | Text-to-3D via Score Distillation Sampling (SDS) + NeRF |

Vision-Language Models (VLM)

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| CLIP | Radford (OpenAI) 2021 | Contrastive image-text pretraining, 400M pairs, zero-shot |
| SigLIP | Zhai (Google) 2023 | Sigmoid loss CLIP, more efficient, better at small model sizes |
| BLIP-2 | Li (Salesforce) 2023 | Q-Former bridges frozen LLM + image encoder |
| LLaVA | Liu 2023 | LLaMA + vision, conversational VLM |

Vision-Language-Action Models (VLA)

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| RT-2 | Brohan (DeepMind) 2023 | Robot actions as text tokens, emergent generalization |
| OpenVLA | Kim 2024 | Open-source VLA, SigLIP + Llama 7B + Action Head |
| Navila | 2024 | Navigation-specialized VLA, SLAM integration for localization |

Resources

| Resource | Key Concepts |
| --- | --- |
| Awesome-Transformer-based-SLAM | Curated GitHub list of Transformer-based SLAM methods |

Level 8: Stereo SLAM

Key Concepts

  • Stereo rectification — Epipolar alignment for efficient disparity search
  • Disparity vs Depth — d = f·B/Z, baseline determines depth range/accuracy
  • Scale observability — Stereo provides metric scale (unlike monocular)
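
The relation d = f·B/Z can be inverted to recover depth, and differentiating it shows why accuracy degrades with range. A small sketch, where the focal length, baseline and function names are assumed example values:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty: |dZ| ~ Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Assumed rig: 700 px focal length, 12 cm baseline
z = depth_from_disparity(disparity_px=42.0, focal_px=700.0, baseline_m=0.12)
err_near = depth_error(2.0, 700.0, 0.12)
err_far = depth_error(20.0, 700.0, 0.12)
print(z, err_near, err_far)
```

Because the error grows with Z², a tenfold increase in range gives a hundredfold increase in depth uncertainty for the same disparity error. A wider baseline reduces it proportionally.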

Systems

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| S-PTAM | Pire 2017 | Stereo PTAM, ROS-compatible, real-time |
| ORB-SLAM2 (stereo) | Mur-Artal 2016 | Stereo + RGB-D modes, loop closure, relocalization |
| StereoMSCKF | Sun 2018 | MSCKF with stereo, efficient for resource-constrained platforms |
| RTAB-Map | Labbé 2019 | Multi-sensor (stereo/RGB-D/LiDAR), memory management, large-scale |
| ORB-SLAM3 (stereo) | Campos 2020 | Multi-map, Atlas, stereo + IMU |
| Stella-VSLAM | Community 2022 | Open-source fork of OpenVSLAM, stereo support |
| LDSO | Gao 2018 | Direct sparse odometry with loop closure (DSO extension; monocular, see Level 3 Direct SLAM) |

Level 9: Collaborative / Multi-Robot SLAM

Key Concepts

  • Centralized vs Decentralized — Single server vs peer-to-peer map merging
  • Inter-robot loop closure — Place recognition across robots with different viewpoints
  • Communication constraints — Bandwidth-limited map sharing, sparse descriptors
  • Map merging — Aligning submaps from different robots into a global map

Systems

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| C2TAM | Riazuelo 2014 | Cloud-based collaborative monocular SLAM |
| CCM-SLAM | Schmuck & Chli 2019 | Centralized collaborative monocular SLAM, robust to comm failures |
| DOOR-SLAM | Lajoie 2020 | Distributed, outlier-resilient SLAM with pairwise consistency |
| Kimera-Multi | Tian 2022 | Distributed multi-robot metric-semantic SLAM, mesh reconstruction |
| Swarm-SLAM | Lajoie 2024 | Decentralized, sparse, scalable C-SLAM, supports LiDAR/stereo/RGB-D |
| CoPeD-Advancing | Stathoulopoulos 2024 | Multi-robot collaborative perception for autonomous exploration |
| MAPLAB 2.0 | Cramariuc 2023 | Multi-session, multi-robot visual-inertial mapping |

Level 10: LiDAR & Visual-LiDAR Fusion SLAM

Key Concepts

  • LiDAR-Visual-Inertial (LVI) — Triple fusion for robust outdoor SLAM
  • Tightly-coupled LiDAR-camera — Joint optimization of point cloud and visual features
  • Direct LiDAR-camera alignment — Photometric/geometric alignment without feature extraction
  • Degradation handling — Graceful fallback when one modality fails (e.g., LiDAR in rain, camera in darkness)
  • Range image — 2D projection of LiDAR scans for efficient processing (SuMa, RangeNet++)
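
A range image is typically built by spherical projection of each LiDAR point. The sketch below uses an image size and vertical field of view that are assumptions roughly matching a 64-beam sensor; actual values depend on the device.

```python
import math


def project_to_range_image(point, width=1024, height=64,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map a 3D LiDAR point (x, y, z) to (row, col, range) in a range image."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point at the sensor origin has no direction")
    yaw = math.atan2(y, x)    # horizontal angle
    pitch = math.asin(z / r)  # vertical angle
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    # Normalise angles to [0, 1] and scale to image coordinates
    u = 0.5 * (1.0 - yaw / math.pi) * width
    v = (1.0 - (pitch - fov_down) / fov) * height
    col = min(width - 1, max(0, int(math.floor(u))))
    row = min(height - 1, max(0, int(math.floor(v))))
    return row, col, r


# A point 10 m straight ahead of the sensor
row, col, rng = project_to_range_image((10.0, 0.0, 0.0))
print(row, col, rng)
```

Once points live on a 2D grid, neighbourhood lookups and 2D convolutions become cheap, which is what the projective ICP and segmentation approaches named above rely on.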

LiDAR / LiDAR-Inertial SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| LOAM | Zhang 2014 | LiDAR odometry and mapping (foundational), edge + planar features |
| SuMa | Behley (Bonn) 2018 | Surfel-based LiDAR SLAM, projective ICP on range images |
| SuMa++ | Chen (Bonn) 2019 | SuMa + RangeNet++ semantics, semantic ICP weighting, dynamic object filtering |
| LIO-SAM | Shan 2020 | Tightly-coupled LiDAR-inertial, factor graph, GPS fusion |
| FAST-LIO2 | Xu 2022 | Direct LiDAR-inertial, ikd-Tree, extremely fast |
| PIN-SLAM | Pan (Bonn) 2024 | Neural point cloud LiDAR SLAM, point-to-SDF registration, elastic map deformation for loop closure |

Visual-LiDAR Fusion SLAM

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| LVI-SAM | Shan 2021 | LiDAR-Visual-Inertial via factor graph, LIO-SAM + VINS-Mono |
| R3LIVE | Lin 2022 | Real-time LiDAR-Visual-Inertial, dense RGB point cloud map |
| R3LIVE++ | Lin 2023 | Improved R3LIVE with mesh reconstruction |
| FAST-LIVO | Zheng 2022 | FAST-LIO + direct visual odometry, tightly-coupled LVI |
| FAST-LIVO2 | Zheng 2024 | Improved, sequential image processing, direct photometric fusion |
| OKVIS2-X | Boche 2025 | Visual+Inertial+Depth+LiDAR+GNSS configurable (also in Level 6) |

Resources

| Resource | Key Concepts |
| --- | --- |
| LiDAR-Visual-Inertial Survey (Zheng 2024) | Comprehensive survey of LVI SLAM systems |

Level 11: Event Camera SLAM

Key Concepts

  • Event cameras (DVS) — Asynchronous per-pixel brightness change detection, μs temporal resolution
  • Advantages — HDR (140dB+), no motion blur, low latency, low power
  • Challenges — No absolute intensity, sparse asynchronous output, requires new algorithms
  • Event representations — Event frames, time surfaces, voxel grids, spike tensors
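
Two of the representations above, the event frame and the time surface, can be sketched in a few lines. Events are assumed to be `(t, x, y, polarity)` tuples, and the decay constant `tau` is an illustrative choice.

```python
import math


def event_frame(events, width, height):
    """Accumulate signed polarities per pixel over the whole event batch."""
    frame = [[0] * width for _ in range(height)]
    for _, x, y, polarity in events:
        frame[y][x] += 1 if polarity else -1
    return frame


def time_surface(events, width, height, t_ref, tau=0.03):
    """Exponentially decayed timestamp of the most recent event at each pixel."""
    last = [[None] * width for _ in range(height)]
    for t, x, y, _ in events:
        if t <= t_ref and (last[y][x] is None or t > last[y][x]):
            last[y][x] = t
    return [
        [0.0 if t is None else math.exp(-(t_ref - t) / tau) for t in row]
        for row in last
    ]


# Assumed toy stream on a 3x2 sensor: (timestamp [s], x, y, polarity)
events = [(0.000, 1, 0, True), (0.010, 1, 0, True),
          (0.020, 2, 1, False), (0.030, 0, 1, True)]
frame = event_frame(events, width=3, height=2)
surface = time_surface(events, width=3, height=2, t_ref=0.030)
print(frame)
```

The event frame discards timing within the batch, while the time surface keeps recency information at the cost of polarity. Voxel grids retain both by binning events along the time axis.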

Foundations

| Resource | Author/Year | Key Concepts |
| --- | --- | --- |
| 📖 Event-based Vision Survey | Gallego 2020 | Comprehensive survey of event camera algorithms |
| Awesome-Event-based-SLAM | KwanWaiPang | Curated GitHub list of event-based SLAM papers |

Systems

| System | Author/Year | Key Concepts |
| --- | --- | --- |
| EVO | Rebecq 2017 | Event-based Visual Odometry, 3D reconstruction from events |
| ESVO | Zhou 2021 | Event-based Stereo Visual Odometry |
| Ultimate-SLAM | Vidal 2018 | Events + frames + IMU fusion |
| EKLT | Gehrig 2020 | Event-based KLT feature tracking |
| ESVIO | Chen 2023 | Event-based Stereo VIO |
| EDS | Hidalgo-Carrió 2022 | Event-aided direct sparse odometry |
| DEVO | Pellerito 2024 | Deep event-based visual odometry (DROID-SLAM style) |
| VIO-GO | 2025 | Event-based VIO with optimized parameters for HDR scenarios |
