📘 Study Roadmap : Visual-SLAM (Beginner → Master)
Created: 2026-02-22
Level 1: Beginner
Programming
- C++: Pointer, OOP
- Python
- Bash/Linux: Basic terminal usage
Mathematics
- Basic Probability & Statistics: Gaussian distribution, Bayes' theorem
- Basic Linear Algebra: Vectors & Matrices, Determinant, Dot & Cross product, Rank, Inverse matrix, Transpose matrix, SVD, Eigenvalues/Eigenvectors
- Logarithm & Exponential
- Basic Calculus: Differentiation, Taylor expansion
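A quick numpy sketch of two items from the linear-algebra list (SVD and eigenvalues) — a toy 2×2 matrix, assumed purely for illustration:

```python
import numpy as np

# SVD decomposes any matrix as A = U @ diag(S) @ Vt.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, S, Vt = np.linalg.svd(A)
A_rec = U @ np.diag(S) @ Vt
assert np.allclose(A, A_rec)           # reconstruction is exact

# For a symmetric matrix, singular values equal |eigenvalues|.
eigvals = np.linalg.eigvalsh(A)        # ascending: [2., 4.]
assert np.allclose(sorted(S), sorted(np.abs(eigvals)))
```

Playing with decompositions like this in numpy is a low-cost way to build the intuition the later optimisation levels rely on.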
Projective Geometry
- Pinhole camera model → Image projection
- Camera calibration: Intrinsic/Extrinsic parameters, Lens distortion
- Rigid body motion: Euler/Quaternion/Rotation Matrix, Projective space & Vanishing point, Homogeneous transformation
- Epipolar geometry → Essential & Fundamental matrix
- Triangulation
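The pinhole model above reduces to one matrix product. A minimal sketch with assumed intrinsics (fx=fy=500, cx=320, cy=240) and identity camera pose:

```python
import numpy as np

# Pinhole projection: uv_h = K @ X (camera frame), then divide by depth.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])   # intrinsics: fx, fy, cx, cy
X = np.array([0.2, -0.1, 2.0])          # 3D point in the camera frame
uv_h = K @ X
uv = uv_h[:2] / uv_h[2]                 # perspective division by Z
print(uv)                               # → [370. 215.]
```

With a non-trivial pose, `X` is first transformed by the extrinsics `[R | t]`; lens distortion is applied after the perspective division.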
Camera Device
- Lens, Sensor, Resolution/ISO/Aperture
Image Data
- Colour image, Resolution, Grayscale image
- Thresholding, Gaussian blur
- Corner detector: Harris corner
- Edge detector: Sobel & Canny Edge
- Stereovision, RGB-D, Disparity, Depth
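Edge detection from the list above is just a small convolution. A self-contained Sobel sketch in plain numpy (a loop implementation for clarity, not speed — OpenCV's `Sobel` is the practical tool):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy)

# A vertical step edge gives the strongest response along the step.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
mag = sobel_magnitude(img)
print(mag.max())   # → 4.0
```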
Level 2: Getting Familiar with SLAM
Programming
- C++: OOP, Modern C++, Data structures & Algorithms, Compilers, CMake/Makefile/Ninja, Design patterns, OpenCV C++
- C
- Git/GitHub
- OpenCV (opencv-python)
- Python: Deep learning, Graph plots, System scripts
- Bash/Linux: ssh, CLI text editor/Vim/tmux
- Concurrency: SIMD-SSE/AVX/Neon, OpenMP, CUDA
- Mobile: Android (Java/Kotlin), iOS (Objective-C/Swift)
- Maths library: Eigen, Ceres-solver/GTSAM/g2o
- C++/Python interop: PyBind11, nanobind
- Docker
- C#: COLMAP, Unity AR, Microsoft Hololens
- CI/CD: GitHub Actions, Apache Airflow
- ROS/ROS2
- Simulation: Gazebo, Isaac Sim
Image Processing
- Keypoints → Detector/Descriptor
- SIFT, FAST, ORB, AKAZE
- Deep features: R2D2, Superpoint
- Image pyramid, oFAST, rBRIEF
Local Feature Matching
- Brute-Force, FLANN, Kd-Tree
- LSH, Multi-probe LSH, HBST
- SuperGlue
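Brute-force matching of binary descriptors (the ORB/BRIEF case) is XOR plus a popcount. A toy sketch with 1-byte descriptors — real ORB descriptors are 32 bytes, but the logic is identical:

```python
import numpy as np

def hamming_bf_match(desc_a, desc_b):
    """For each row of desc_a, return the index of the nearest row of
    desc_b under Hamming distance, plus that distance."""
    # XOR every pair of descriptors, then count the differing bits.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(desc_a)), idx]

a = np.array([[0b00001111], [0b11110000]], dtype=np.uint8)
b = np.array([[0b11110000], [0b00001110]], dtype=np.uint8)
idx, d = hamming_bf_match(a, b)
print(idx, d)   # a[0]→b[1] (1 bit differs), a[1]→b[0] (0 bits differ)
```

FLANN, kd-trees, and LSH exist precisely because this all-pairs comparison scales quadratically with the number of keypoints.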
Global Feature Matching
- Bag of Visual Words, NetVLAD
- Deep image retrieval, Hierarchical localization
Feature Tracking
- Optical flow, KLT Tracker
Multiple View Geometry
- 2D-2D correspondence: Essential/Fundamental, Homography
- 2D-3D correspondence: P3P, PnP, SVD
- 3D-3D correspondence: ICP
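The 3D-3D case has a closed-form inner step: given matched point sets, the rigid transform minimising point-to-point error comes from an SVD (the Kabsch/Umeyama solution that ICP iterates). A minimal sketch:

```python
import numpy as np

def align_3d3d(P, Q):
    """Closed-form R, t minimising ||Q - (R P + t)||^2 over matched rows
    (the SVD-based inner step of point-to-point ICP)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # reject reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = align_3d3d(P, Q)
assert np.allclose(R, R_true) and np.allclose(t, t_true)
```

Full ICP alternates this solve with re-estimating the correspondences (nearest neighbours), which is why it needs a decent initial guess.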
Outlier Rejection
- RANSAC, PROSAC, M-Estimator, MAXCON, Convex relaxation
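RANSAC's hypothesise-and-verify loop is easiest to see on a toy problem. A sketch fitting a 2D line y = a·x + b through data with gross outliers (the same loop drives essential-matrix and PnP estimation, just with different minimal solvers):

```python
import numpy as np

def ransac_line(pts, iters=200, thresh=0.05, seed=0):
    """Minimal RANSAC: sample 2 points, fit a line, count inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts), bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = np.abs(pts[:, 1] - (a * pts[:, 0] + b)) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least-squares refit on the winning consensus set.
    a, b = np.polyfit(pts[best_inliers, 0], pts[best_inliers, 1], 1)
    return a, b, best_inliers

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
pts = np.stack([x, 2.0 * x + 0.5], axis=1)     # true line: a=2, b=0.5
pts[::5, 1] += rng.uniform(2, 4, size=8)       # every 5th point is an outlier
a, b, inl = ransac_line(pts)
print(round(a, 3), round(b, 3))                # recovers the true line
```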
Least Squares Optimisation
- Reprojection error, Bundle adjustment
- Non-linear optimisation, Lie algebra
- Lie groups: SO(3), SE(3)
- Gauss-Newton, Levenberg-Marquardt
- Pose graph optimization
- Schur complement / Sparsity
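The Lie-group machinery above starts with the SO(3) exponential map, which is just Rodrigues' formula. A sketch mapping an axis-angle vector to a rotation matrix:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: axis-angle w in so(3) -> rotation R in SO(3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)                        # first-order limit
    k = w / theta                               # unit rotation axis
    K = np.array([[0.0, -k[2],  k[1]],
                  [k[2],  0.0, -k[0]],
                  [-k[1], k[0],  0.0]])         # hat (skew) operator
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

R = so3_exp(np.array([0.0, 0.0, np.pi / 2]))    # 90 deg about z
assert np.allclose(R @ R.T, np.eye(3))          # orthogonal
assert np.allclose(np.linalg.det(R), 1.0)       # proper rotation
print(R @ np.array([1.0, 0.0, 0.0]))            # x-axis rotates onto y-axis
```

Gauss-Newton and Levenberg-Marquardt update poses by composing such exponentials of small increments, which keeps iterates on the manifold instead of re-normalising matrices.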
Motion Model
- Proprioceptive sensor: IMU, Wheel
- Odometry (pose)
Observation Model
- Exteroceptive sensor: Camera, LiDAR
- Landmark (Map)
- Joint optimisation, MLE & MAP
Factor Graph Optimisation
Mapping
- Point cloud, Occupancy grid mapping, TSDF, Surfel, Voxel map
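Occupancy grid mapping reduces to additive log-odds updates per cell. A sketch with assumed sensor-model increments (the 0.85 / -0.4 values are illustrative, not from any particular sensor):

```python
import numpy as np

# Log-odds occupancy update: each observation adds a fixed increment;
# the grid stays additive and converts back to P(occupied) via a sigmoid.
L_OCC, L_FREE = 0.85, -0.4      # assumed inverse-sensor-model log-odds

grid = np.zeros((5, 5))          # log-odds 0 <=> P = 0.5 (unknown)
for _ in range(3):
    grid[2, 2] += L_OCC          # cell hit by 3 range returns
    grid[2, 1] += L_FREE         # cell traversed by 3 free rays

prob = 1.0 / (1.0 + np.exp(-grid))
print(round(prob[2, 2], 3), round(prob[2, 1], 3))  # confident occupied / free
```

TSDF and surfel maps follow the same incremental-fusion idea with richer per-cell state (signed distance + weight, or oriented discs).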
Sensors
- Camera device: Wide/telecentric lens, Lens MTF, CCD/CMOS, Rolling/Global shutter, Exposure/ISO, Stereovision, RGB-D, Structured light, Active IR/ToF
- LiDAR → Visual-LiDAR fusion
- IMU → VIO
- RADAR → Sensor fusion, Extended Kalman filter
- Sonar
- Multi-sensor calibration: Camera-IMU, Camera-LiDAR
Evaluation
- Metrics: ATE (Absolute Trajectory Error), RPE (Relative Pose Error)
- Datasets: KITTI, TUM RGB-D, EuRoC
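ATE compares aligned trajectories point-by-point. A simplified sketch using translation-only alignment (the standard TUM/EuRoC tooling fits a full SE(3) or Sim(3) via Umeyama before computing the RMSE):

```python
import numpy as np

def ate_rmse(gt, est):
    """Absolute Trajectory Error RMSE after removing the mean offset
    (simplified: real ATE aligns with an SE(3)/Sim(3) Umeyama fit)."""
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    err = np.linalg.norm(gt - est_aligned, axis=1)
    return np.sqrt(np.mean(err ** 2))

t = np.linspace(0, 1, 50)
gt = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
est = gt + np.array([0.3, 0.0, 0.0])   # constant offset: removed by alignment
print(ate_rmse(gt, est))               # ~0: alignment absorbs the offset
```

RPE instead compares relative motions over fixed time/distance deltas, so it measures local drift rather than global consistency.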
Next Levels
Monocular SLAM · VIO/VINS · Stereo SLAM · Visual-LiDAR Fusion · RGB-D SLAM · Collaborative SLAM · Deep SLAM/Localization
Level 3: Monocular Visual-SLAM
Key Concepts
- VO vs SLAM — VO is local (no loop closure), SLAM includes global map + loop closure
- Scale ambiguity — Fundamental limitation of monocular SLAM; absolute scale is unrecoverable from images alone
- Covisibility graph — Shared map point visibility between keyframes; core data structure in ORB-SLAM
- Visual Place Recognition (VPR) — Recognising previously visited places for loop closure
- Self-supervised depth — Learning monocular depth without ground truth (Monodepth2, Godard 2019)
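Scale ambiguity can be shown in three lines: scaling the whole scene and the camera translation by the same factor leaves every pixel unchanged. A sketch with assumed intrinsics and an identity rotation:

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])

def project(X, t):
    """Project 3D point X seen from a camera translated by -t
    (identity rotation, assumed for simplicity)."""
    x = K @ (X + t)
    return x[:2] / x[2]

X = np.array([0.5, 0.2, 3.0])          # scene point
t = np.array([0.1, 0.0, 0.0])          # baseline between two views
for s in (1.0, 2.0, 10.0):
    print(project(s * X, s * t))       # identical pixels for every scale s
```

This is exactly why monocular SLAM drifts in scale, and why stereo baselines or IMU accelerometers (Levels 6 and 8) restore metric scale.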
Feature-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| Visual Odometry | Nister 2004 | Fundamental matrix, Triangulation, VO (local-only, no loop closure) |
| PTAM | Klein & Murray 2007 | FAST feature, Tracking, Frontend/Backend separation, Parallel threads, Keyframe, Mapping, Bundle adjustment, Manual initialisation |
| Visual SLAM: Why filter? | Strasdat 2012 | Bundle adjustment, Scale-aware BA, Motion-only BA |
| ORB-SLAM | Mur-Artal 2015 | ORB keypoint, Automatic initialisation (Homography vs Fundamental selection), Tracking thread, Sliding-window BA, Local mapping, Large-scale, Loop closure, Bag of visual words, Global optimisation, Covisibility graph, Map point management (culling, merging) |
| Pop-up SLAM | Yang 2016 | Line/Plane features |
| PL-SLAM | Pumarola 2017 | Point/Line features |
| ORB-SLAM2 | Mur-Artal 2017 | → Stereo SLAM, → RGB-D SLAM |
| CubeSLAM | Yang 2019 | Monocular 3D cuboid detection + SLAM, 9-DoF object representation |
| OpenVSLAM | Sumikura 2019 | — |
| Stella-VSLAM | (fork) 2021 | OpenVSLAM successor, license reboot |
| UcoSLAM | Munoz-Salinas 2019 | Fiducial markers |
| DeepFusion | Laidlow 2019 | — |
| ORB-SLAM3 | Campos 2020 | Monocular + Stereo + VIO, Multi-map, IMU integration |
| DXSLAM | Li 2020 | Deep features for SLAM |
| PyCuVSLAM | NVIDIA 2026 | Python + CUDA GPU-accelerated VSLAM toolkit (cuVSLAM wrapper) |
Direct SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DTAM | Newcombe 2011 | Dense mapping, Keyframe mapping, GPGPU |
| LSD-SLAM | Engel 2014 | Photometric error minimisation, High gradient pixels/edges, Large scale, Loop closure, Pose graph optimisation |
| DSO | Engel 2016 | Photometric bundle adjustment, Sliding window BA, No loop closure/global optimisation |
| LDSO | Gao 2018 | DSO + Loop closure (BoW-based), addresses DSO's main weakness |
| CNN-SLAM | Tateno 2017 | Depth from LSD-SLAM + deep depth, Semantic label |
| DVSO | Yang 2018 | Deep single image depth estimation, StackNet |
| Basalt | Usenko 2020 | Non-linear factor recovery (→ primarily VIO, see Level 6) |
| D3VO | Yang 2020 | Deep single image depth estimation, Deep pose, Deep aleatoric uncertainty |
Hybrid (Feature + Direct)
| System | Author/Year | Key Concepts |
|---|---|---|
| SVO | Forster 2014 | FAST feature detection, Direct-based feature tracking, Bundle adjustment |
| SVO2 | Forster 2017 | Multi-camera/Fisheye, Probabilistic depth estimation, Direct method convergence, Sparse method |
| Stereo DSO | Wang 2017 | → Stereo SLAM |
| VI-DSO | von Stumberg 2018 | → VIO/VINS |
Learning-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DROID-SLAM | Teed 2021 | Differentiable BA, dense optical flow, end-to-end learned |
| TartanVO | Wang 2021 | Generalizable visual odometry |
| DPV-SLAM / DPVO | Teed 2023 | DROID-SLAM lightweight, patch-based visual odometry |
| MAC-VO | Qu 2024 | Learning-based VO, metric-aware |
| VoT | Yugay 2025 | Visual Odometry with Transformers |
Foundation Model SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DUSt3R | Wang 2024 | Pointmap regression from image pairs, no calibration needed |
| MASt3R | Leroy 2024 | DUSt3R + local feature matching |
| MASt3R-SLAM | Murai 2025 | Real-time dense SLAM from MASt3R |
| VGGT | Wang (Meta) 2025 | Feed-forward inference of poses, depths, pointmaps, tracks from N views (CVPR 2025 Best Paper) |
| VGGT-SLAM | 2025 | VGGT as frontend for real-time SLAM |
| VGGT-SLAM 2.0 | 2026 | Improved VGGT-SLAM |
| VGGT-Geo | 2026 | Probabilistic geometric fusion of VGGT priors for dense indoor SLAM |
| IGGT | Li 2026 | VGGT + VLM — language-grounded 3D geometry |
| AMB3R | Wang 2025 | MASt3R frontend + Transformer backend for SfM/SLAM |
| MASt3R-Fusion | WHU 2025 | MASt3R-SLAM + IMU + GNSS fusion |
SfM Tools
| System | Author/Year | Key Concepts |
|---|---|---|
| InstantSfM | 2025 | GPU-accelerated SfM pipeline, 40× faster than COLMAP |
Neural Representation SLAM
NeRF-based
| System | Author/Year | Key Concepts |
|---|---|---|
| iMAP | Sucar 2021 | First NeRF-SLAM, single MLP, real-time tracking/mapping |
| BARF | Lin 2021 | Bundle-Adjusting NeRF, coarse-to-fine positional encoding, joint pose+NeRF opt (not full SLAM — pose+NeRF co-optimization) |
| NICE-SLAM | Zhu & Peng 2022 | Hierarchical feature grid (coarse/mid/fine), scalable |
| Co-SLAM | Wang 2023 | Hash grid (Instant-NGP) + coordinate encoding, 5-10× faster than NICE-SLAM |
| ESLAM | Johari 2023 | Tri-plane representation, O(N²) vs O(N³) memory |
| Point-SLAM | Sandström 2023 | Neural point cloud based |
| NeRF-SLAM | Rosinol 2023 | NeRF + classical SLAM pipeline |
| NICER-SLAM | Zhu 2024 | RGB-only NeRF-SLAM (no depth sensor), monocular depth integration |
| vMAP | Kong 2023 | Object-level NeRF-SLAM, per-object neural fields |
| GO-SLAM | Zhang 2023 | Global optimization + NeRF-SLAM, loop closure + global BA |
3DGS-based
| System | Author/Year | Key Concepts |
|---|---|---|
| SplaTAM | Keetha 2024 | First 3DGS-SLAM, RGB-D, silhouette-guided densification |
| MonoGS | Matsuki 2024 | Monocular 3DGS-SLAM, depth network + triangulation fusion |
| GS-ICP SLAM | Yu 2024 | Gaussian-to-Gaussian ICP (Mahalanobis distance), geometric tracking |
| Photo-SLAM | Huang 2024 | Explicit geometry + implicit appearance (MLP color), anti-aliasing |
| RTG-SLAM | 2024 | Real-time focus, adaptive Gaussian budget, Jetson Orin 25 FPS |
| EGG-Fusion | ZJU 2025 | Gaussian surfel fusion, information-filter-based, real-time 24 FPS |
| Online-Mono-3DGS (MODP) | 2025 | ORB-SLAM3 tracking + Hierarchical Gaussian Management |
| ActiveSplat | Li 2025 | Active mapping with 3DGS + Voronoi-based path planning |
| Open-S3SLAM | 2026 | Open-set semantic 3DGS SLAM for smartphones (ICRA 2026) |
| LEGS | 2025 | Language Embedded Gaussian Splats, real-time language-queryable 3D |
Semantic / Language-Grounded SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| ConceptFusion | Jatavallabhula (MIT) 2023 | CLIP features fused into 3D map, open-vocabulary language queries |
| LERF | Kerr 2023 | Language Embedded Radiance Fields, DINO multi-scale, NeRF + CLIP |
| OpenScene | Peng (ETH) 2023 | Language features back-projected to 3D point clouds |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM spatial relations |
| SpatialLLM | Mao 2025 | Point cloud → LLM, structured indoor modeling as Python scripts |
Also see: LEGS, Open-S3SLAM (3DGS-based section above); Open-YOLO 3D (Level 5 Object Detection)
Level 4: RGB-D Visual-SLAM
RGB-D Camera Devices
- Intel RealSense D series
- Microsoft Kinect v1/v2
- Azure Kinect DK
- Occipital Structure Core
- Orbbec Astra
GPGPU Programming
- CUDA, OpenGL GLSL
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| ICP | Besl & McKay 1992 | — |
| DTAM | Newcombe 2011 | — |
| KinectFusion | Newcombe 2011 | GPGPU, Tracking (project depth → 3D, surface normal, coarse-to-fine ICP), Mapping (volumetric integration, TSDF), Robust to small scene changes, Cannot model deformation, Map growth cubic, Room-size only |
| Double Window Optimisation | Strasdat 2011 | — |
| Kintinuous | Whelan 2012 | Volume shift, Geometric, Photometric, dBoW+SURF, Optimisation, Loop closure |
| RGBD-SLAM-V2 | Endres 2013 | Tracking (colour image, visual features, depth image, point cloud, transformation), Mapping (OctoMap 2013) |
| SLAM++ | Salas-Moreno 2013 | Object-oriented SLAM |
| DVO | Kerl 2013 | Keyframe, Depth, Direct method, Optimisation, Loop closure |
| RTAB-Map | Labbé 2014 | Loop closure, Map merge, Multi-session memory management |
| MRS-Map | Stückler 2014 | — |
| ElasticFusion | Whelan 2015 | Active: frame-to-model tracking (photometric + geometric), joint optimisation, fused surfel-based model reconstruction · Inactive: local loop closure (model-to-model local surface, submodel separation), global loop closure (randomised fern encoding, non-rigid space deformation) |
| DynamicFusion | Newcombe 2015 | 6D motion field, Deformable scene |
| ORB-SLAM2 | Mur-Artal 2016 | Bundle adjustment, Sparse reconstruction |
| BundleFusion | Dai 2016 | Local-to-global optimisation, Sparse RGB feature, Coarse global pose estimation, Fine pose refinement (geometric + photometric) |
| SemanticFusion | McCormac 2016 | Deep Learning CNN, Deep Semantic SLAM |
| InfiniTAM v3 | Prisacariu 2017 | Tracking (scene raycast, depth image, RGB image), Relocalisation (random ferns), Mapping (TSDF reconstruction, voxel hashing, surfel reconstruction) |
| Fusion++ | McCormac & Clark 2018 | Deep Learning CNN, Mask-RCNN instance segmentation, Object-level SLAM, No prior, Object-level TSDF reconstruction |
| PointFusion / DenseFusion | Xu 2018 / Wang 2019 | RGB-D object pose estimation, Tracking, Relocalisation, Loop closure detection |
| BAD SLAM | Schöps 2019 | Direct bundle adjustment, surfel-based RGB-D SLAM |
| RTAB-Map v2 | Labbé 2019 | RGB-D/LiDAR, Light-source detection (2016) |
| MoreFusion | Wada & Sucar 2020 | DL instance segmentation, Object-level volumetric fusion, Volumetric pose prediction, 3D scene reconstruction, Collision-based refinement, Semantic SLAM, Object pose estimation, CAD object fitting |
| NodeSLAM | Wada & Sucar 2020 | Occupancy VAE, Object-level SLAM (→ also in Level 5 Latent Representation) |
| Kimera / 3D Dynamic Scene Graph | Rosinol 2020 | Kimera-VIO, Kimera-Mesher, Kimera-PGMO, Kimera-Semantics, Kimera-DSG |
| DSP-SLAM | Wang (UCL) 2021 | DeepSDF shape prior + ORB-SLAM2, object-level dense reconstruction (mono/stereo/LiDAR) |
Level 5: Applying Deep Learning
Level 5 is organized into four pillars:
A. Frontend — learned perception components replacing hand-crafted modules
B. Backend — learned/certifiable optimization replacing classical solvers
C. Systems — end-to-end deep VO/SLAM pipelines
D. Scene Understanding — semantic, language, and relational reasoning on SLAM maps
A. Deep Frontend — Perception
Feature Detection & Matching
| System | Author/Year | Key Concepts |
|---|---|---|
| NetVLAD | Arandjelovic 2016 | VLAD, place recognition |
| SuperPoint | DeTone 2017 | Homographic Adaptation, Self-supervised, VGG encoder + detector/descriptor heads |
| HardNet | Mishchuk 2017 | Learned local descriptor |
| R2D2 | Revaud 2019 | Repeatable + Reliable detector/descriptor, explicit repeatability/reliability maps |
| KeyNet | Barroso-Laguna 2019 | Learned keypoint detector |
| HF-Net | Sarlin 2019 | Global feature, Local feature, Visual localization |
| SuperGlue | Sarlin 2020 | Self/Cross-attention GNN, Sinkhorn optimal assignment, dustbin for outliers |
| DISK | Tyszkiewicz 2020 | Policy gradient (RL) training, match success/failure as reward |
| Patch NetVLAD | Hausler 2021 | Multi-scale patch-level VLAD |
| LoFTR | Sun 2021 | Detector-free, Transformer coarse-to-fine dense matching |
| LightGlue | Lindenberger 2023 | Adaptive depth/width, 5-10× faster than SuperGlue |
| XFeat | Potje 2024 | 0.3M params, 1400 FPS (RTX 4090), 64-dim descriptor, embedded-friendly |
| RoMA | Edstedt 2024 | DINOv2 foundation feature + coarse-to-fine dense matching |
| DeDoDe | Edstedt 2024 | Joint detect-and-describe in one stage |
| RoMA V2 | Edstedt 2026 | Improved RoMA |
Depth Estimation
| System | Author/Year | Key Concepts |
|---|---|---|
| MonoDepth | Godard 2016 | Left-Right photometric consistency, self-supervised |
| MiDaS | Ranftl 2020 | Multi-dataset mixing, scale-and-shift invariant loss, relative depth |
| DPT | Ranftl 2021 | Dense Prediction Transformer (ViT backbone), global context |
| ZoeDepth | Bhat 2023 | Zero-shot metric depth, Metric Bins Module |
| Metric3D | Yin 2023 | Camera intrinsic-conditioned metric depth, Canonical Camera Space |
| Depth Anything | Yang 2024 | 62M images, foundation model for monocular depth |
| Depth Anything V2 | Yang 2024 | Improved with synthetic data, better edge preservation |
| Marigold | Ke 2024 | Stable Diffusion for depth, fine detail, uncertainty via sampling |
| Align3r | Melou 2025 | Video temporal consistency, DUSt3R-based, CVPR 2025 Highlight |
| Masked Depth Modeling (LingBot-Depth) | 2026 | Fixes RGB-D failures on glass/mirrors/metal |
Optical Flow & Scene Flow
| System | Author/Year | Key Concepts |
|---|---|---|
| FlowNet | Dosovitskiy 2015 | First end-to-end deep optical flow (SimpleNet / CorrNet) |
| FlowNet 2.0 | Ilg 2017 | Stacked networks, classical-level accuracy |
| PWC-Net | Sun 2018 | Pyramid-Warping-Cost volume, coarse-to-fine, 8.4M params |
| FlowNet3D | Liu 2019 | Point cloud scene flow, PointNet++ based |
| RAFT | Teed 2020 | All-Pairs Correlation + iterative ConvGRU update, ECCV Best Paper |
| RAFT-3D | Teed 2021 | Scene flow (3D motion) from RAFT |
| FlowFormer | Huang 2022 | Transformer on cost volume tokens, global context |
| SEA-RAFT | 2024 | Efficient RAFT variant for real-time |
Camera Pose Regression & Relocalization
| System | Author/Year | Key Concepts |
|---|---|---|
| PoseNet | Kendall 2015 | CNN-based 6-DoF pose regression (APR), GoogLeNet backbone |
| DSAC | Brachmann 2017 | Differentiable RANSAC, Scene Coordinate Regression (SCR) |
| DSAC++ | Brachmann 2018 | Self-supervision, RGB-D support |
| CNN Pose Regression Limitations | Sattler 2019 | Pose regression ≈ image retrieval performance |
| LM-Reloc | von Stumberg 2020 | Deep direct relocalization |
| DSAC* | Brachmann 2021 | Improved learning stability |
| ACE | Brachmann 2023 | Accelerated Coordinate Encoding, 5-min training per scene |
| ACE Zero | Brachmann 2024 | Zero-shot SCR, no pre-built 3D map needed |
| ACE-G | Brachmann 2024 | Generalizable SCR via cross-attention, new scenes without fine-tuning |
| ACE-SLAM | Tang 2024 | Neural implicit real-time SLAM, network weights = map |
| hloc | Sarlin 2019+ | Hierarchical Localization: coarse (NetVLAD) → fine (SuperGlue) pipeline |
Object Detection & Segmentation for SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| YOLO (v1→v11) | Redmon 2016→2024 | Real-time object detection, Ultralytics ecosystem |
| DETR | Carion 2020 | Transformer detection, anchor-free, no NMS |
| RT-DETR | Lv (Baidu) 2023 | Real-time DETR, YOLO-speed + Transformer quality |
| SAM | Kirillov 2023 | Segment Anything, prompt-based, Foundation Model |
| SAM 2 | Meta 2024 | Video segmentation, Memory Attention, temporal consistency |
| Grounding DINO | Liu 2023 | Text-prompted detection → SAM pipeline (Grounded SAM) |
| Open-YOLO 3D | Benseddik 2025 | 2D open-vocab detection → 3D instance seg, 16× faster |
B. Deep Backend — Optimization
Differentiable Bundle Adjustment
| System | Author/Year | Key Concepts |
|---|---|---|
| BA-Net | Tang 2019 | FPN + differentiable LM layer, end-to-end SfM (ICLR) |
| DROID-SLAM | Teed 2021 | Dense optical flow + differentiable dense BA, all-pixels reprojection |
| DPVO | Teed 2023 | Patch-based DROID-SLAM, 30+ FPS real-time |
| Theseus | Pineda (Meta) 2022 | Differentiable nonlinear optimization library (PyTorch) |
| Lietorch | Teed 2021 | Lie group operations for PyTorch (SE(3)/SO(3)) |
Certifiably Optimal Algorithms
| System | Author/Year | Key Concepts |
|---|---|---|
| SE-Sync | Rosen 2019 | Certifiable pose graph optimization via SDP + Riemannian opt |
| TEASER++ | Yang 2020 | Point cloud registration, 90%+ outlier robust, TLS + Max Clique (T-RO/RSS 2020) |
| GNC | Yang 2020 | Graduated Non-Convexity, continuation from convex → robust cost |
| QUASAR | Yang 2022 | Certifiable rotation averaging, SDP + robust cost |
Gaussian Belief Propagation & Graph Processors
| System | Author/Year | Key Concepts |
|---|---|---|
| FutureMapping 1 | Davison 2018 | Computational structure of Spatial AI, GBP for SLAM |
| FutureMapping 2 | Ortiz 2019 | GBP as core Spatial AI primitive, visual intro to GBP |
| BA on Graph Processor | Ortiz 2020 | Bundle Adjustment on Graphcore IPU, tile-based parallelism |
| DANCeRS | 2023 | GBP-based distributed consensus in robot swarms |
C. End-to-End Deep VO / SLAM Systems
Self-supervised & Learned VO
| System | Author/Year | Key Concepts |
|---|---|---|
| DeepVO | Wang 2017 | Supervised learning |
| SfM-Learner | Zhou 2017 | Unsupervised, deep depth + deep pose |
| DeMoN | Ummenhofer 2017 | Depth + Motion from two frames, encoder-decoder |
| UnDeepVO | Li 2018 | Stereo self-supervised, absolute scale recovery |
| DeepTAM | Zhou 2018 | Deep tracking and mapping, cost volume based |
| DeepV2D | Teed 2018 | Iterative depth from video, differentiable geometry layers |
| Depth from Video in the Wild | Gordon 2019 | Unconstrained video depth, learned camera intrinsics |
| Neural Ray Surfaces | Vasiljevic 2020 | Learned ray surface model, non-pinhole cameras |
| GradSLAM | Murthy 2020 | Differentiable SLAM framework (PyTorch, supports multiple SLAM backends) |
| DeepSLAM | Wang 2020 | TrackingNet, MappingNet, LoopNet |
| MonoRec | Wimbauer 2021 | Self-supervised monocular 3D reconstruction, moving objects |
| TANDEM | Koestler 2021 | Real-time tracking + dense mapping via MVS depth, DSO-based |
| DROID-SLAM | Teed 2021 | Dense BA + correlation, SOTA on TartanAir/EuRoC (→ see Differentiable BA) |
| DPVO | Teed 2023 | Patch-based lightweight DROID (→ see Differentiable BA) |
Latent Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| CodeSLAM | Bloesch 2018 | Depth as 128-dim latent code, photometric BA on codes + poses |
| SceneCode | Zhi 2019 | Depth + semantic in single latent code, cross-modal constraints |
| DeepFactors | Czarnowski 2020 | Probabilistic depth codes + factor graph, GPU 30+ FPS |
| NodeSLAM | Sucar 2020 | Object-level DeepSDF codes, occupancy VAE per object |
| CodeMapping | Shao 2021 | Sparse SLAM + learned dense mapping, hybrid approach |
Neural Rendering (reference)
NeRF/3DGS-based SLAM systems → see Level 3: Neural Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| NeRF | Mildenhall 2020 | Neural Radiance Fields, novel view synthesis (foundational) |
| DIFIX3D+ | 2026 | Single-step diffusion for 3D reconstruction artifact removal (post-processing) |
D. Scene Understanding
Benchmarks & Foundations
| System | Author/Year | Key Concepts |
|---|---|---|
| EFM3D | Straub (Meta) 2024 | Egocentric Foundation Model 3D benchmark, depth/surface/semantic from ego-video |
3D Scene Graph
| System | Author/Year | Key Concepts |
|---|---|---|
| Hydra | Hughes (MIT SPARK) 2022 | Real-time hierarchical Scene Graph (mesh→objects→places→rooms→buildings) |
| Hydra-Multi | Hughes 2023 | Distributed multi-robot 3D Scene Graph |
| Clio | Maggio (MIT SPARK) 2024 | Open-set task-driven Scene Graph, CLIP embeddings per node |
| Khronos | Schmid (MIT SPARK) 2024 | Spatio-temporal Scene Graph, dynamic object history tracking |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM relations (→ also in L3 Semantic) |
Level 6: VIO / VINS
Key Concepts
- Tightly-coupled vs Loosely-coupled — Joint vs separate optimization of visual and inertial measurements
- Filter-based vs Optimization-based — EKF approaches vs nonlinear optimization (BA)
- IMU preintegration — On-manifold IMU integration between keyframes (Forster 2015)
- IMU noise model — Bias, random walk, Allan variance
- Observability — Yaw and global position are unobservable in VIO
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| 📖 Introduction to Inertial Navigation | Woodman 2007 | IMU fundamentals, coordinate frames, error sources — essential prerequisite |
| IMU Preintegration on Manifold | Forster 2015 | On-manifold preintegration, bias correction without re-integration |
| Quaternion kinematics for error-state KF | Sola 2017 | Quaternion math, error-state formulation |
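A quick numeric illustration of why the inertial side cannot stand alone: double-integrating accelerometer output turns even a small constant bias into metres of position drift within seconds. A sketch with assumed noise figures (0.05 m/s² bias, 0.02 m/s² white noise at 200 Hz):

```python
import numpy as np

# Dead-reckoning from a biased accelerometer while the true motion is zero.
rng = np.random.default_rng(0)
dt, n = 0.005, 2000                    # 200 Hz for 10 s
bias = 0.05                            # assumed constant accel bias (m/s^2)
acc = bias + rng.normal(0, 0.02, n)    # measured acceleration (truth = 0)

v = np.cumsum(acc) * dt                # first integration: velocity error
p = np.cumsum(v) * dt                  # second integration: position error
print(round(p[-1], 2))                 # metres of drift after only 10 s
```

Bias alone contributes roughly ½·b·t² ≈ 2.5 m here. Visual landmarks bound this drift, and preintegration (Forster 2015) lets the optimiser re-linearise bias estimates without re-integrating raw IMU samples.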
Filter-based
| System | Author/Year | Key Concepts |
|---|---|---|
| MSCKF | Mourikis 2007 | Multi-State Constraint KF, efficient VIO without landmarks in state |
| ROVIO | Bloesch 2015 | Robocentric VIO, direct photometric tracking + EKF |
| OpenVINS | Geneva 2020 | Open-source MSCKF, modular, extensible |
Optimization-based
| System | Author/Year | Key Concepts |
|---|---|---|
| OKVIS | Leutenegger 2015 | Keyframe-based, tightly-coupled, sliding window optimization |
| VINS-Mono | Qin 2018 | Tightly-coupled, relocalization, loop closure, pose graph optimization |
| VINS-Fusion | Qin 2019 | Stereo + GPS fusion extension |
| MAPLAB | Schneider 2018 | Multi-session visual-inertial mapping framework |
| Kimera-VIO | Rosinol 2020 | Fast VIO frontend for Kimera pipeline, structureless vision factors |
| Basalt | Usenko 2020 | Non-linear factor recovery, visual-inertial odometry + mapping |
| ORB-SLAM3 | Campos 2020 | VIO mode, multi-map, IMU initialization |
| DM-VIO | von Stumberg 2022 | Deep monocular VIO, delayed marginalization |
| OKVIS2 | Leutenegger 2022 | Multi-session, improved marginalization |
| AirVO | Xu 2023 | Point-line VIO, illumination-robust |
| OKVIS2-X | Boche & Leutenegger 2025 | Multi-sensor SLAM (Visual+Inertial+Depth+LiDAR+GNSS), dense volumetric occupancy maps, submapping for large-scale (9km+), EuRoC/Hilti22 SOTA |
Level 7: World Models & Spatial AI
World Models
| System | Author/Year | Key Concepts |
|---|---|---|
| GAIA-1 | Wayve 2023 | Driving World Model, action-conditioned future scene generation |
| Sora / DiT | OpenAI 2024 | Diffusion Transformer, spacetime patches, emergent 3D understanding |
| NVIDIA Cosmos | NVIDIA 2026 | World Foundation Model platform for Physical AI, synthetic data for AV/robots |
| World Labs / Marble | Fei-Fei Li 2026 | 3D world generation from images/video/text ($1B funding) |
| WorldVLA | Alibaba 2025 | Autoregressive action world model, learns physics for action generation |
| SceneDINO | 2025 | Feed-forward unsupervised semantic scene completion |
Generative 3D
| System | Author/Year | Key Concepts |
|---|---|---|
| DreamFusion | Poole 2023 | Text-to-3D via Score Distillation Sampling (SDS) + NeRF |
Vision-Language Models (VLM)
| System | Author/Year | Key Concepts |
|---|---|---|
| CLIP | Radford (OpenAI) 2021 | Contrastive image-text pretraining, 400M pairs, zero-shot |
| SigLIP | Zhai (Google) 2023 | Sigmoid loss CLIP, more efficient, better at small model sizes |
| BLIP-2 | Li (Salesforce) 2023 | Q-Former bridges frozen LLM + image encoder |
| LLaVA | Liu 2023 | LLaMA + vision, conversational VLM |
Vision-Language-Action Models (VLA)
| System | Author/Year | Key Concepts |
|---|---|---|
| RT-2 | Brohan (DeepMind) 2023 | Robot actions as text tokens, emergent generalization |
| OpenVLA | Kim 2024 | Open-source VLA, SigLIP + Llama 7B + Action Head |
| NaVILA | 2024 | Navigation-specialized VLA, SLAM integration for localization |
Resources
| Resource | Key Concepts |
|---|---|
| Awesome-Transformer-based-SLAM | Curated GitHub list of Transformer-based SLAM methods |
Level 8: Stereo SLAM
Key Concepts
- Stereo rectification — Epipolar alignment for efficient disparity search
- Disparity vs Depth — d = f·B/Z, baseline determines depth range/accuracy
- Scale observability — Stereo provides metric scale (unlike monocular)
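The disparity-depth relation above is worth computing once by hand. A sketch with assumed stereo parameters (f = 500 px, B = 0.12 m), also showing how a one-pixel disparity error blows up with distance:

```python
import numpy as np

# Stereo depth from disparity: Z = f * B / d.
f, B = 500.0, 0.12                    # assumed focal length (px), baseline (m)

d = np.array([60.0, 6.0, 3.0])        # disparities in pixels
Z = f * B / d
print(Z)                              # → [ 1. 10. 20.] metres

# Depth change caused by a one-pixel disparity error at each depth:
dZ = f * B / (d - 1) - Z
print(np.round(dZ, 3))                # error grows roughly as Z^2 / (f * B)
```

This quadratic error growth is why stereo rigs have a useful depth range tied to their baseline, and why long-range depth usually needs LiDAR.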
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| S-PTAM | Pire 2017 | Stereo PTAM, ROS-compatible, real-time |
| ORB-SLAM2 (stereo) | Mur-Artal 2016 | Stereo + RGB-D modes, loop closure, relocalization |
| StereoMSCKF | Sun 2018 | MSCKF with stereo, efficient for resource-constrained platforms |
| RTAB-Map | Labbé 2019 | Multi-sensor (stereo/RGB-D/LiDAR), memory management, large-scale |
| ORB-SLAM3 (stereo) | Campos 2020 | Multi-map, Atlas, stereo + IMU |
| Stella-VSLAM | Community 2022 | Open-source fork of OpenVSLAM, stereo support |
| LDSO | Gao 2018 | DSO + loop closure (monocular, not stereo; see Level 3 Direct SLAM) |
Level 9: Collaborative / Multi-Robot SLAM
Key Concepts
- Centralized vs Decentralized — Single server vs peer-to-peer map merging
- Inter-robot loop closure — Place recognition across robots with different viewpoints
- Communication constraints — Bandwidth-limited map sharing, sparse descriptors
- Map merging — Aligning submaps from different robots into a global map
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| C2TAM | Riazuelo 2014 | Cloud-based collaborative monocular SLAM |
| CCM-SLAM | Schmuck & Chli 2019 | Centralized collaborative monocular SLAM, robust to comm failures |
| DOOR-SLAM | Lajoie 2020 | Distributed, outlier-resilient SLAM with pairwise consistency |
| Kimera-Multi | Tian 2022 | Distributed multi-robot metric-semantic SLAM, mesh reconstruction |
| Swarm-SLAM | Lajoie 2024 | Decentralized, sparse, scalable C-SLAM, supports LiDAR/stereo/RGB-D |
| CoPeD-Advancing | Stathoulopoulos 2024 | Multi-robot collaborative perception for autonomous exploration |
| MAPLAB 2.0 | Cramariuc 2023 | Multi-session, multi-robot visual-inertial mapping |
Level 10: LiDAR & Visual-LiDAR Fusion SLAM
Key Concepts
- LiDAR-Visual-Inertial (LVI) — Triple fusion for robust outdoor SLAM
- Tightly-coupled LiDAR-camera — Joint optimization of point cloud and visual features
- Direct LiDAR-camera alignment — Photometric/geometric alignment without feature extraction
- Degradation handling — Graceful fallback when one modality fails (e.g., LiDAR in rain, camera in darkness)
- Range image — 2D projection of LiDAR scans for efficient processing (SuMa, RangeNet++)
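The range-image idea from the list is a spherical projection of the point cloud. A small sketch of the SuMa / RangeNet++-style projection, with an assumed ±15° vertical field of view and a tiny 16×360 image:

```python
import numpy as np

def to_range_image(points, h=16, w=360):
    """Project an (N, 3) LiDAR point cloud into a 2D range image."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                  # [-pi, pi]
    pitch = np.arcsin(z / r)
    fov_up, fov_down = np.radians(15.0), np.radians(-15.0)  # assumed FoV
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w   # column from yaw
    v = ((fov_up - pitch) / (fov_up - fov_down) * h).clip(0, h - 1).astype(int)
    img = np.full((h, w), np.nan)
    img[v, u] = r                                           # store range
    return img

pts = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
img = to_range_image(pts)
print(np.count_nonzero(~np.isnan(img)))   # both points land in the image
```

Once scans live on a 2D grid, neighbourhood lookups, normal estimation, and CNN inference all become cheap image operations.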
LiDAR / LiDAR-Inertial SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LOAM | Zhang 2014 | LiDAR odometry and mapping (foundational), edge + planar features |
| SuMa | Behley (Bonn) 2018 | Surfel-based LiDAR SLAM, projective ICP on range images |
| SuMa++ | Chen (Bonn) 2019 | SuMa + RangeNet++ semantics, semantic ICP weighting, dynamic object filtering |
| LIO-SAM | Shan 2020 | Tightly-coupled LiDAR-inertial, factor graph, GPS fusion |
| FAST-LIO2 | Xu 2022 | Direct LiDAR-inertial, ikd-Tree, extremely fast |
| PIN-SLAM | Pan (Bonn) 2024 | Neural point cloud LiDAR SLAM, point-to-SDF registration, elastic map deformation for loop closure |
Visual-LiDAR Fusion SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LVI-SAM | Shan 2021 | LiDAR-Visual-Inertial via factor graph, LIO-SAM + VINS-Mono |
| R3LIVE | Lin 2022 | Real-time LiDAR-Visual-Inertial, dense RGB point cloud map |
| R3LIVE++ | Lin 2023 | Improved R3LIVE with mesh reconstruction |
| FAST-LIVO | Zheng 2022 | FAST-LIO + direct visual odometry, tightly-coupled LVI |
| FAST-LIVO2 | Zheng 2024 | Improved, sequential image processing, direct photometric fusion |
| OKVIS2-X | Boche 2025 | Visual+Inertial+Depth+LiDAR+GNSS configurable (also in Level 6) |
Resources
| Resource | Key Concepts |
|---|---|
| LiDAR-Visual-Inertial Survey (Zheng 2024) | Comprehensive survey of LVI SLAM systems |
Level 11: Event Camera SLAM
Key Concepts
- Event cameras (DVS) — Asynchronous per-pixel brightness change detection, μs temporal resolution
- Advantages — HDR (140dB+), no motion blur, low latency, low power
- Challenges — No absolute intensity, sparse asynchronous output, requires new algorithms
- Event representations — Event frames, time surfaces, voxel grids, spike tensors
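Two of the representations above take only a few lines: an "event frame" sums polarities per pixel over a window, a "time surface" keeps the latest timestamp per pixel. A toy sketch with hand-written events:

```python
import numpy as np

# Events are (x, y, t, polarity) tuples streamed asynchronously per pixel.
events = np.array([
    # x, y,  t,    p
    [2, 3, 0.01, +1],
    [2, 3, 0.02, +1],
    [5, 1, 0.03, -1],
])

H, W = 8, 8
frame = np.zeros((H, W))     # event frame: polarity sum per pixel
tsurf = np.zeros((H, W))     # time surface: most recent timestamp per pixel
for x, y, t, p in events:
    frame[int(y), int(x)] += p
    tsurf[int(y), int(x)] = t

print(frame[3, 2], frame[1, 5], tsurf[3, 2])
```

Both representations turn the asynchronous stream into dense tensors that frame-based trackers and CNNs can consume, at the cost of discarding some temporal precision.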
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| 📖 Event-based Vision Survey | Gallego 2020 | Comprehensive survey of event camera algorithms |
| Awesome-Event-based-SLAM | KwanWaiPang | Curated GitHub list of event-based SLAM papers |
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| EVO | Rebecq 2017 | Event-based Visual Odometry, 3D reconstruction from events |
| ESVO | Zhou 2021 | Event-based Stereo Visual Odometry |
| Ultimate-SLAM | Vidal 2018 | Events + frames + IMU fusion |
| EKLT | Gehrig 2020 | Event-based KLT feature tracking |
| ESVIO | Chen 2023 | Event-based Stereo VIO |
| EDS | Hidalgo-Carrió 2022 | Event-aided direct sparse odometry |
| DEVO | Pellerito 2024 | Deep event-based visual odometry (DROID-SLAM style) |
| VIO-GO | 2025 | Event-based VIO with optimized parameters for HDR scenarios |