📘 Study Roadmap : Visual-SLAM (Beginner → Master)
Created: 2026-02-22
Level 1: Beginner
Programming
- C++: Pointer, OOP
- Python
- Bash/Linux: Basic terminal usage
Mathematics
- Basic Probability & Statistics: Gaussian distribution, Bayes' theorem
- Basic Linear Algebra: Vectors & Matrices, Determinant, Dot & Cross product, Rank, Inverse matrix, Transpose matrix, SVD, Eigenvalues/Eigenvectors
- Logarithm & Exponential
- Basic Calculus: Differentiation, Taylor expansion
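A quick numpy sketch of two items from the linear-algebra list (SVD and eigenvalues) — a toy 2×2 matrix, assumed purely for illustration:

```python
import numpy as np

# SVD decomposes any matrix as A = U @ diag(S) @ Vt.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
U, S, Vt = np.linalg.svd(A)
A_rec = U @ np.diag(S) @ Vt
assert np.allclose(A, A_rec)           # reconstruction is exact

# For a symmetric matrix, singular values equal |eigenvalues|.
eigvals = np.linalg.eigvalsh(A)        # ascending: [2., 4.]
assert np.allclose(sorted(S), sorted(np.abs(eigvals)))
```

Playing with decompositions like this in numpy is a low-cost way to build the intuition the later optimisation levels rely on.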
Projective Geometry
- Pinhole camera model → Image projection
- Camera calibration: Intrinsic/Extrinsic parameters, Lens distortion
- Rigid body motion: Euler/Quaternion/Rotation Matrix, Projective space & Vanishing point, Homogeneous transformation
- Epipolar geometry → Essential & Fundamental matrix
- Triangulation
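The pinhole model above reduces to one matrix product. A minimal sketch with assumed intrinsics (fx=fy=500, cx=320, cy=240) and identity camera pose:

```python
import numpy as np

# Pinhole projection: uv_h = K @ X (camera frame), then divide by depth.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])   # intrinsics: fx, fy, cx, cy
X = np.array([0.2, -0.1, 2.0])          # 3D point in the camera frame
uv_h = K @ X
uv = uv_h[:2] / uv_h[2]                 # perspective division by Z
print(uv)                               # → [370. 215.]
```

With a non-trivial pose, `X` is first transformed by the extrinsics `[R | t]`; lens distortion is applied after the perspective division.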
Camera Device
- Lens, Sensor, Resolution/ISO/Aperture
Image Data
- Colour image, Resolution, Grayscale image
- Thresholding, Gaussian blur
- Corner detector: Harris corner
- Edge detector: Sobel & Canny Edge
- Stereovision, RGB-D, Disparity, Depth
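Edge detection from the list above is just a small convolution. A self-contained Sobel sketch in plain numpy (a loop implementation for clarity, not speed — OpenCV's `Sobel` is the practical tool):

```python
import numpy as np

def sobel_magnitude(img):
    """Gradient magnitude via 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = img.shape
    gx = np.zeros((H - 2, W - 2))
    gy = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy)

# A vertical step edge gives the strongest response along the step.
img = np.zeros((8, 8)); img[:, 4:] = 1.0
mag = sobel_magnitude(img)
print(mag.max())   # → 4.0
```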
Level 2: Getting Familiar with SLAM
Programming
- C++: OOP, Modern C++, Data structures & Algorithms, Compilers, CMake/Makefile/Ninja, Design patterns, OpenCV C++
- C
- Git/GitHub
- OpenCV (opencv-python)
- Python: Deep learning, Graph plots, System scripts
- Bash/Linux: ssh, CLI text editor/Vim/tmux
- Concurrency: SIMD-SSE/AVX/Neon, OpenMP, CUDA
- Mobile: Android (Java/Kotlin), iOS (Objective-C/Swift)
- Maths library: Eigen, Ceres-solver/GTSAM/g2o
- C++/Python interop: PyBind11, nanobind
- Docker
- C#: COLMAP, Unity AR, Microsoft Hololens
- CI/CD: GitHub Actions, Apache Airflow
- ROS/ROS2
- Simulation: Gazebo, Isaac Sim
Image Processing
- Keypoints → Detector/Descriptor
- SIFT, FAST, ORB, AKAZE
- Deep features: R2D2, Superpoint
- Image pyramid, oFAST, rBRIEF
Local Feature Matching
- Brute-Force, FLANN, Kd-Tree
- LSH, Multi-probe LSH, HBST
- SuperGlue
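Brute-force matching of binary descriptors (the ORB/BRIEF case) is XOR plus a popcount. A toy sketch with 1-byte descriptors — real ORB descriptors are 32 bytes, but the logic is identical:

```python
import numpy as np

def hamming_bf_match(desc_a, desc_b):
    """For each row of desc_a, return the index of the nearest row of
    desc_b under Hamming distance, plus that distance."""
    # XOR every pair of descriptors, then count the differing bits.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(desc_a)), idx]

a = np.array([[0b00001111], [0b11110000]], dtype=np.uint8)
b = np.array([[0b11110000], [0b00001110]], dtype=np.uint8)
idx, d = hamming_bf_match(a, b)
print(idx, d)   # a[0]→b[1] (1 bit differs), a[1]→b[0] (0 bits differ)
```

FLANN, kd-trees, and LSH exist precisely because this all-pairs comparison scales quadratically with the number of keypoints.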
Global Feature Matching
- Bag of Visual Words, NetVLAD
- Deep image retrieval, Hierarchical localization
Feature Tracking
- Optical flow, KLT Tracker
Multiple View Geometry
- 2D-2D correspondence: Essential/Fundamental, Homography
- 2D-3D correspondence: P3P, PnP, SVD
- 3D-3D correspondence: ICP
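The 3D-3D case has a closed-form inner step: given matched point sets, the rigid transform minimising point-to-point error comes from an SVD (the Kabsch/Umeyama solution that ICP iterates). A minimal sketch:

```python
import numpy as np

def align_3d3d(P, Q):
    """Closed-form R, t minimising ||Q - (R P + t)||^2 over matched rows
    (the SVD-based inner step of point-to-point ICP)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # reject reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R, t = align_3d3d(P, Q)
assert np.allclose(R, R_true) and np.allclose(t, t_true)
```

Full ICP alternates this solve with re-estimating the correspondences (nearest neighbours), which is why it needs a decent initial guess.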
Outlier Rejection
- RANSAC, PROSAC, M-Estimator, MAXCON, Convex relaxation
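RANSAC's hypothesise-and-verify loop is easiest to see on a toy problem. A sketch fitting a 2D line y = a·x + b through data with gross outliers (the same loop drives essential-matrix and PnP estimation, just with different minimal solvers):

```python
import numpy as np

def ransac_line(pts, iters=200, thresh=0.05, seed=0):
    """Minimal RANSAC: sample 2 points, fit a line, count inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts), bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        (x1, y1), (x2, y2) = pts[i], pts[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = np.abs(pts[:, 1] - (a * pts[:, 0] + b)) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least-squares refit on the winning consensus set.
    a, b = np.polyfit(pts[best_inliers, 0], pts[best_inliers, 1], 1)
    return a, b, best_inliers

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
pts = np.stack([x, 2.0 * x + 0.5], axis=1)     # true line: a=2, b=0.5
pts[::5, 1] += rng.uniform(2, 4, size=8)       # every 5th point is an outlier
a, b, inl = ransac_line(pts)
print(round(a, 3), round(b, 3))                # recovers the true line
```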
Least Squares Optimisation
- Reprojection error, Bundle adjustment
- Non-linear optimisation, Lie algebra
- Lie groups: SO(3), SE(3)
- Gauss-Newton, Levenberg-Marquardt
- Pose graph optimization
- Schur complement / Sparsity
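The Lie-group machinery above starts with the SO(3) exponential map, which is just Rodrigues' formula. A sketch mapping an axis-angle vector to a rotation matrix:

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: axis-angle w in so(3) -> rotation R in SO(3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)                        # first-order limit
    k = w / theta                               # unit rotation axis
    K = np.array([[0.0, -k[2],  k[1]],
                  [k[2],  0.0, -k[0]],
                  [-k[1], k[0],  0.0]])         # hat (skew) operator
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

R = so3_exp(np.array([0.0, 0.0, np.pi / 2]))    # 90 deg about z
assert np.allclose(R @ R.T, np.eye(3))          # orthogonal
assert np.allclose(np.linalg.det(R), 1.0)       # proper rotation
print(R @ np.array([1.0, 0.0, 0.0]))            # x-axis rotates onto y-axis
```

Gauss-Newton and Levenberg-Marquardt update poses by composing such exponentials of small increments, which keeps iterates on the manifold instead of re-normalising matrices.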
Motion Model
- Proprioceptive sensor: IMU, Wheel
- Odometry (pose)
Observation Model
- Exteroceptive sensor: Camera, LiDAR
- Landmark (Map)
- Joint optimisation, MLE & MAP
Factor Graph Optimisation
Mapping
- Point cloud, Occupancy grid mapping, TSDF, Surfel, Voxel map
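Occupancy grid mapping reduces to additive log-odds updates per cell. A sketch with assumed sensor-model increments (the 0.85 / -0.4 values are illustrative, not from any particular sensor):

```python
import numpy as np

# Log-odds occupancy update: each observation adds a fixed increment;
# the grid stays additive and converts back to P(occupied) via a sigmoid.
L_OCC, L_FREE = 0.85, -0.4      # assumed inverse-sensor-model log-odds

grid = np.zeros((5, 5))          # log-odds 0 <=> P = 0.5 (unknown)
for _ in range(3):
    grid[2, 2] += L_OCC          # cell hit by 3 range returns
    grid[2, 1] += L_FREE         # cell traversed by 3 free rays

prob = 1.0 / (1.0 + np.exp(-grid))
print(round(prob[2, 2], 3), round(prob[2, 1], 3))  # confident occupied / free
```

TSDF and surfel maps follow the same incremental-fusion idea with richer per-cell state (signed distance + weight, or oriented discs).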
Sensors
- Camera device: Wide/telecentric lens, Lens MTF, CCD/CMOS, Rolling/Global shutter, Exposure/ISO, Stereovision, RGB-D, Structured light, Active IR/ToF
- LiDAR → Visual-LiDAR fusion
- IMU → VIO
- RADAR → Sensor fusion, Extended Kalman filter
- Sonar
- Multi-sensor calibration: Camera-IMU, Camera-LiDAR
Evaluation
- Metrics: ATE (Absolute Trajectory Error), RPE (Relative Pose Error)
- Datasets: KITTI, TUM RGB-D, EuRoC
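ATE compares aligned trajectories point-by-point. A simplified sketch using translation-only alignment (the standard TUM/EuRoC tooling fits a full SE(3) or Sim(3) via Umeyama before computing the RMSE):

```python
import numpy as np

def ate_rmse(gt, est):
    """Absolute Trajectory Error RMSE after removing the mean offset
    (simplified: real ATE aligns with an SE(3)/Sim(3) Umeyama fit)."""
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    err = np.linalg.norm(gt - est_aligned, axis=1)
    return np.sqrt(np.mean(err ** 2))

t = np.linspace(0, 1, 50)
gt = np.stack([t, np.zeros_like(t), np.zeros_like(t)], axis=1)
est = gt + np.array([0.3, 0.0, 0.0])   # constant offset: removed by alignment
print(ate_rmse(gt, est))               # ~0: alignment absorbs the offset
```

RPE instead compares relative motions over fixed time/distance deltas, so it measures local drift rather than global consistency.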
Next Levels
Monocular SLAM · VIO/VINS · Stereo SLAM · Visual-LiDAR Fusion · RGB-D SLAM · Collaborative SLAM · Deep SLAM/Localization
Level 3: Monocular Visual-SLAM
Key Concepts
- VO vs SLAM — VO is local (no loop closure), SLAM includes global map + loop closure
- Scale ambiguity — Fundamental limitation of monocular SLAM; absolute scale is unrecoverable from images alone
- Covisibility graph — Shared map point visibility between keyframes; core data structure in ORB-SLAM
- Visual Place Recognition (VPR) — Recognising previously visited places for loop closure
- Self-supervised depth — Learning monocular depth without ground truth (Monodepth2, Godard 2019)
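Scale ambiguity can be shown in three lines: scaling the whole scene and the camera translation by the same factor leaves every pixel unchanged. A sketch with assumed intrinsics and an identity rotation:

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])

def project(X, t):
    """Project 3D point X seen from a camera translated by -t
    (identity rotation, assumed for simplicity)."""
    x = K @ (X + t)
    return x[:2] / x[2]

X = np.array([0.5, 0.2, 3.0])          # scene point
t = np.array([0.1, 0.0, 0.0])          # baseline between two views
for s in (1.0, 2.0, 10.0):
    print(project(s * X, s * t))       # identical pixels for every scale s
```

This is exactly why monocular SLAM drifts in scale, and why stereo baselines or IMU accelerometers (Levels 6 and 8) restore metric scale.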
Feature-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| Visual Odometry | Nister 2004 | Fundamental matrix, Triangulation, VO (local-only, no loop closure) |
| PTAM | Klein & Murray 2007 | FAST feature, Tracking, Frontend/Backend separation, Parallel threads, Keyframe, Mapping, Bundle adjustment, Manual initialisation |
| Visual SLAM: Why filter? | Strasdat 2012 | Bundle adjustment, Scale-aware BA, Motion-only BA |
| ORB-SLAM | Mur-Artal 2015 | ORB keypoint, Automatic initialisation (Homography vs Fundamental selection), Tracking thread, Sliding-window BA, Local mapping, Large-scale, Loop closure, Bag of visual words, Global optimisation, Covisibility graph, Map point management (culling, merging) |
| Pop-up SLAM | Yang 2016 | Line/Plane features |
| PL-SLAM | Pumarola 2017 | Point/Line features |
| ORB-SLAM2 | Mur-Artal 2017 | → Stereo SLAM, → RGB-D SLAM |
| CubeSLAM | Yang 2019 | Monocular 3D cuboid detection + SLAM, 9-DoF object representation |
| OpenVSLAM | Sumikura 2019 | — |
| Stella-VSLAM | (fork) 2021 | OpenVSLAM successor, license reboot |
| UcoSLAM | Munoz-Salinas 2019 | Fiducial markers |
| DeepFusion | Laidlow 2019 | — |
| ORB-SLAM3 | Campos 2020 | Monocular + Stereo + VIO, Multi-map, IMU integration |
| DXSLAM | Li 2020 | Deep features for SLAM |
| PyCuVSLAM | NVIDIA 2026 | Python + CUDA GPU-accelerated VSLAM toolkit (cuVSLAM wrapper) |
Direct SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DTAM | Newcombe 2011 | Dense mapping, Keyframe mapping, GPGPU |
| LSD-SLAM | Engel 2014 | Photometric error minimisation, High gradient pixels/edges, Large scale, Loop closure, Pose graph optimisation |
| DSO | Engel 2016 | Photometric bundle adjustment, Sliding window BA, No loop closure/global optimisation |
| LDSO | Gao 2018 | DSO + Loop closure (BoW-based), addresses DSO's main weakness |
| CNN-SLAM | Tateno 2017 | Depth from LSD-SLAM + deep depth, Semantic label |
| DVSO | Yang 2018 | Deep single image depth estimation, StackNet |
| Basalt | Usenko 2020 | Non-linear factor recovery (→ primarily VIO, see Level 6) |
| D3VO | Yang 2020 | Deep single image depth estimation, Deep pose, Deep aleatoric uncertainty |
Hybrid (Feature + Direct)
| System | Author/Year | Key Concepts |
|---|---|---|
| SVO | Forster 2014 | FAST feature detection, Direct-based feature tracking, Bundle adjustment |
| SVO2 | Forster 2017 | Multi-camera/Fisheye, Probabilistic depth estimation, Direct method convergence, Sparse method |
| Stereo DSO | Wang 2017 | → Stereo SLAM |
| VI-DSO | von Stumberg 2018 | → VIO/VINS |
Learning-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DROID-SLAM | Teed 2021 | Differentiable BA, dense optical flow, end-to-end learned |
| TartanVO | Wang 2021 | Generalizable visual odometry |
| DPV-SLAM / DPVO | Teed 2023 | DROID-SLAM lightweight, patch-based visual odometry |
| MAC-VO | Qu 2024 | Learning-based VO, metric-aware |
| VoT | Yugay 2025 | Visual Odometry with Transformers |
Foundation Model SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DUSt3R | Wang 2024 | Pointmap regression from image pairs, no calibration needed |
| MASt3R | Leroy 2024 | DUSt3R + local feature matching |
| MASt3R-SLAM | Murai 2025 | Real-time dense SLAM from MASt3R |
| VGGT | Wang (Meta) 2025 | Feed-forward inference of poses, depths, pointmaps, tracks from N views (CVPR 2025 Best Paper) |
| VGGT-SLAM | 2025 | VGGT as frontend for real-time SLAM |
| VGGT-SLAM 2.0 | 2026 | Improved VGGT-SLAM |
| VGGT-Geo | 2026 | Probabilistic geometric fusion of VGGT priors for dense indoor SLAM |
| IGGT | Li 2026 | VGGT + VLM — language-grounded 3D geometry |
| AMB3R | Wang 2025 | MASt3R frontend + Transformer backend for SfM/SLAM |
| MASt3R-Fusion | WHU 2025 | MASt3R-SLAM + IMU + GNSS fusion |
SfM Tools
| System | Author/Year | Key Concepts |
|---|---|---|
| InstantSfM | 2025 | GPU-accelerated SfM pipeline, 40× faster than COLMAP |
Neural Representation SLAM
NeRF-based
| System | Author/Year | Key Concepts |
|---|---|---|
| iMAP | Sucar 2021 | First NeRF-SLAM, single MLP, real-time tracking/mapping |
| BARF | Lin 2021 | Bundle-Adjusting NeRF, coarse-to-fine positional encoding, joint pose+NeRF opt (not full SLAM — pose+NeRF co-optimization) |
| NICE-SLAM | Zhu & Peng 2022 | Hierarchical feature grid (coarse/mid/fine), scalable |
| Co-SLAM | Wang 2023 | Hash grid (Instant-NGP) + coordinate encoding, 5-10× faster than NICE-SLAM |
| ESLAM | Johari 2023 | Tri-plane representation, O(N²) vs O(N³) memory |
| Point-SLAM | Sandström 2023 | Neural point cloud based |
| NeRF-SLAM | Rosinol 2023 | NeRF + classical SLAM pipeline |
| NICER-SLAM | Zhu 2024 | RGB-only NeRF-SLAM (no depth sensor), monocular depth integration |
| vMAP | Kong 2023 | Object-level NeRF-SLAM, per-object neural fields |
| GO-SLAM | Zhang 2023 | Global optimization + NeRF-SLAM, loop closure + global BA |
3DGS-based
| System | Author/Year | Key Concepts |
|---|---|---|
| SplaTAM | Keetha 2024 | First 3DGS-SLAM, RGB-D, silhouette-guided densification |
| MonoGS | Matsuki 2024 | Monocular 3DGS-SLAM, depth network + triangulation fusion |
| GS-ICP SLAM | Yu 2024 | Gaussian-to-Gaussian ICP (Mahalanobis distance), geometric tracking |
| Photo-SLAM | Huang 2024 | Explicit geometry + implicit appearance (MLP color), anti-aliasing |
| RTG-SLAM | 2024 | Real-time focus, adaptive Gaussian budget, Jetson Orin 25 FPS |
| EGG-Fusion | ZJU 2025 | Gaussian surfel fusion, information-filter-based, real-time 24 FPS |
| Online-Mono-3DGS (MODP) | 2025 | ORB-SLAM3 tracking + Hierarchical Gaussian Management |
| ActiveSplat | Li 2025 | Active mapping with 3DGS + Voronoi-based path planning |
| Open-S3SLAM | 2026 | Open-set semantic 3DGS SLAM for smartphones (ICRA 2026) |
| LEGS | 2025 | Language Embedded Gaussian Splats, real-time language-queryable 3D |
Semantic / Language-Grounded SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| ConceptFusion | Jatavallabhula (MIT) 2023 | CLIP features fused into 3D map, open-vocabulary language queries |
| LERF | Kerr 2023 | Language Embedded Radiance Fields, DINO multi-scale, NeRF + CLIP |
| OpenScene | Peng (ETH) 2023 | Language features back-projected to 3D point clouds |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM spatial relations |
| SpatialLLM | Mao 2025 | Point cloud → LLM, structured indoor modeling as Python scripts |
Also see: LEGS, Open-S3SLAM (3DGS-based section above); Open-YOLO 3D (Level 5 Object Detection)
Level 4: RGB-D Visual-SLAM
RGB-D Camera Devices
- Intel RealSense D series
- Microsoft Kinect v1/v2
- Azure Kinect DK
- Occipital Structure Core
- Orbbec Astra
GPGPU Programming
- CUDA, OpenGL GLSL
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| ICP | Besl & McKay 1992 | — |
| DTAM | Newcombe 2011 | — |
| KinectFusion | Newcombe 2011 | GPGPU, Tracking (project depth → 3D, surface normal, coarse-to-fine ICP), Mapping (volumetric integration, TSDF), Robust to small scene changes, Cannot model deformation, Map growth cubic, Room-size only |
| Double Window Optimisation | Strasdat 2011 | — |
| Kintinuous | Whelan 2012 | Volume shift, Geometric, Photometric, dBoW+SURF, Optimisation, Loop closure |
| RGBD-SLAM-V2 | Endres 2013 | Tracking (colour image, visual features, depth image, point cloud, transformation), Mapping (OctoMap 2013) |
| SLAM++ | Salas-Moreno 2013 | Object-oriented SLAM |
| DVO | Kerl 2013 | Keyframe, Depth, Direct method, Optimisation, Loop closure |
| RTAB-Map | Labbé 2014 | Loop closure, Map merge, Multi-session memory management |
| MRS-Map | Stückler 2014 | — |
| ElasticFusion | Whelan 2015 | Active: frame-to-model tracking (photometric + geometric), joint optimisation, fused surfel-based model reconstruction · Inactive: local loop closure (model-to-model local surface, submodel separation), global loop closure (randomised fern encoding, non-rigid space deformation) |
| DynamicFusion | Newcombe 2015 | 6D motion field, Deformable scene |
| ORB-SLAM2 | Mur-Artal 2016 | Bundle adjustment, Sparse reconstruction |
| BundleFusion | Dai 2016 | Local-to-global optimisation, Sparse RGB feature, Coarse global pose estimation, Fine pose refinement (geometric + photometric) |
| SemanticFusion | McCormac 2016 | Deep Learning CNN, Deep Semantic SLAM |
| InfiniTAM v3 | Prisacariu 2017 | Tracking (scene raycast, depth image, RGB image), Relocalisation (random ferns), Mapping (TSDF reconstruction, voxel hashing, surfel reconstruction) |
| Fusion++ | McCormac & Clark 2018 | Deep Learning CNN, Mask-RCNN instance segmentation, Object-level SLAM, No prior, Object-level TSDF reconstruction |
| PointFusion / DenseFusion | Xu 2018 / Wang 2019 | RGB-D object pose estimation, Tracking, Relocalisation, Loop closure detection |
| BAD SLAM | Schöps 2019 | Direct bundle adjustment, surfel-based RGB-D SLAM |
| RTAB-Map v2 | Labbé 2019 | RGB-D/LiDAR, Light-source detection (2016) |
| MoreFusion | Wada & Sucar 2020 | DL instance segmentation, Object-level volumetric fusion, Volumetric pose prediction, 3D scene reconstruction, Collision-based refinement, Semantic SLAM, Object pose estimation, CAD object fitting |
| NodeSLAM | Wada & Sucar 2020 | Occupancy VAE, Object-level SLAM (→ also in Level 5 Latent Representation) |
| Kimera / 3D Dynamic Scene Graph | Rosinol 2020 | Kimera-VIO, Kimera-Mesher, Kimera-PGMO, Kimera-Semantics, Kimera-DSG |
| DSP-SLAM | Wang (UCL) 2021 | DeepSDF shape prior + ORB-SLAM2, object-level dense reconstruction (mono/stereo/LiDAR) |
Level 5: Applying Deep Learning
Level 5 is organized into four pillars:
A. Frontend — learned perception components replacing hand-crafted modules
B. Backend — learned/certifiable optimization replacing classical solvers
C. Systems — end-to-end deep VO/SLAM pipelines
D. Scene Understanding — semantic, language, and relational reasoning on SLAM maps
A. Deep Frontend — Perception
Feature Detection & Matching
| System | Author/Year | Key Concepts |
|---|---|---|
| NetVLAD | Arandjelovic 2016 | VLAD, place recognition |
| SuperPoint | DeTone 2017 | Homographic Adaptation, Self-supervised, VGG encoder + detector/descriptor heads |
| HardNet | Mishchuk 2017 | Learned local descriptor |
| R2D2 | Revaud 2019 | Repeatable + Reliable detector/descriptor, explicit repeatability/reliability maps |
| KeyNet | Barroso-Laguna 2019 | Learned keypoint detector |
| HF-Net | Sarlin 2019 | Global feature, Local feature, Visual localization |
| SuperGlue | Sarlin 2020 | Self/Cross-attention GNN, Sinkhorn optimal assignment, dustbin for outliers |
| DISK | Tyszkiewicz 2020 | Policy gradient (RL) training, match success/failure as reward |
| Patch NetVLAD | Hausler 2021 | Multi-scale patch-level VLAD |
| LoFTR | Sun 2021 | Detector-free, Transformer coarse-to-fine dense matching |
| LightGlue | Lindenberger 2023 | Adaptive depth/width, 5-10× faster than SuperGlue |
| XFeat | Potje 2024 | 0.3M params, 1400 FPS (RTX 4090), 64-dim descriptor, embedded-friendly |
| RoMA | Edstedt 2024 | DINOv2 foundation feature + coarse-to-fine dense matching |
| DeDoDe | Edstedt 2024 | Joint detect-and-describe in one stage |
| RoMA V2 | Edstedt 2026 | Improved RoMA |
Depth Estimation
| System | Author/Year | Key Concepts |
|---|---|---|
| MonoDepth | Godard 2016 | Left-Right photometric consistency, self-supervised |
| MiDaS | Ranftl 2020 | Multi-dataset mixing, scale-and-shift invariant loss, relative depth |
| DPT | Ranftl 2021 | Dense Prediction Transformer (ViT backbone), global context |
| ZoeDepth | Bhat 2023 | Zero-shot metric depth, Metric Bins Module |
| Metric3D | Yin 2023 | Camera intrinsic-conditioned metric depth, Canonical Camera Space |
| Depth Anything | Yang 2024 | 62M images, foundation model for monocular depth |
| Depth Anything V2 | Yang 2024 | Improved with synthetic data, better edge preservation |
| Marigold | Ke 2024 | Stable Diffusion for depth, fine detail, uncertainty via sampling |
| Align3r | Melou 2025 | Video temporal consistency, DUSt3R-based, CVPR 2025 Highlight |
| Masked Depth Modeling (LingBot-Depth) | 2026 | Fixes RGB-D failures on glass/mirrors/metal |
Optical Flow & Scene Flow
| System | Author/Year | Key Concepts |
|---|---|---|
| FlowNet | Dosovitskiy 2015 | First end-to-end deep optical flow (SimpleNet / CorrNet) |
| FlowNet 2.0 | Ilg 2017 | Stacked networks, classical-level accuracy |
| PWC-Net | Sun 2018 | Pyramid-Warping-Cost volume, coarse-to-fine, 8.4M params |
| FlowNet3D | Liu 2019 | Point cloud scene flow, PointNet++ based |
| RAFT | Teed 2020 | All-Pairs Correlation + iterative ConvGRU update, ECCV Best Paper |
| RAFT-3D | Teed 2021 | Scene flow (3D motion) from RAFT |
| FlowFormer | Huang 2022 | Transformer on cost volume tokens, global context |
| SEA-RAFT | 2024 | Efficient RAFT variant for real-time |
Camera Pose Regression & Relocalization
| System | Author/Year | Key Concepts |
|---|---|---|
| PoseNet | Kendall 2015 | CNN-based 6-DoF pose regression (APR), GoogLeNet backbone |
| DSAC | Brachmann 2017 | Differentiable RANSAC, Scene Coordinate Regression (SCR) |
| DSAC++ | Brachmann 2018 | Self-supervision, RGB-D support |
| CNN Pose Regression Limitations | Sattler 2019 | Pose regression ≈ image retrieval performance |
| LM-Reloc | von Stumberg 2020 | Deep direct relocalization |
| DSAC* | Brachmann 2021 | Improved learning stability |
| ACE | Brachmann 2023 | Accelerated Coordinate Encoding, 5-min training per scene |
| ACE Zero | Brachmann 2024 | Zero-shot SCR, no pre-built 3D map needed |
| ACE-G | Brachmann 2024 | Generalizable SCR via cross-attention, new scenes without fine-tuning |
| ACE-SLAM | Tang 2024 | Neural implicit real-time SLAM, network weights = map |
| hloc | Sarlin 2019+ | Hierarchical Localization: coarse (NetVLAD) → fine (SuperGlue) pipeline |
Object Detection & Segmentation for SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| YOLO (v1→v11) | Redmon 2016→2024 | Real-time object detection, Ultralytics ecosystem |
| DETR | Carion 2020 | Transformer detection, anchor-free, no NMS |
| RT-DETR | Lv (Baidu) 2023 | Real-time DETR, YOLO-speed + Transformer quality |
| SAM | Kirillov 2023 | Segment Anything, prompt-based, Foundation Model |
| SAM 2 | Meta 2024 | Video segmentation, Memory Attention, temporal consistency |
| Grounding DINO | Liu 2023 | Text-prompted detection → SAM pipeline (Grounded SAM) |
| Open-YOLO 3D | Benseddik 2025 | 2D open-vocab detection → 3D instance seg, 16× faster |
B. Deep Backend — Optimization
Differentiable Bundle Adjustment
| System | Author/Year | Key Concepts |
|---|---|---|
| BA-Net | Tang 2019 | FPN + differentiable LM layer, end-to-end SfM (ICLR) |
| DROID-SLAM | Teed 2021 | Dense optical flow + differentiable dense BA, all-pixels reprojection |
| DPVO | Teed 2023 | Patch-based DROID-SLAM, 30+ FPS real-time |
| Theseus | Pineda (Meta) 2022 | Differentiable nonlinear optimization library (PyTorch) |
| Lietorch | Teed 2021 | Lie group operations for PyTorch (SE(3)/SO(3)) |
Certifiably Optimal Algorithms
| System | Author/Year | Key Concepts |
|---|---|---|
| SE-Sync | Rosen 2019 | Certifiable pose graph optimization via SDP + Riemannian opt |
| TEASER++ | Yang 2020 | Point cloud registration, 90%+ outlier robust, TLS + Max Clique (T-RO/RSS 2020) |
| GNC | Yang 2020 | Graduated Non-Convexity, continuation from convex → robust cost |
| QUASAR | Yang 2022 | Certifiable rotation averaging, SDP + robust cost |
Gaussian Belief Propagation & Graph Processors
| System | Author/Year | Key Concepts |
|---|---|---|
| FutureMapping 1 | Davison 2018 | Computational structure of Spatial AI, GBP for SLAM |
| FutureMapping 2 | Ortiz 2019 | GBP as core Spatial AI primitive, visual intro to GBP |
| BA on Graph Processor | Ortiz 2020 | Bundle Adjustment on Graphcore IPU, tile-based parallelism |
| DANCeRS | 2023 | GBP-based distributed consensus in robot swarms |
C. End-to-End Deep VO / SLAM Systems
Self-supervised & Learned VO
| System | Author/Year | Key Concepts |
|---|---|---|
| DeepVO | Wang 2017 | Supervised learning |
| SfM-Learner | Zhou 2017 | Unsupervised, deep depth + deep pose |
| DeMoN | Ummenhofer 2017 | Depth + Motion from two frames, encoder-decoder |
| UnDeepVO | Li 2018 | Stereo self-supervised, absolute scale recovery |
| DeepTAM | Zhou 2018 | Deep tracking and mapping, cost volume based |
| DeepV2D | Teed 2018 | Iterative depth from video, differentiable geometry layers |
| Depth from Video in the Wild | Gordon 2019 | Unconstrained video depth, learned camera intrinsics |
| Neural Ray Surfaces | Vasiljevic 2020 | Learned ray surface model, non-pinhole cameras |
| GradSLAM | Murthy 2020 | Differentiable SLAM framework (PyTorch, supports multiple SLAM backends) |
| DeepSLAM | Wang 2020 | TrackingNet, MappingNet, LoopNet |
| MonoRec | Wimbauer 2021 | Self-supervised monocular 3D reconstruction, moving objects |
| TANDEM | Koestler 2021 | Real-time tracking + dense mapping via MVS depth, DSO-based |
| DROID-SLAM | Teed 2021 | Dense BA + correlation, SOTA on TartanAir/EuRoC (→ see Differentiable BA) |
| DPVO | Teed 2023 | Patch-based lightweight DROID (→ see Differentiable BA) |
Latent Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| CodeSLAM | Bloesch 2018 | Depth as 128-dim latent code, photometric BA on codes + poses |
| SceneCode | Zhi 2019 | Depth + semantic in single latent code, cross-modal constraints |
| DeepFactors | Czarnowski 2020 | Probabilistic depth codes + factor graph, GPU 30+ FPS |
| NodeSLAM | Sucar 2020 | Object-level DeepSDF codes, occupancy VAE per object |
| CodeMapping | Shao 2021 | Sparse SLAM + learned dense mapping, hybrid approach |
Neural Rendering (reference)
NeRF/3DGS-based SLAM systems → see Level 3: Neural Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| NeRF | Mildenhall 2020 | Neural Radiance Fields, novel view synthesis (foundational) |
| DIFIX3D+ | 2026 | Single-step diffusion for 3D reconstruction artifact removal (post-processing) |
D. Scene Understanding
Benchmarks & Foundations
| System | Author/Year | Key Concepts |
|---|---|---|
| EFM3D | Straub (Meta) 2024 | Egocentric Foundation Model 3D benchmark, depth/surface/semantic from ego-video |
3D Scene Graph
| System | Author/Year | Key Concepts |
|---|---|---|
| Hydra | Hughes (MIT SPARK) 2022 | Real-time hierarchical Scene Graph (mesh→objects→places→rooms→buildings) |
| Hydra-Multi | Hughes 2023 | Distributed multi-robot 3D Scene Graph |
| Clio | Maggio (MIT SPARK) 2024 | Open-set task-driven Scene Graph, CLIP embeddings per node |
| Khronos | Schmid (MIT SPARK) 2024 | Spatio-temporal Scene Graph, dynamic object history tracking |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM relations (→ also in L3 Semantic) |
Level 6: VIO / VINS
Key Concepts
- Tightly-coupled vs Loosely-coupled — Joint vs separate optimization of visual and inertial measurements
- Filter-based vs Optimization-based — EKF approaches vs nonlinear optimization (BA)
- IMU preintegration — On-manifold IMU integration between keyframes (Forster 2015)
- IMU noise model — Bias, random walk, Allan variance
- Observability — Yaw and global position are unobservable in VIO
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| 📖 Introduction to Inertial Navigation | Woodman 2007 | IMU fundamentals, coordinate frames, error sources — essential prerequisite |
| IMU Preintegration on Manifold | Forster 2015 | On-manifold preintegration, bias correction without re-integration |
| Quaternion kinematics for error-state KF | Sola 2017 | Quaternion math, error-state formulation |
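A quick numeric illustration of why the inertial side cannot stand alone: double-integrating accelerometer output turns even a small constant bias into metres of position drift within seconds. A sketch with assumed noise figures (0.05 m/s² bias, 0.02 m/s² white noise at 200 Hz):

```python
import numpy as np

# Dead-reckoning from a biased accelerometer while the true motion is zero.
rng = np.random.default_rng(0)
dt, n = 0.005, 2000                    # 200 Hz for 10 s
bias = 0.05                            # assumed constant accel bias (m/s^2)
acc = bias + rng.normal(0, 0.02, n)    # measured acceleration (truth = 0)

v = np.cumsum(acc) * dt                # first integration: velocity error
p = np.cumsum(v) * dt                  # second integration: position error
print(round(p[-1], 2))                 # metres of drift after only 10 s
```

Bias alone contributes roughly ½·b·t² ≈ 2.5 m here. Visual landmarks bound this drift, and preintegration (Forster 2015) lets the optimiser re-linearise bias estimates without re-integrating raw IMU samples.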
Filter-based
| System | Author/Year | Key Concepts |
|---|---|---|
| MSCKF | Mourikis 2007 | Multi-State Constraint KF, efficient VIO without landmarks in state |
| ROVIO | Bloesch 2015 | Robocentric VIO, direct photometric tracking + EKF |
| OpenVINS | Geneva 2020 | Open-source MSCKF, modular, extensible |
Optimization-based
| System | Author/Year | Key Concepts |
|---|---|---|
| OKVIS | Leutenegger 2015 | Keyframe-based, tightly-coupled, sliding window optimization |
| VINS-Mono | Qin 2018 | Tightly-coupled, relocalization, loop closure, pose graph optimization |
| VINS-Fusion | Qin 2019 | Stereo + GPS fusion extension |
| MAPLAB | Schneider 2018 | Multi-session visual-inertial mapping framework |
| Kimera-VIO | Rosinol 2020 | Fast VIO frontend for Kimera pipeline, structureless vision factors |
| Basalt | Usenko 2020 | Non-linear factor recovery, visual-inertial odometry + mapping |
| ORB-SLAM3 | Campos 2020 | VIO mode, multi-map, IMU initialization |
| DM-VIO | von Stumberg 2022 | Deep monocular VIO, delayed marginalization |
| OKVIS2 | Leutenegger 2022 | Multi-session, improved marginalization |
| AirVO | Xu 2023 | Point-line VIO, illumination-robust |
| OKVIS2-X | Boche & Leutenegger 2025 | Multi-sensor SLAM (Visual+Inertial+Depth+LiDAR+GNSS), dense volumetric occupancy maps, submapping for large-scale (9km+), EuRoC/Hilti22 SOTA |
Level 7: World Models & Spatial AI
World Models
| System | Author/Year | Key Concepts |
|---|---|---|
| GAIA-1 | Wayve 2023 | Driving World Model, action-conditioned future scene generation |
| Sora / DiT | OpenAI 2024 | Diffusion Transformer, spacetime patches, emergent 3D understanding |
| NVIDIA Cosmos | NVIDIA 2026 | World Foundation Model platform for Physical AI, synthetic data for AV/robots |
| World Labs / Marble | Fei-Fei Li 2026 | 3D world generation from images/video/text ($1B funding) |
| WorldVLA | Alibaba 2025 | Autoregressive action world model, learns physics for action generation |
| SceneDINO | 2025 | Feed-forward unsupervised semantic scene completion |
Generative 3D
| System | Author/Year | Key Concepts |
|---|---|---|
| DreamFusion | Poole 2023 | Text-to-3D via Score Distillation Sampling (SDS) + NeRF |
Vision-Language Models (VLM)
| System | Author/Year | Key Concepts |
|---|---|---|
| CLIP | Radford (OpenAI) 2021 | Contrastive image-text pretraining, 400M pairs, zero-shot |
| SigLIP | Zhai (Google) 2023 | Sigmoid loss CLIP, more efficient, better at small model sizes |
| BLIP-2 | Li (Salesforce) 2023 | Q-Former bridges frozen LLM + image encoder |
| LLaVA | Liu 2023 | LLaMA + vision, conversational VLM |
Vision-Language-Action Models (VLA)
| System | Author/Year | Key Concepts |
|---|---|---|
| RT-2 | Brohan (DeepMind) 2023 | Robot actions as text tokens, emergent generalization |
| OpenVLA | Kim 2024 | Open-source VLA, SigLIP + Llama 7B + Action Head |
| NaVILA | 2024 | Navigation-specialized VLA, SLAM integration for localization |
Resources
| Resource | Key Concepts |
|---|---|
| Awesome-Transformer-based-SLAM | Curated GitHub list of Transformer-based SLAM methods |
Level 8: Stereo SLAM
Key Concepts
- Stereo rectification — Epipolar alignment for efficient disparity search
- Disparity vs Depth — d = f·B/Z, baseline determines depth range/accuracy
- Scale observability — Stereo provides metric scale (unlike monocular)
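The disparity-depth relation above is worth computing once by hand. A sketch with assumed stereo parameters (f = 500 px, B = 0.12 m), also showing how a one-pixel disparity error blows up with distance:

```python
import numpy as np

# Stereo depth from disparity: Z = f * B / d.
f, B = 500.0, 0.12                    # assumed focal length (px), baseline (m)

d = np.array([60.0, 6.0, 3.0])        # disparities in pixels
Z = f * B / d
print(Z)                              # → [ 1. 10. 20.] metres

# Depth change caused by a one-pixel disparity error at each depth:
dZ = f * B / (d - 1) - Z
print(np.round(dZ, 3))                # error grows roughly as Z^2 / (f * B)
```

This quadratic error growth is why stereo rigs have a useful depth range tied to their baseline, and why long-range depth usually needs LiDAR.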
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| S-PTAM | Pire 2017 | Stereo PTAM, ROS-compatible, real-time |
| ORB-SLAM2 (stereo) | Mur-Artal 2016 | Stereo + RGB-D modes, loop closure, relocalization |
| StereoMSCKF | Sun 2018 | MSCKF with stereo, efficient for resource-constrained platforms |
| RTAB-Map | Labbé 2019 | Multi-sensor (stereo/RGB-D/LiDAR), memory management, large-scale |
| ORB-SLAM3 (stereo) | Campos 2020 | Multi-map, Atlas, stereo + IMU |
| Stella-VSLAM | Community 2022 | Open-source fork of OpenVSLAM, stereo support |
| LDSO | Gao 2018 | DSO + loop closure (monocular, not stereo; see Level 3 Direct SLAM) |
Level 9: Collaborative / Multi-Robot SLAM
Key Concepts
- Centralized vs Decentralized — Single server vs peer-to-peer map merging
- Inter-robot loop closure — Place recognition across robots with different viewpoints
- Communication constraints — Bandwidth-limited map sharing, sparse descriptors
- Map merging — Aligning submaps from different robots into a global map
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| C2TAM | Riazuelo 2014 | Cloud-based collaborative monocular SLAM |
| CCM-SLAM | Schmuck & Chli 2019 | Centralized collaborative monocular SLAM, robust to comm failures |
| DOOR-SLAM | Lajoie 2020 | Distributed, outlier-resilient SLAM with pairwise consistency |
| Kimera-Multi | Tian 2022 | Distributed multi-robot metric-semantic SLAM, mesh reconstruction |
| Swarm-SLAM | Lajoie 2024 | Decentralized, sparse, scalable C-SLAM, supports LiDAR/stereo/RGB-D |
| CoPeD-Advancing | Stathoulopoulos 2024 | Multi-robot collaborative perception for autonomous exploration |
| MAPLAB 2.0 | Cramariuc 2023 | Multi-session, multi-robot visual-inertial mapping |
Level 10: LiDAR & Visual-LiDAR Fusion SLAM
Key Concepts
- LiDAR-Visual-Inertial (LVI) — Triple fusion for robust outdoor SLAM
- Tightly-coupled LiDAR-camera — Joint optimization of point cloud and visual features
- Direct LiDAR-camera alignment — Photometric/geometric alignment without feature extraction
- Degradation handling — Graceful fallback when one modality fails (e.g., LiDAR in rain, camera in darkness)
- Range image — 2D projection of LiDAR scans for efficient processing (SuMa, RangeNet++)
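The range-image idea from the list is a spherical projection of the point cloud. A small sketch of the SuMa / RangeNet++-style projection, with an assumed ±15° vertical field of view and a tiny 16×360 image:

```python
import numpy as np

def to_range_image(points, h=16, w=360):
    """Project an (N, 3) LiDAR point cloud into a 2D range image."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                  # [-pi, pi]
    pitch = np.arcsin(z / r)
    fov_up, fov_down = np.radians(15.0), np.radians(-15.0)  # assumed FoV
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w   # column from yaw
    v = ((fov_up - pitch) / (fov_up - fov_down) * h).clip(0, h - 1).astype(int)
    img = np.full((h, w), np.nan)
    img[v, u] = r                                           # store range
    return img

pts = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
img = to_range_image(pts)
print(np.count_nonzero(~np.isnan(img)))   # both points land in the image
```

Once scans live on a 2D grid, neighbourhood lookups, normal estimation, and CNN inference all become cheap image operations.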
LiDAR / LiDAR-Inertial SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LOAM | Zhang 2014 | LiDAR odometry and mapping (foundational), edge + planar features |
| SuMa | Behley (Bonn) 2018 | Surfel-based LiDAR SLAM, projective ICP on range images |
| SuMa++ | Chen (Bonn) 2019 | SuMa + RangeNet++ semantics, semantic ICP weighting, dynamic object filtering |
| LIO-SAM | Shan 2020 | Tightly-coupled LiDAR-inertial, factor graph, GPS fusion |
| FAST-LIO2 | Xu 2022 | Direct LiDAR-inertial, ikd-Tree, extremely fast |
| PIN-SLAM | Pan (Bonn) 2024 | Neural point cloud LiDAR SLAM, point-to-SDF registration, elastic map deformation for loop closure |
Visual-LiDAR Fusion SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LVI-SAM | Shan 2021 | LiDAR-Visual-Inertial via factor graph, LIO-SAM + VINS-Mono |
| R3LIVE | Lin 2022 | Real-time LiDAR-Visual-Inertial, dense RGB point cloud map |
| R3LIVE++ | Lin 2023 | Improved R3LIVE with mesh reconstruction |
| FAST-LIVO | Zheng 2022 | FAST-LIO + direct visual odometry, tightly-coupled LVI |
| FAST-LIVO2 | Zheng 2024 | Improved, sequential image processing, direct photometric fusion |
| OKVIS2-X | Boche 2025 | Visual+Inertial+Depth+LiDAR+GNSS configurable (also in Level 6) |
Resources
| Resource | Key Concepts |
|---|---|
| LiDAR-Visual-Inertial Survey (Zheng 2024) | Comprehensive survey of LVI SLAM systems |
Level 11: Event Camera SLAM
Key Concepts
- Event cameras (DVS) — Asynchronous per-pixel brightness change detection, μs temporal resolution
- Advantages — HDR (140dB+), no motion blur, low latency, low power
- Challenges — No absolute intensity, sparse asynchronous output, requires new algorithms
- Event representations — Event frames, time surfaces, voxel grids, spike tensors
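Two of the representations above take only a few lines: an "event frame" sums polarities per pixel over a window, a "time surface" keeps the latest timestamp per pixel. A toy sketch with hand-written events:

```python
import numpy as np

# Events are (x, y, t, polarity) tuples streamed asynchronously per pixel.
events = np.array([
    # x, y,  t,    p
    [2, 3, 0.01, +1],
    [2, 3, 0.02, +1],
    [5, 1, 0.03, -1],
])

H, W = 8, 8
frame = np.zeros((H, W))     # event frame: polarity sum per pixel
tsurf = np.zeros((H, W))     # time surface: most recent timestamp per pixel
for x, y, t, p in events:
    frame[int(y), int(x)] += p
    tsurf[int(y), int(x)] = t

print(frame[3, 2], frame[1, 5], tsurf[3, 2])
```

Both representations turn the asynchronous stream into dense tensors that frame-based trackers and CNNs can consume, at the cost of discarding some temporal precision.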
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| 📖 Event-based Vision Survey | Gallego 2020 | Comprehensive survey of event camera algorithms |
| Awesome-Event-based-SLAM | KwanWaiPang | Curated GitHub list of event-based SLAM papers |
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| EVO | Rebecq 2017 | Event-based Visual Odometry, 3D reconstruction from events |
| ESVO | Zhou 2021 | Event-based Stereo Visual Odometry |
| Ultimate-SLAM | Vidal 2018 | Events + frames + IMU fusion |
| EKLT | Gehrig 2020 | Event-based KLT feature tracking |
| ESVIO | Chen 2023 | Event-based Stereo VIO |
| EDS | Hidalgo-Carrió 2022 | Event-aided direct sparse odometry |
| DEVO | Pellerito 2024 | Deep event-based visual odometry (DROID-SLAM style) |
| VIO-GO | 2025 | Event-based VIO with optimized parameters for HDR scenarios |