From c0acac871b547c72481e3504fa50ad267d0701af Mon Sep 17 00:00:00 2001 From: zhiqiang-hhhh Date: Tue, 7 Apr 2026 18:51:14 +0800 Subject: [PATCH 1/2] [doc](add) add pq on-disk vector search guide --- docs/ai/vector-search/pq-on-disk.md | 201 +++++++++++++++++ .../current/ai/vector-search/pq-on-disk.md | 204 ++++++++++++++++++ sidebars.ts | 1 + 3 files changed, 406 insertions(+) create mode 100644 docs/ai/vector-search/pq-on-disk.md create mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md diff --git a/docs/ai/vector-search/pq-on-disk.md b/docs/ai/vector-search/pq-on-disk.md new file mode 100644 index 0000000000000..4c823f82de13d --- /dev/null +++ b/docs/ai/vector-search/pq-on-disk.md @@ -0,0 +1,201 @@ +--- +{ + "title": "PQ On-Disk", + "language": "en", + "description": "PQ On-Disk is an ANN index mode in Apache Doris for reranking small post-filter candidate sets, storing PQ codes on disk and using a dedicated chunk cache to reduce memory usage." +} +--- + + + +# PQ On-Disk in Apache Doris + +`pq_on_disk` is an ANN index type in Doris designed for reranking small candidate sets after scalar filtering. It stores PQ codes on disk in rowid order, keeps only the PQ codebook in memory, and computes approximate distances only for rows that have already passed the filter. + +Compared with `ivf` and `ivf_on_disk`, `pq_on_disk` is not a global ANN recall structure. It is designed for queries such as `WHERE user_id = ? ORDER BY l2_distance_approximate(...) LIMIT N`, where the filter first narrows the search scope to a relatively small candidate set and the vector index is then used for fast approximate reranking. + +## Why PQ On-Disk + +Some vector-search workloads do not need ANN to search the whole segment. Instead, they first use ordinary predicates such as `user_id`, `tag`, or other inverted-index filters to reduce the candidate set, and only then need fast Top-N vector ranking inside that filtered subset. 
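To make the mechanism concrete, the following toy sketch shows how PQ codes let approximate L2 distances be computed from small per-query lookup tables instead of raw float vectors. It illustrates the general PQ/ADC idea only, not Doris internals: the tiny `DIM`, `PQ_M`, and `PQ_NBITS` values are hypothetical, and the "codebook" is random rather than k-means-trained to keep the sketch short.

```python
# Toy illustration of PQ + asymmetric distance computation (ADC).
# Not Doris internals: DIM/PQ_M/PQ_NBITS are tiny hypothetical values and the
# "codebook" is random instead of k-means-trained, to keep the sketch short.
import random

random.seed(0)
DIM, PQ_M, PQ_NBITS = 8, 4, 2   # 4 subvectors of length 2, 4 centroids each
SUB = DIM // PQ_M               # sub-dimension per subquantizer
K = 1 << PQ_NBITS               # centroids per subquantizer

codebook = [[[random.random() for _ in range(SUB)] for _ in range(K)]
            for _ in range(PQ_M)]

def encode(vec):
    """Store each subvector as the index of its nearest centroid (the PQ code)."""
    code = []
    for m in range(PQ_M):
        sub = vec[m * SUB:(m + 1) * SUB]
        code.append(min(range(K), key=lambda k: sum(
            (a - b) ** 2 for a, b in zip(sub, codebook[m][k]))))
    return code

def adc_table(query):
    """Precompute query-to-centroid squared distances once per query."""
    return [[sum((a - b) ** 2
                 for a, b in zip(query[m * SUB:(m + 1) * SUB], codebook[m][k]))
             for k in range(K)]
            for m in range(PQ_M)]

def approx_l2_sq(code, table):
    """Per filtered row: PQ_M table lookups instead of DIM float subtractions."""
    return sum(table[m][code[m]] for m in range(PQ_M))

vec = [random.random() for _ in range(DIM)]
query = [random.random() for _ in range(DIM)]
exact = sum((a - b) ** 2 for a, b in zip(vec, query))
approx = approx_l2_sq(encode(vec), adc_table(query))
print(exact, approx)  # approx tracks exact up to quantization error
```

The point of the table step is that its cost is paid once per query, after which every filtered row costs only `PQ_M` lookups and additions.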
+
+`pq_on_disk` is designed for this operating point:
+
+- Work on filtered candidate sets, typically thousands to tens of thousands of rows.
+- Keep memory footprint low by storing PQ codes on disk.
+- Reuse standard SQL distance functions and ANN DDL.
+- Avoid the overhead of maintaining a global IVF or graph structure when the candidate set is already known.
+
+## Scope and User Value
+
+Compared with other ANN index types in Doris, `pq_on_disk` focuses on a different problem:
+
+- `hnsw` and `ivf` are optimized for global ANN retrieval across large vector collections.
+- `ivf_on_disk` keeps the IVF recall model but moves IVF lists to disk to save memory.
+- `pq_on_disk` is optimized for post-filter approximate reranking on small candidate sets.
+
+This makes it useful when:
+
+- The query almost always includes a highly selective scalar filter.
+- Rows for the same filter key have good locality.
+- Full brute-force distance evaluation on the filtered rows is still too expensive.
+- You want lower steady-state memory usage than an in-memory ANN structure.
+
+## User-Facing Interfaces
+
+### 1) Index DDL
+
+Use `index_type="pq_on_disk"` in ANN index properties.
+
+```sql
+CREATE TABLE image_pool (
+    user_id BIGINT NOT NULL,
+    photo_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_emb (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(user_id, photo_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 8
+PROPERTIES ("replication_num" = "1");
+```
+
+Notes:
+
+- `metric_type` supports `l2_distance` and `inner_product`.
+- `dim` is required.
+- `pq_m` is required.
+- `dim` must be divisible by `pq_m`.
+- `pq_nbits` is optional and defaults to `8`.
+- Query syntax remains the same: `l2_distance_approximate` and `inner_product_approximate`.
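As a rough back-of-envelope check of what the example properties above imply, the arithmetic below compares raw vector size, PQ code size, and resident codebook size. This is illustrative only; the actual on-disk layout in Doris includes additional metadata.

```python
# Sizing implied by the DDL example: dim=768, pq_m=96, pq_nbits=8.
dim, pq_m, pq_nbits = 768, 96, 8

raw_bytes_per_row = dim * 4                     # float32 vector
code_bytes_per_row = pq_m * pq_nbits // 8       # one PQ code per row, on disk
codebook_bytes = pq_m * (1 << pq_nbits) * (dim // pq_m) * 4  # kept in memory

print(raw_bytes_per_row)    # 3072
print(code_bytes_per_row)   # 96  -> 32x smaller than the raw vector
print(codebook_bytes)       # 786432 (~768 KB resident per codebook)
```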
+ +### 2) Typical query patterns + +Top-N reranking after filtering: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) +LIMIT 20; +``` + +For inner-product search, sort in descending order: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY inner_product_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) DESC +LIMIT 20; +``` + +Range search is also supported: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 + AND l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) < 5.0 +ORDER BY photo_id; +``` + +The most important usage characteristic is that `pq_on_disk` is intended to work with filters. This is the main scenario where it differs from `ivf_on_disk`. + +### 3) BE cache configuration + +`pq_on_disk` uses a dedicated chunk cache for PQ code data: + +- `ann_index_pq_chunk_cache_limit` (default: `60%`) +- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`) + +The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory. + +## Parameters and Constraints + +### Index parameters + +| Property | Required | Default | Description | +|---|---|---|---| +| `index_type` | Yes | - | Must be `pq_on_disk`. | +| `metric_type` | Yes | - | `l2_distance` or `inner_product`. | +| `dim` | Yes | - | Vector dimension. | +| `pq_m` | Yes | - | Number of PQ subquantizers. Must divide `dim`. | +| `pq_nbits` | No | `8` | Number of bits per subquantizer code. | + +### Training behavior + +`pq_on_disk` needs enough rows to train the PQ codebook. The minimum training row count is: + +```text +(1 << pq_nbits) * 100 +``` + +Examples: + +- `pq_nbits = 8` requires at least `25600` training rows. +- `pq_nbits = 4` requires at least `1600` training rows. 
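The minimum-training-rows rule above can be expressed as a one-line helper (a sketch based directly on the formula in this document):

```python
# Minimum rows needed to train the PQ codebook, per the rule above.
def min_training_rows(pq_nbits: int) -> int:
    return (1 << pq_nbits) * 100

print(min_training_rows(8))  # 25600
print(min_training_rows(4))  # 1600
```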
+ +If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment. + +## Observability + +`pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`. + +When troubleshooting, first check whether queries are actually selective enough and whether the PQ chunk cache is large enough to avoid repeated disk reads on hot candidate ranges. + +## Usage Notes + +- `pq_on_disk` is best suited for selective filter + vector reranking, not global ANN recall. +- It shares the common ANN table constraints in Doris, such as vector column type and ANN expression usage. +- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style predicates. +- Query result ordering follows the metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order. +- Data locality matters. It works best when rows belonging to the same filter key are physically close, so PQ code reads are more sequential. +- For very small segments or very small training sets, the index may not be built and the query can fall back to brute force. + +## Best Practices + +1. Choose `pq_on_disk` when the query pattern is usually `filter first, rerank second`. +2. Keep the filter column selective. The smaller the post-filter candidate set, the more suitable `pq_on_disk` becomes. +3. Choose `pq_m` so that `dim / pq_m` is reasonable and easy to manage. A common starting point is to align `pq_m` with the dimensional decomposition you already use in other PQ-based systems. +4. Start with `pq_nbits = 8` unless you have strong reasons to trade recall for smaller code size. +5. Watch cache effectiveness and latency together. If repeated filtered queries are still I/O-heavy, increase `ann_index_pq_chunk_cache_limit` and retest. +6. Validate on real business data before production rollout, especially for recall quality under your actual filter distribution. 
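For the validation step in practice 6, a simple recall@k helper is often enough: run the same filtered query once with the approximate function and once with an exact brute-force distance, then compare the Top-N ids. This is a generic sketch, not a Doris API.

```python
# Compare Top-k ids from an approximate run against an exact brute-force run.
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact Top-k that the approximate Top-k recovered."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# e.g. ids ordered by l2_distance_approximate vs. ids ordered by exact distance
assert recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], 4) == 0.75
```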
+ +## How to Choose Between `ivf_on_disk` and `pq_on_disk` + +Use `ivf_on_disk` when: + +- You need ANN to search across a large global vector collection. +- Your main tuning model is still `nlist` and `nprobe`. +- Query performance depends on probing a subset of IVF lists. + +Use `pq_on_disk` when: + +- The query already has a selective scalar filter. +- The candidate set after filtering is relatively small. +- You mainly need fast approximate reranking within filtered rows rather than global ANN recall. + +In short, `ivf_on_disk` is a disk-backed global ANN index, while `pq_on_disk` is a disk-backed post-filter reranking index. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md new file mode 100644 index 0000000000000..a414f344c6e0b --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md @@ -0,0 +1,204 @@ +--- +{ + "title": "PQ On-Disk", + "language": "zh-CN", + "description": "PQ On-Disk 是 Apache Doris 面向过滤后小候选集向量重排场景提供的 ANN 索引形态,通过将 PQ codes 存储在磁盘并配合专用 chunk cache,在低内存占用下实现更高效的近似重排。" +} +--- + + + +# Apache Doris 中的 PQ On-Disk + +`pq_on_disk` 是 Doris 面向过滤后小候选集重排场景提供的 ANN 索引类型。它将 PQ codes 按 rowid 顺序存储在磁盘上,只将 PQ codebook 常驻内存,并仅对已经通过标量过滤的候选行计算近似距离。 + +与 `ivf`、`ivf_on_disk` 不同,`pq_on_disk` 不是一个面向全局召回的 ANN 结构。它更适合这类查询:`WHERE user_id = ? ORDER BY l2_distance_approximate(...) 
LIMIT N`。也就是先用过滤条件把候选集缩小,再对这个较小的候选集做快速近似向量重排。 + +## 为什么需要 PQ On-Disk + +有些向量检索场景并不需要 ANN 在整个 segment 上做全局搜索,而是先通过 `user_id`、`tag` 或倒排索引等普通过滤条件把候选行缩小到较小范围,然后才需要在这个过滤后的子集内做 Top-N 向量排序。 + +`pq_on_disk` 就是为这种工作模式设计的: + +- 面向过滤后的候选集,典型规模是几千到几万行。 +- 通过将 PQ codes 存储在磁盘上,降低常驻内存占用。 +- 继续复用 Doris 现有的 SQL 距离函数和 ANN DDL。 +- 当候选集已经比较明确时,避免维护全局 IVF 或图结构带来的额外开销。 + +## 对用户的价值 + +与 Doris 中其他 ANN 索引相比,`pq_on_disk` 解决的是另一类问题: + +- `hnsw` 和 `ivf` 更适合在大规模向量集合上做全局 ANN 召回。 +- `ivf_on_disk` 保留 IVF 的召回模型,只是把 IVF list 主体放到磁盘以节省内存。 +- `pq_on_disk` 则聚焦在过滤后小候选集上的近似重排。 + +它适合以下场景: + +- 查询几乎总是带有高选择性的标量过滤条件。 +- 相同过滤键对应的行具有较好的物理局部性。 +- 即使候选集已经被过滤缩小,暴力计算距离仍然开销较大。 +- 希望比内存型 ANN 结构有更低的常驻内存占用。 + +## 用户接口 + +### 1)建索引 DDL + +通过 `index_type="pq_on_disk"` 创建 ANN 索引。 + +```sql +CREATE TABLE image_pool ( + user_id BIGINT NOT NULL, + photo_id BIGINT NOT NULL, + embedding ARRAY NOT NULL, + INDEX idx_emb (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(user_id, photo_id) +DISTRIBUTED BY HASH(user_id) BUCKETS 8 +PROPERTIES ("replication_num" = "1"); +``` + +说明: + +- `metric_type` 支持 `l2_distance` 和 `inner_product`。 +- `dim` 为必填参数。 +- `pq_m` 为必填参数。 +- `dim` 必须能够被 `pq_m` 整除。 +- `pq_nbits` 为可选参数,默认值为 `8`。 +- 查询语法保持不变,仍使用 `l2_distance_approximate` 和 `inner_product_approximate`。 + +### 2)典型查询模式 + +过滤后的 Top-N 重排: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) +LIMIT 20; +``` + +如果使用内积,相应地按降序排序: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY inner_product_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) DESC +LIMIT 20; +``` + +也支持 range search: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 + AND l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... 
*/]) < 5.0 +ORDER BY photo_id; +``` + +`pq_on_disk` 最重要的使用特征,就是它本身就是为“带过滤条件的向量重排”设计的,这一点和 `ivf_on_disk` 有明显区别。 + +### 3)BE 缓存配置 + +`pq_on_disk` 使用专用的 PQ chunk cache: + +- `ann_index_pq_chunk_cache_limit`(默认:`60%`) +- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`) + +其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。 + +## 参数与约束 + +### 索引参数 + +| 属性 | 是否必填 | 默认值 | 说明 | +|---|---|---|---| +| `index_type` | 是 | - | 必须为 `pq_on_disk`。 | +| `metric_type` | 是 | - | `l2_distance` 或 `inner_product`。 | +| `dim` | 是 | - | 向量维度。 | +| `pq_m` | 是 | - | PQ 子量化器数量,必须整除 `dim`。 | +| `pq_nbits` | 否 | `8` | 每个子量化器编码使用的 bit 数。 | + +### 训练行为 + +`pq_on_disk` 需要足够的数据来训练 PQ codebook。最小训练行数公式为: + +```text +(1 << pq_nbits) * 100 +``` + +例如: + +- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据。 +- `pq_nbits = 4` 时,至少需要 `1600` 行训练数据。 + +如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退到暴力搜索。 + +## 可观测性 + +`pq_on_disk` 引入了专用的 BE 缓存 `AnnIndexPqChunkCache`。 + +排查性能问题时,建议优先确认两件事: + +- 查询是否真的具有足够高的过滤选择性。 +- PQ chunk cache 是否足够大,能够避免热点候选区间被重复从磁盘读取。 + +## 使用说明 + +- `pq_on_disk` 更适合“先过滤,再向量重排”的场景,不适合替代全局 ANN 召回索引。 +- 它与 Doris 现有 ANN 索引共享通用约束,例如向量列类型和 ANN 表达式的使用方式。 +- 它支持 `l2_distance` 和 `inner_product` 两种度量,也支持 Top-N 与 range search 风格的查询。 +- 查询结果的排序方向需要与度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。 +- 数据局部性很重要。如果相同过滤键对应的行在物理上更连续,`pq_on_disk` 读取 PQ codes 时就更容易形成顺序 I/O。 +- 对于非常小的 segment 或训练数据不足的 segment,索引可能不会被真正构建,查询会回退到暴力搜索。 + +## 最佳实践 + +1. 当查询模式主要是“先过滤,后重排”时,优先考虑 `pq_on_disk`。 +2. 让过滤列尽可能具有较高选择性。过滤后的候选集越小,`pq_on_disk` 越能发挥优势。 +3. 选择 `pq_m` 时,先确保 `dim / pq_m` 合理,并尽量与现有 PQ 经验保持一致。 +4. 除非明确需要用更小 code size 换取更低精度,否则建议先从 `pq_nbits = 8` 开始。 +5. 联合观察缓存效果和查询延迟。如果同类过滤查询仍然频繁触发磁盘 I/O,可以提高 `ann_index_pq_chunk_cache_limit` 后重新测试。 +6. 
在正式上线前,务必基于真实业务数据验证召回质量,尤其要关注真实过滤分布下的效果。 + +## 如何在 `ivf_on_disk` 和 `pq_on_disk` 之间选择 + +以下场景更适合 `ivf_on_disk`: + +- 需要在大规模全局向量集合上做 ANN 搜索。 +- 主要调优模型仍然是 `nlist` 和 `nprobe`。 +- 查询性能依赖于 IVF list 的探测与召回。 + +以下场景更适合 `pq_on_disk`: + +- 查询本身已经带有高选择性的标量过滤条件。 +- 过滤后的候选集规模相对较小。 +- 主要需求是在过滤后的候选行中做快速近似重排,而不是做全局 ANN 召回。 + +可以简单理解为:`ivf_on_disk` 是磁盘化的全局 ANN 索引,而 `pq_on_disk` 是磁盘化的过滤后近似重排索引。 diff --git a/sidebars.ts b/sidebars.ts index 5f05731a615d4..239a0b7562e2a 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -359,6 +359,7 @@ const sidebars: SidebarsConfig = { 'ai/vector-search/hnsw', 'ai/vector-search/ivf', 'ai/vector-search/ivf-on-disk', + 'ai/vector-search/pq-on-disk', 'ai/vector-search/index-management', 'ai/vector-search/resource-estimation', 'ai/vector-search/quantization-survey', From d0fcd8262dd7950646a95905a4aec2bf791fb096 Mon Sep 17 00:00:00 2001 From: zhiqiang-hhhh Date: Thu, 23 Apr 2026 15:37:25 +0800 Subject: [PATCH 2/2] DDD --- docs/ai/vector-search/overview.md | 10 +- docs/ai/vector-search/pq-on-disk.md | 215 +++++++++---- .../current/ai/vector-search/overview.md | 10 +- .../current/ai/vector-search/pq-on-disk.md | 236 ++++++++++----- .../version-4.x/ai/vector-search/overview.md | 10 +- .../ai/vector-search/pq-on-disk.md | 286 ++++++++++++++++++ .../version-4.x/ai/vector-search/overview.md | 10 +- .../ai/vector-search/pq-on-disk.md | 286 ++++++++++++++++++ versioned_sidebars/version-4.x-sidebars.json | 2 + 9 files changed, 907 insertions(+), 158 deletions(-) create mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md create mode 100644 versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md index d04e99932cd05..8dbec97274113 100644 --- a/docs/ai/vector-search/overview.md +++ b/docs/ai/vector-search/overview.md @@ -58,22 +58,22 @@ PROPERTIES ( ); ``` -- index_type: `hnsw` (for [Hierarchical Navigable Small 
World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), or `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache) +- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (for filter-first reranking accelerated by PQ-encoded vectors stored on disk) - metric_type: `l2_distance` means using L2 distance as the distance function - dim: `128` means the vector dimension is 128 - quantizer: `flat` means each vector dimension is stored as original float32 | Parameter | Required | Supported/Options | Default | Description | |-----------|----------|-------------------|---------|-------------| -| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, and IVF On-Disk. | +| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for selective filter-first reranking. | | `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 = Euclidean; inner_product can approximate cosine if vectors are normalized. | | `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All imported vectors must match or an error is raised. | | `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values may improve recall/speed trade-offs but increase build overhead. | | `max_degree` | No | Positive integer | `32` | HNSW M (max neighbors per node). Affects index memory and search performance. | | `ef_construction` | No | Positive integer | `40` | HNSW efConstruction (candidate queue size during build). 
Larger gives better quality but slower build. | | `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw; `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization to reduce memory. | -| `pq_m` | Required when 'quantizer=pq' | Positive integer | (none) | Specifies how many subvectors are used (vector dimension dim must be divisible by pq_m). | -| `pq_nbits` | Required when 'quantizer=pq' | Positive integer | (none) | The number of bits used to represent each subvector, in faiss pq_nbits is generally required to be no greater than 24. | +| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors. The vector dimension `dim` must be divisible by `pq_m`. | +| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits used to represent each subvector. In Faiss, `pq_nbits` is generally required to be no greater than 24. | ## If You Need Cosine Similarity @@ -313,6 +313,8 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size Quantization introduces extra build-time overhead because each distance computation must decode quantized values. For 128-D vectors, build time increases with row count; SQ vs. FLAT can be up to ~10× slower to build. +For workloads dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Unlike global ANN structures such as HNSW or IVF, `pq_on_disk` is designed to accelerate vector reranking inside the filtered subset by using PQ-encoded vectors stored on disk. This makes it especially useful for multi-tenant vector search, where global ANN structures built on mixed-tenant segments may suffer recall degradation after tenant filtering. 
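A quick expectation argument illustrates the recall problem under tenant filtering. It assumes tenants of uniform size, which is a simplification; the numbers below are hypothetical, not measurements.

```python
# If one segment mixes many tenants, a global ANN Top-K candidate list contains
# very few rows of any single small tenant on average, so recall after the
# tenant filter collapses unless K is made very large.
tenants = 10_000                # tenants mixed in one segment (hypothetical)
global_top_k = 100              # candidates returned by a global ANN search
expected_tenant_hits = global_top_k / tenants
print(expected_tenant_hits)     # ~0.01 candidates survive `tenant_id = ?`
```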
+ Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided: - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m). diff --git a/docs/ai/vector-search/pq-on-disk.md b/docs/ai/vector-search/pq-on-disk.md index 4c823f82de13d..1c6cd204a6379 100644 --- a/docs/ai/vector-search/pq-on-disk.md +++ b/docs/ai/vector-search/pq-on-disk.md @@ -2,7 +2,7 @@ { "title": "PQ On-Disk", "language": "en", - "description": "PQ On-Disk is an ANN index mode in Apache Doris for reranking small post-filter candidate sets, storing PQ codes on disk and using a dedicated chunk cache to reduce memory usage." + "description": "PQ On-Disk is a disk-backed vector reranking mode in Apache Doris. It is designed for selective filter-first workloads such as multi-tenant vector search, and uses PQ-encoded vectors to accelerate brute-force distance evaluation on filtered rows." } --- @@ -27,35 +27,110 @@ under the License. # PQ On-Disk in Apache Doris -`pq_on_disk` is an ANN index type in Doris designed for reranking small candidate sets after scalar filtering. It stores PQ codes on disk in rowid order, keeps only the PQ codebook in memory, and computes approximate distances only for rows that have already passed the filter. +`pq_on_disk` is a vector index mode in Apache Doris for **filter-first vector search**. It stores Product Quantization (PQ) codes on disk, keeps only the PQ codebook and hot chunks in memory, and uses the compressed vectors to accelerate brute-force-style distance evaluation on rows that have already passed scalar filtering. -Compared with `ivf` and `ivf_on_disk`, `pq_on_disk` is not a global ANN recall structure. It is designed for queries such as `WHERE user_id = ? ORDER BY l2_distance_approximate(...) 
LIMIT N`, where the filter first narrows the search scope to a relatively small candidate set and the vector index is then used for fast approximate reranking.
+This feature is especially useful in **multi-tenant vector search**. In many SaaS-style workloads, vectors from many tenants are stored together in the same segment. If you build a global `hnsw` or `ivf` index on that mixed data and then query with predicates such as `WHERE tenant_id = ?`, the ANN recall can degrade significantly because the global recall structure was built across all tenants rather than for one tenant's local subset. `pq_on_disk` avoids this problem by not depending on a global cross-tenant recall structure. Instead, Doris first applies the tenant filter, then uses PQ codes to accelerate vector scoring inside the filtered subset.
 
-## Why PQ On-Disk
+## When to Use PQ On-Disk
 
-Some vector-search workloads do not need ANN to search the whole segment. Instead, they first use ordinary predicates such as `user_id`, `tag`, or other inverted-index filters to reduce the candidate set, and only then need fast Top-N vector ranking inside that filtered subset.
+Use `pq_on_disk` when your query pattern is usually:
 
-`pq_on_disk` is designed for this operating point:
+```sql
+WHERE <highly selective scalar filter>
+ORDER BY l2_distance_approximate(...) LIMIT N
+```
+
+Typical examples include:
+
+- `WHERE tenant_id = ?`
+- `WHERE user_id = ?`
+- `WHERE category_id = ? AND status = 'active'`
+- `WHERE tag MATCH_ANY '...'
+  ORDER BY l2_distance_approximate(...) LIMIT N`
+
+This is a different operating point from global ANN search:
+
+- `hnsw` and `ivf` are designed for **global ANN recall** across a large vector collection.
+- `ivf_on_disk` keeps the IVF recall model but moves the main IVF data to disk to reduce memory pressure.
+- `pq_on_disk` is designed for **filtered-subset reranking**, where the candidate set is already narrowed down by ordinary predicates and Doris needs a faster way to score those rows.
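A rough scan-cost comparison shows why PQ codes help at this operating point. The byte counts are illustrative only; cache effects and index metadata are ignored, and `filtered_rows` is a hypothetical post-filter candidate-set size.

```python
# Bytes scanned per query for filtered brute force vs. PQ-accelerated scoring,
# using the example parameters dim=768 (float32) and pq_m=96 at 8 bits per code.
dim, pq_m = 768, 96
filtered_rows = 20_000

raw_scan_bytes = filtered_rows * dim * 4   # brute force over float32 vectors
pq_scan_bytes = filtered_rows * pq_m       # PQ codes, one byte per subquantizer

print(raw_scan_bytes)  # 61440000 (~61 MB)
print(pq_scan_bytes)   # 1920000  (~2 MB)
```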
+ +## Why It Helps in Multi-Tenant Search + +Suppose a segment contains vectors from 10,000 tenants. A global HNSW or IVF index is built over all rows in the segment. If the query is: + +```sql +SELECT doc_id +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY l2_distance_approximate(embedding, ) +LIMIT 20; +``` + +The query only cares about one tenant's rows, but the global ANN structure was trained or connected using vectors from all tenants. The nearest paths, graph edges, or IVF partitions that are good for global recall are not necessarily good for recall **after tenant filtering**. + +`pq_on_disk` addresses this case differently: + +1. Doris first applies the scalar predicate such as `tenant_id = 10001`. +2. It obtains a filtered candidate set for that tenant. +3. Instead of computing full float32 brute-force distances on every filtered row, Doris uses PQ-encoded vectors to evaluate distances much faster. +4. PQ code data is read from disk in rowid order and reused through a dedicated chunk cache. + +As a result, `pq_on_disk` is often a better fit than global ANN structures when: + +- the filter is highly selective, +- recall under post-filter/global ANN is unstable, +- and full brute-force over raw vectors is still too expensive. + +## Quick Start + +### Create a table + +The following example uses `tenant_id` as the main filter column: + +```sql +CREATE TABLE tenant_embeddings ( + tenant_id BIGINT NOT NULL, + doc_id BIGINT NOT NULL, + embedding ARRAY NOT NULL, + INDEX idx_embedding (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(tenant_id, doc_id) +DISTRIBUTED BY HASH(tenant_id) BUCKETS 8 +PROPERTIES ( + "replication_num" = "1" +); +``` + +### Basic query -- Work on filtered candidate sets, typically thousands to tens of thousands of rows. -- Keep memory footprint low by storing PQ codes on disk. 
-- Reuse standard SQL distance functions and ANN DDL. -- Avoid the overhead of maintaining a global IVF or graph structure when the candidate set is already known. +```sql +SELECT doc_id, + l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY score ASC +LIMIT 20; +``` -## Scope and User Value +This query pattern is the primary target of `pq_on_disk`: filter first, then do fast vector Top-N inside the filtered rows. -Compared with other ANN index types in Doris, `pq_on_disk` focuses on a different problem: +## How PQ On-Disk Works -- `hnsw` and `ivf` are optimized for global ANN retrieval across large vector collections. -- `ivf_on_disk` keeps the IVF recall model but moves IVF lists to disk to save memory. -- `pq_on_disk` is optimized for post-filter approximate reranking on small candidate sets. +At a high level: -This makes it useful when: +1. Doris trains a PQ codebook for the segment. +2. Raw vectors are encoded into compact PQ codes. +3. PQ codes are stored on disk in rowid order. +4. At query time, Doris first evaluates ordinary predicates. +5. For rows that survive filtering, Doris loads the corresponding PQ chunks and computes approximate distances using PQ codes instead of full raw vectors. -- The query almost always includes a highly selective scalar filter. -- Rows for the same filter key have good locality. -- Full brute-force distance evaluation on the filtered rows is still too expensive. -- You want lower steady-state memory usage than an in-memory ANN structure. +So `pq_on_disk` is best understood as **PQ-accelerated filtered brute-force**, rather than a global ANN recall structure like HNSW or IVF. ## User-Facing Interfaces @@ -81,24 +156,25 @@ DISTRIBUTED BY HASH(user_id) BUCKETS 8 PROPERTIES ("replication_num" = "1"); ``` -Notes: - -- `metric_type` supports `l2_distance` and `inner_product`. -- `dim` is required. -- `pq_m` is required. 
-- `dim` must be divisible by `pq_m`. -- `pq_nbits` is optional and defaults to `8`. -- Query syntax remains the same: `l2_distance_approximate` and `inner_product_approximate`. - ### 2) Typical query patterns -Top-N reranking after filtering: +Top-N after filtering: ```sql SELECT photo_id FROM image_pool WHERE user_id = 10001 -ORDER BY l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) +ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC +LIMIT 20; +``` + +Prepared-statement style query: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = ? +ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY)) ASC LIMIT 20; ``` @@ -108,7 +184,7 @@ For inner-product search, sort in descending order: SELECT photo_id FROM image_pool WHERE user_id = 10001 -ORDER BY inner_product_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) DESC +ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC LIMIT 20; ``` @@ -118,21 +194,10 @@ Range search is also supported: SELECT photo_id FROM image_pool WHERE user_id = 10001 - AND l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) < 5.0 + AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0 ORDER BY photo_id; ``` -The most important usage characteristic is that `pq_on_disk` is intended to work with filters. This is the main scenario where it differs from `ivf_on_disk`. - -### 3) BE cache configuration - -`pq_on_disk` uses a dedicated chunk cache for PQ code data: - -- `ann_index_pq_chunk_cache_limit` (default: `60%`) -- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`) - -The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory. 
- ## Parameters and Constraints ### Index parameters @@ -160,42 +225,62 @@ Examples: If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment. +## BE Cache Configuration + +`pq_on_disk` uses a dedicated chunk cache for PQ code data: + +- `ann_index_pq_chunk_cache_limit` (default: `60%`) +- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`) + +The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory. + ## Observability `pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`. -When troubleshooting, first check whether queries are actually selective enough and whether the PQ chunk cache is large enough to avoid repeated disk reads on hot candidate ranges. +When troubleshooting, check the following first: + +- Whether the query is actually selective enough. +- Whether the filtered rows have good locality. +- Whether the PQ chunk cache is large enough to avoid repeated disk reads. +- Whether some segments are falling back to brute force because they do not have enough rows for PQ training. ## Usage Notes -- `pq_on_disk` is best suited for selective filter + vector reranking, not global ANN recall. -- It shares the common ANN table constraints in Doris, such as vector column type and ANN expression usage. -- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style predicates. -- Query result ordering follows the metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order. -- Data locality matters. It works best when rows belonging to the same filter key are physically close, so PQ code reads are more sequential. -- For very small segments or very small training sets, the index may not be built and the query can fall back to brute force. 
+- `pq_on_disk` is intended for **filter-first** workloads, not for global ANN recall across the whole segment. +- It is particularly suitable for **multi-tenant vector search** where rows from many tenants are mixed in the same segment. +- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style queries. +- Query result ordering must match metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order. +- Data locality matters. It works best when rows for the same filter key are physically close so PQ chunk reads are more sequential. +- For very small segments or insufficient training data, Doris may not build the PQ index and can fall back to brute force. ## Best Practices -1. Choose `pq_on_disk` when the query pattern is usually `filter first, rerank second`. -2. Keep the filter column selective. The smaller the post-filter candidate set, the more suitable `pq_on_disk` becomes. -3. Choose `pq_m` so that `dim / pq_m` is reasonable and easy to manage. A common starting point is to align `pq_m` with the dimensional decomposition you already use in other PQ-based systems. -4. Start with `pq_nbits = 8` unless you have strong reasons to trade recall for smaller code size. -5. Watch cache effectiveness and latency together. If repeated filtered queries are still I/O-heavy, increase `ann_index_pq_chunk_cache_limit` and retest. -6. Validate on real business data before production rollout, especially for recall quality under your actual filter distribution. +1. Choose `pq_on_disk` when the query pattern is usually **filter first, rerank second**. +2. Prefer it for **tenant-aware retrieval** such as `WHERE tenant_id = ? ORDER BY ... LIMIT N`. +3. Keep the filter column selective. The smaller the filtered candidate set, the more suitable `pq_on_disk` becomes. +4. Start with `pq_nbits = 8` unless you intentionally want a smaller code size at the cost of recall. +5. 
Choose `pq_m` so that `dim / pq_m` is reasonable for your model dimension and business recall target. +6. Use prepared statements for 768-D and higher query vectors to reduce SQL parsing overhead. +7. Validate on real business distributions, especially when tenant sizes are very uneven. + +## How to Choose Between `hnsw`, `ivf_on_disk`, and `pq_on_disk` + +Use `hnsw` when: -## How to Choose Between `ivf_on_disk` and `pq_on_disk` +- You need high-recall global ANN search. +- Query latency is the top priority and enough memory is available. Use `ivf_on_disk` when: -- You need ANN to search across a large global vector collection. -- Your main tuning model is still `nlist` and `nprobe`. -- Query performance depends on probing a subset of IVF lists. +- You still need a global IVF-style ANN recall model. +- Memory is limited, but the query still searches a large global vector collection. Use `pq_on_disk` when: -- The query already has a selective scalar filter. -- The candidate set after filtering is relatively small. -- You mainly need fast approximate reranking within filtered rows rather than global ANN recall. +- The query already has a highly selective scalar filter. +- Rows from different tenants or users are mixed in the same segment. +- Global ANN recall under tenant/user filtering is poor. +- You want to accelerate filtered brute-force scoring with compressed vectors. -In short, `ivf_on_disk` is a disk-backed global ANN index, while `pq_on_disk` is a disk-backed post-filter reranking index. +In short, `pq_on_disk` is not a replacement for all ANN structures. It is the right choice when the main problem is **efficient vector reranking inside a filtered subset**, especially in multi-tenant workloads. 
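To make the `pq_m` / `pq_nbits` trade-off above concrete, here is a rough back-of-envelope sketch in plain Python (independent of Doris), using the example index settings from this page (`dim = 768`, `pq_m = 96`, `pq_nbits = 8`) and the training-row formula documented above:

```python
def pq_code_bytes(pq_m: int, pq_nbits: int) -> float:
    # Each vector is encoded as pq_m sub-codes of pq_nbits bits.
    return pq_m * pq_nbits / 8

def min_training_rows(pq_nbits: int) -> int:
    # Minimum rows needed to train the PQ codebook: (1 << pq_nbits) * 100.
    return (1 << pq_nbits) * 100

dim, pq_m, pq_nbits = 768, 96, 8
raw_bytes = dim * 4                         # raw float32 storage per vector
code_bytes = pq_code_bytes(pq_m, pq_nbits)

print(raw_bytes)                            # 3072
print(code_bytes)                           # 96.0
print(raw_bytes / code_bytes)               # 32.0 (compression ratio)
print(min_training_rows(8))                 # 25600
print(min_training_rows(4))                 # 1600
```

With these settings each PQ-encoded vector occupies 96 bytes instead of 3072, which is why on-disk PQ codes served through a chunk cache can stay much cheaper than raw vectors or an in-memory ANN structure.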
\ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md index de02c0eae3323..604e91aed1c60 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md @@ -49,7 +49,7 @@ PROPERTIES ( "replication_num" = "1" ); ``` -- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)或 `ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF) +- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)、`ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF)或 `pq_on_disk`(将 PQ 编码向量落盘、用于过滤后向量重排加速) - metric_type: l2_distance 表示使用 L2 距离作为距离函数 - dim: 128 表示向量维度为 128 - quantizer: flat 表示按原始 float32 存储各维度 @@ -57,15 +57,15 @@ PROPERTIES ( | 参数 | 是否必填 | 支持/可选值 | 默认值 | 说明 | |------|----------|-------------|--------|------| -| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF 和 IVF On-Disk。 | +| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk`、`pq_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF、IVF On-Disk,以及面向高选择性过滤后重排的 PQ On-Disk。 | | `metric_type` | 是 | `l2_distance`,`inner_product` | (无) | 指定向量相似度/距离度量方式。L2 为欧氏距离,inner_product 可用于余弦相似时需先归一化向量。 | | `dim` | 是 | 正整数 (> 0) | (无) | 指定向量维度,后续导入的所有向量的维度必须与此一致,否则报错。 | | `nlist` | 否 | 正整数 | `1024` | IVF 的倒排桶数量。在 `index_type=ivf` 或 `index_type=ivf_on_disk` 时生效;取值越大通常有助于召回率/速度权衡,但会增加构建开销。 | | `max_degree` | 否 | 正整数 | `32` | HNSW 图中单个节点的最大邻居数(M),影响索引内存与搜索性能。 | | `ef_construction` | 否 | 正整数 | `40` | HNSW 构建阶段的候选队列大小(efConstruction),越大构图质量越好但构建更慢。 | | `quantizer` | 否 | `flat`,`sq8`,`sq4`, `pq` | `flat` | 指定向量编码/量化方式:`flat` 为原始存储,`sq8`/`sq4` 为标量量化(8/4 bit), `pq` 为乘积量化。 | -| `pq_m` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 
指定将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 | -| `pq_nbits` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。 | +| `pq_m` | `quantizer=pq` 或 `index_type=pq_on_disk` 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量,向量维度 `dim` 必须能被 `pq_m` 整除。 | +| `pq_nbits` | `quantizer=pq` 时需要指定;`index_type=pq_on_disk` 时可选 | 正整数 | `pq_on_disk` 默认 `8` | 指定每个子向量量化的比特数。它决定了每个子空间码本的大小(k = 2 ^ pq_nbits),在 Faiss 中一般要求不大于 24。 | ## 如果业务需要使用 Cosine 相似度 @@ -293,6 +293,8 @@ PROPERTIES ( 量化会带来额外构建开销,原因是构建阶段需要大量距离计算,且每次计算需对量化值解码。以 128 维向量为例,随行数增长构建时间上升,SQ 相比 FLAT 可能引入约 10 倍构建成本。 +对于以 `tenant_id = ?`、`user_id = ?` 等高选择性过滤为主的查询,Doris 还提供了 [`pq_on_disk`](./pq-on-disk.md)。它不像 HNSW / IVF 那样构建面向全局召回的结构,而是通过磁盘上的 PQ 编码向量,加速过滤后候选集上的向量重排。这使它在多租户向量检索场景下尤其有价值:当一个 segment 中混合了多个租户的数据时,全局 ANN 结构在指定租户后可能召回下降,而 `pq_on_disk` 更适合这种“先过滤、后重排”的模式。 + 类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数: - `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md index a414f344c6e0b..86f31062b43d9 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md @@ -2,7 +2,7 @@ { "title": "PQ On-Disk", "language": "zh-CN", - "description": "PQ On-Disk 是 Apache Doris 面向过滤后小候选集向量重排场景提供的 ANN 索引形态,通过将 PQ codes 存储在磁盘并配合专用 chunk cache,在低内存占用下实现更高效的近似重排。" + "description": "PQ On-Disk 是 Apache Doris 面向过滤优先向量检索场景提供的向量索引形态,特别适用于多租户检索:通过在过滤后的候选集上使用 PQ 编码向量加速暴力距离计算,在较低内存占用下获得更稳定的效果。" } --- @@ -27,41 +27,116 @@ under the License. 
# Apache Doris 中的 PQ On-Disk -`pq_on_disk` 是 Doris 面向过滤后小候选集重排场景提供的 ANN 索引类型。它将 PQ codes 按 rowid 顺序存储在磁盘上,只将 PQ codebook 常驻内存,并仅对已经通过标量过滤的候选行计算近似距离。 +`pq_on_disk` 是 Apache Doris 面向**过滤优先(filter-first)向量检索**场景提供的一种向量索引模式。它将 Product Quantization(PQ)编码后的向量存储在磁盘上,仅将 PQ codebook 和热点 chunk 保留在内存中,并在标量过滤完成后,利用压缩后的向量加速过滤结果上的暴力距离计算。 -与 `ivf`、`ivf_on_disk` 不同,`pq_on_disk` 不是一个面向全局召回的 ANN 结构。它更适合这类查询:`WHERE user_id = ? ORDER BY l2_distance_approximate(...) LIMIT N`。也就是先用过滤条件把候选集缩小,再对这个较小的候选集做快速近似向量重排。 +这个特性尤其适合**多租户向量检索**。在很多 SaaS 类业务中,不同租户的向量会被写入同一个 segment。如果直接在这些混合数据上构建全局 `hnsw` 或 `ivf` 索引,再执行 `WHERE tenant_id = ?` 这类查询,召回率往往会明显下降,因为全局召回结构是基于所有租户的混合数据构建的,而不是针对某一个租户的局部子集。`pq_on_disk` 不依赖这样的全局跨租户召回结构,而是先按租户过滤,再在过滤后的子集上通过 PQ 编码向量加速排序,因此更适合多租户场景。 -## 为什么需要 PQ On-Disk +## 适用场景 -有些向量检索场景并不需要 ANN 在整个 segment 上做全局搜索,而是先通过 `user_id`、`tag` 或倒排索引等普通过滤条件把候选行缩小到较小范围,然后才需要在这个过滤后的子集内做 Top-N 向量排序。 +当查询模式通常是下面这样时,优先考虑 `pq_on_disk`: -`pq_on_disk` 就是为这种工作模式设计的: +```sql +WHERE <高选择性过滤条件> +ORDER BY l2_distance_approximate(...) LIMIT N +``` + +常见例子包括: + +- `WHERE tenant_id = ?` +- `WHERE user_id = ?` +- `WHERE category_id = ? AND status = 'active'` +- `WHERE tag MATCH_ANY '...' + ORDER BY l2_distance_approximate(...) LIMIT N` + +这和全局 ANN 的目标不同: + +- `hnsw` 和 `ivf` 更适合在大规模向量集合上做**全局 ANN 召回**。 +- `ivf_on_disk` 仍然保留 IVF 的全局召回模型,只是将主要索引数据落盘以降低内存压力。 +- `pq_on_disk` 聚焦的是**过滤后子集上的向量重排**,即候选集已经被普通谓词显著缩小,Doris 只需要更快地对这些候选行做向量打分。 + +## 为什么它适合多租户检索 + +假设一个 segment 中混合存储了 10,000 个租户的向量。如果在这些数据上构建全局 HNSW 或 IVF 索引,而查询是: + +```sql +SELECT doc_id +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY l2_distance_approximate(embedding, ) +LIMIT 20; +``` + +这个查询只关心某一个租户的数据,但全局 ANN 结构的训练、聚类或图连接都基于所有租户的混合向量。对于“全局召回”有效的图路径、邻接关系或 IVF 分桶,并不一定适合“租户过滤之后”的局部召回,因此很容易出现指定租户后召回率下降的问题。 + +`pq_on_disk` 的处理方式不同: + +1. Doris 先执行 `tenant_id = 10001` 这样的标量过滤。 +2. 得到该租户对应的候选集。 +3. 不再依赖全局 ANN 结构在这个子集内做召回,而是使用 PQ 编码向量更快地计算这些候选行的距离。 +4. 
PQ code 按 rowid 顺序存储在磁盘,并通过专用 chunk cache 做复用。 + +因此,当满足以下条件时,`pq_on_disk` 往往比全局 ANN 结构更合适: -- 面向过滤后的候选集,典型规模是几千到几万行。 -- 通过将 PQ codes 存储在磁盘上,降低常驻内存占用。 -- 继续复用 Doris 现有的 SQL 距离函数和 ANN DDL。 -- 当候选集已经比较明确时,避免维护全局 IVF 或图结构带来的额外开销。 +- 过滤条件具有高选择性; +- 租户过滤后的召回稳定性比全局 ANN 更重要; +- 原始 float32 向量上的暴力距离计算仍然代价较高。 -## 对用户的价值 +## 快速开始 -与 Doris 中其他 ANN 索引相比,`pq_on_disk` 解决的是另一类问题: +### 建表 -- `hnsw` 和 `ivf` 更适合在大规模向量集合上做全局 ANN 召回。 -- `ivf_on_disk` 保留 IVF 的召回模型,只是把 IVF list 主体放到磁盘以节省内存。 -- `pq_on_disk` 则聚焦在过滤后小候选集上的近似重排。 +下面的例子使用 `tenant_id` 作为主过滤列: + +```sql +CREATE TABLE tenant_embeddings ( + tenant_id BIGINT NOT NULL, + doc_id BIGINT NOT NULL, + embedding ARRAY NOT NULL, + INDEX idx_embedding (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(tenant_id, doc_id) +DISTRIBUTED BY HASH(tenant_id) BUCKETS 8 +PROPERTIES ( + "replication_num" = "1" +); +``` + +### 基础查询 + +```sql +SELECT doc_id, + l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY score ASC +LIMIT 20; +``` -它适合以下场景: +这正是 `pq_on_disk` 的核心目标场景:先过滤,再在过滤后的结果里执行高效的 Top-N 向量排序。 -- 查询几乎总是带有高选择性的标量过滤条件。 -- 相同过滤键对应的行具有较好的物理局部性。 -- 即使候选集已经被过滤缩小,暴力计算距离仍然开销较大。 -- 希望比内存型 ANN 结构有更低的常驻内存占用。 +## 工作原理 + +从实现角度看,`pq_on_disk` 的执行过程大致如下: + +1. Doris 为 segment 训练 PQ codebook。 +2. 原始向量被编码为紧凑的 PQ codes。 +3. PQ codes 按 rowid 顺序写入磁盘。 +4. 查询时,Doris 先计算普通谓词过滤。 +5. 
对于通过过滤的行,再加载对应的 PQ chunk,并基于 PQ code 计算近似距离,而不是直接对原始 float32 向量做全量暴力计算。 + +因此,`pq_on_disk` 更适合被理解为**基于 PQ 的过滤后暴力计算加速**,而不是像 HNSW / IVF 那样的全局召回结构。 ## 用户接口 -### 1)建索引 DDL +### 1)索引 DDL -通过 `index_type="pq_on_disk"` 创建 ANN 索引。 +通过 `index_type="pq_on_disk"` 创建索引: ```sql CREATE TABLE image_pool ( @@ -81,34 +156,35 @@ DISTRIBUTED BY HASH(user_id) BUCKETS 8 PROPERTIES ("replication_num" = "1"); ``` -说明: - -- `metric_type` 支持 `l2_distance` 和 `inner_product`。 -- `dim` 为必填参数。 -- `pq_m` 为必填参数。 -- `dim` 必须能够被 `pq_m` 整除。 -- `pq_nbits` 为可选参数,默认值为 `8`。 -- 查询语法保持不变,仍使用 `l2_distance_approximate` 和 `inner_product_approximate`。 - ### 2)典型查询模式 -过滤后的 Top-N 重排: +过滤后的 Top-N: ```sql SELECT photo_id FROM image_pool WHERE user_id = 10001 -ORDER BY l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) +ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC +LIMIT 20; +``` + +Prepared Statement 风格查询: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = ? +ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY)) ASC LIMIT 20; ``` -如果使用内积,相应地按降序排序: +如果使用内积,则按降序排序: ```sql SELECT photo_id FROM image_pool WHERE user_id = 10001 -ORDER BY inner_product_approximate(embedding, [0.12, 0.44, 0.33 /* ... */]) DESC +ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC LIMIT 20; ``` @@ -118,21 +194,10 @@ LIMIT 20; SELECT photo_id FROM image_pool WHERE user_id = 10001 - AND l2_distance_approximate(embedding, [0.12, 0.44, 0.33 /* ... 
*/]) < 5.0 + AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0 ORDER BY photo_id; ``` -`pq_on_disk` 最重要的使用特征,就是它本身就是为“带过滤条件的向量重排”设计的,这一点和 `ivf_on_disk` 有明显区别。 - -### 3)BE 缓存配置 - -`pq_on_disk` 使用专用的 PQ chunk cache: - -- `ann_index_pq_chunk_cache_limit`(默认:`60%`) -- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`) - -其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。 - ## 参数与约束 ### 索引参数 @@ -142,12 +207,12 @@ ORDER BY photo_id; | `index_type` | 是 | - | 必须为 `pq_on_disk`。 | | `metric_type` | 是 | - | `l2_distance` 或 `inner_product`。 | | `dim` | 是 | - | 向量维度。 | -| `pq_m` | 是 | - | PQ 子量化器数量,必须整除 `dim`。 | -| `pq_nbits` | 否 | `8` | 每个子量化器编码使用的 bit 数。 | +| `pq_m` | 是 | - | PQ 子量化器数量,必须能整除 `dim`。 | +| `pq_nbits` | 否 | `8` | 每个子量化编码的 bit 数。 | -### 训练行为 +### 训练要求 -`pq_on_disk` 需要足够的数据来训练 PQ codebook。最小训练行数公式为: +`pq_on_disk` 需要足够的行数来训练 PQ codebook。最小训练行数为: ```text (1 << pq_nbits) * 100 @@ -155,50 +220,67 @@ ORDER BY photo_id; 例如: -- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据。 +- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据; - `pq_nbits = 4` 时,至少需要 `1600` 行训练数据。 -如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退到暴力搜索。 +如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退为暴力搜索。 + +## BE 缓存配置 + +`pq_on_disk` 使用专用的 PQ chunk cache: + +- `ann_index_pq_chunk_cache_limit`(默认:`60%`) +- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`) + +其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。 ## 可观测性 -`pq_on_disk` 引入了专用的 BE 缓存 `AnnIndexPqChunkCache`。 +`pq_on_disk` 引入了专用 BE 缓存 `AnnIndexPqChunkCache`。 -排查性能问题时,建议优先确认两件事: +排查问题时,建议优先关注: -- 查询是否真的具有足够高的过滤选择性。 -- PQ chunk cache 是否足够大,能够避免热点候选区间被重复从磁盘读取。 +- 查询是否真的足够高选择性; +- 过滤后的行是否具备较好的物理局部性; +- PQ chunk cache 是否足够大,是否频繁发生重复磁盘读取; +- 某些 segment 是否因为训练数据不足而回退为暴力搜索。 ## 使用说明 -- `pq_on_disk` 更适合“先过滤,再向量重排”的场景,不适合替代全局 ANN 召回索引。 -- 它与 Doris 现有 ANN 索引共享通用约束,例如向量列类型和 ANN 表达式的使用方式。 -- 它支持 `l2_distance` 和 `inner_product` 两种度量,也支持 Top-N 与 range search 风格的查询。 
-- 查询结果的排序方向需要与度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。 -- 数据局部性很重要。如果相同过滤键对应的行在物理上更连续,`pq_on_disk` 读取 PQ codes 时就更容易形成顺序 I/O。 -- 对于非常小的 segment 或训练数据不足的 segment,索引可能不会被真正构建,查询会回退到暴力搜索。 +- `pq_on_disk` 面向的是**过滤优先**的向量检索,而不是对整个 segment 做全局 ANN 召回。 +- 它尤其适合**多租户向量检索**,即多个租户的数据混合存储在同一个 segment 中的场景。 +- 它同时支持 `l2_distance` 和 `inner_product`,也支持 Top-N 与 range search 风格的查询。 +- 查询时排序方向要和度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。 +- 数据局部性非常重要。如果同一过滤键对应的数据在物理上更连续,PQ chunk 读取就更容易形成顺序 I/O。 +- 对于非常小的 segment 或训练样本不足的 segment,Doris 可能不会真正构建 PQ 索引,而是回退为暴力搜索。 ## 最佳实践 -1. 当查询模式主要是“先过滤,后重排”时,优先考虑 `pq_on_disk`。 -2. 让过滤列尽可能具有较高选择性。过滤后的候选集越小,`pq_on_disk` 越能发挥优势。 -3. 选择 `pq_m` 时,先确保 `dim / pq_m` 合理,并尽量与现有 PQ 经验保持一致。 -4. 除非明确需要用更小 code size 换取更低精度,否则建议先从 `pq_nbits = 8` 开始。 -5. 联合观察缓存效果和查询延迟。如果同类过滤查询仍然频繁触发磁盘 I/O,可以提高 `ann_index_pq_chunk_cache_limit` 后重新测试。 -6. 在正式上线前,务必基于真实业务数据验证召回质量,尤其要关注真实过滤分布下的效果。 +1. 当主要查询模式是**先过滤,后重排**时,优先考虑 `pq_on_disk`。 +2. 对于 `WHERE tenant_id = ? ORDER BY ... LIMIT N` 这类**租户级检索**,优先评估 `pq_on_disk`。 +3. 让过滤列尽可能保持高选择性。过滤后候选集越小,`pq_on_disk` 越能发挥优势。 +4. 除非明确要用更小 code size 换取更低存储,否则建议从 `pq_nbits = 8` 开始。 +5. 选择 `pq_m` 时,要结合向量维度、模型特征以及实际召回目标综合评估。 +6. 对于 768 维及以上查询向量,建议使用 prepared statement,减少 SQL 解析开销。 +7. 
在上线前务必基于真实业务分布进行验证,尤其是在不同租户数据量差异较大时。 + +## 如何在 `hnsw`、`ivf_on_disk` 与 `pq_on_disk` 之间选择 + +以下场景更适合 `hnsw`: -## 如何在 `ivf_on_disk` 和 `pq_on_disk` 之间选择 +- 需要高召回的全局 ANN 搜索; +- 查询延迟最优先,且内存资源足够。 以下场景更适合 `ivf_on_disk`: -- 需要在大规模全局向量集合上做 ANN 搜索。 -- 主要调优模型仍然是 `nlist` 和 `nprobe`。 -- 查询性能依赖于 IVF list 的探测与召回。 +- 仍然需要基于 IVF 的全局 ANN 召回模型; +- 内存有限,但查询仍然面向大规模全局向量集合。 以下场景更适合 `pq_on_disk`: -- 查询本身已经带有高选择性的标量过滤条件。 -- 过滤后的候选集规模相对较小。 -- 主要需求是在过滤后的候选行中做快速近似重排,而不是做全局 ANN 召回。 +- 查询本身已经带有高选择性的标量过滤条件; +- 不同租户或不同用户的数据混合存储在同一个 segment 中; +- 指定租户或用户过滤后,全局 ANN 的召回效果不理想; +- 希望通过压缩向量来加速过滤后候选集上的暴力距离计算。 -可以简单理解为:`ivf_on_disk` 是磁盘化的全局 ANN 索引,而 `pq_on_disk` 是磁盘化的过滤后近似重排索引。 +可以简单理解为:`pq_on_disk` 并不是替代所有 ANN 结构的统一方案,而是当主要问题变成**如何在过滤后的子集内高效完成向量重排**时,尤其是在多租户场景下,更合适的选择。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md index de02c0eae3323..604e91aed1c60 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md @@ -49,7 +49,7 @@ PROPERTIES ( "replication_num" = "1" ); ``` -- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)或 `ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF) +- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)、`ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF)或 `pq_on_disk`(将 PQ 编码向量落盘、用于过滤后向量重排加速) - metric_type: l2_distance 表示使用 L2 距离作为距离函数 - dim: 128 表示向量维度为 128 - quantizer: flat 表示按原始 float32 存储各维度 @@ -57,15 +57,15 @@ PROPERTIES ( | 参数 | 是否必填 | 支持/可选值 | 默认值 | 说明 | |------|----------|-------------|--------|------| -| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF 和 IVF On-Disk。 | +| `index_type` | 是 | 
支持:`hnsw`、`ivf`、`ivf_on_disk`、`pq_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF、IVF On-Disk,以及面向高选择性过滤后重排的 PQ On-Disk。 | | `metric_type` | 是 | `l2_distance`,`inner_product` | (无) | 指定向量相似度/距离度量方式。L2 为欧氏距离,inner_product 可用于余弦相似时需先归一化向量。 | | `dim` | 是 | 正整数 (> 0) | (无) | 指定向量维度,后续导入的所有向量的维度必须与此一致,否则报错。 | | `nlist` | 否 | 正整数 | `1024` | IVF 的倒排桶数量。在 `index_type=ivf` 或 `index_type=ivf_on_disk` 时生效;取值越大通常有助于召回率/速度权衡,但会增加构建开销。 | | `max_degree` | 否 | 正整数 | `32` | HNSW 图中单个节点的最大邻居数(M),影响索引内存与搜索性能。 | | `ef_construction` | 否 | 正整数 | `40` | HNSW 构建阶段的候选队列大小(efConstruction),越大构图质量越好但构建更慢。 | | `quantizer` | 否 | `flat`,`sq8`,`sq4`, `pq` | `flat` | 指定向量编码/量化方式:`flat` 为原始存储,`sq8`/`sq4` 为标量量化(8/4 bit), `pq` 为乘积量化。 | -| `pq_m` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 | -| `pq_nbits` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。 | +| `pq_m` | `quantizer=pq` 或 `index_type=pq_on_disk` 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量,向量维度 `dim` 必须能被 `pq_m` 整除。 | +| `pq_nbits` | `quantizer=pq` 时需要指定;`index_type=pq_on_disk` 时可选 | 正整数 | `pq_on_disk` 默认 `8` | 指定每个子向量量化的比特数。它决定了每个子空间码本的大小(k = 2 ^ pq_nbits),在 Faiss 中一般要求不大于 24。 | ## 如果业务需要使用 Cosine 相似度 @@ -293,6 +293,8 @@ PROPERTIES ( 量化会带来额外构建开销,原因是构建阶段需要大量距离计算,且每次计算需对量化值解码。以 128 维向量为例,随行数增长构建时间上升,SQ 相比 FLAT 可能引入约 10 倍构建成本。 +对于以 `tenant_id = ?`、`user_id = ?` 等高选择性过滤为主的查询,Doris 还提供了 [`pq_on_disk`](./pq-on-disk.md)。它不像 HNSW / IVF 那样构建面向全局召回的结构,而是通过磁盘上的 PQ 编码向量,加速过滤后候选集上的向量重排。这使它在多租户向量检索场景下尤其有价值:当一个 segment 中混合了多个租户的数据时,全局 ANN 结构在指定租户后可能召回下降,而 `pq_on_disk` 更适合这种“先过滤、后重排”的模式。 + 类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数: - `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md new file mode 100644 index 0000000000000..86f31062b43d9 --- /dev/null +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md @@ -0,0 +1,286 @@ +--- +{ + "title": "PQ On-Disk", + "language": "zh-CN", + "description": "PQ On-Disk 是 Apache Doris 面向过滤优先向量检索场景提供的向量索引形态,特别适用于多租户检索:通过在过滤后的候选集上使用 PQ 编码向量加速暴力距离计算,在较低内存占用下获得更稳定的效果。" +} +--- + + + +# Apache Doris 中的 PQ On-Disk + +`pq_on_disk` 是 Apache Doris 面向**过滤优先(filter-first)向量检索**场景提供的一种向量索引模式。它将 Product Quantization(PQ)编码后的向量存储在磁盘上,仅将 PQ codebook 和热点 chunk 保留在内存中,并在标量过滤完成后,利用压缩后的向量加速过滤结果上的暴力距离计算。 + +这个特性尤其适合**多租户向量检索**。在很多 SaaS 类业务中,不同租户的向量会被写入同一个 segment。如果直接在这些混合数据上构建全局 `hnsw` 或 `ivf` 索引,再执行 `WHERE tenant_id = ?` 这类查询,召回率往往会明显下降,因为全局召回结构是基于所有租户的混合数据构建的,而不是针对某一个租户的局部子集。`pq_on_disk` 不依赖这样的全局跨租户召回结构,而是先按租户过滤,再在过滤后的子集上通过 PQ 编码向量加速排序,因此更适合多租户场景。 + +## 适用场景 + +当查询模式通常是下面这样时,优先考虑 `pq_on_disk`: + +```sql +WHERE <高选择性过滤条件> +ORDER BY l2_distance_approximate(...) LIMIT N +``` + +常见例子包括: + +- `WHERE tenant_id = ?` +- `WHERE user_id = ?` +- `WHERE category_id = ? AND status = 'active'` +- `WHERE tag MATCH_ANY '...' + ORDER BY l2_distance_approximate(...) LIMIT N` + +这和全局 ANN 的目标不同: + +- `hnsw` 和 `ivf` 更适合在大规模向量集合上做**全局 ANN 召回**。 +- `ivf_on_disk` 仍然保留 IVF 的全局召回模型,只是将主要索引数据落盘以降低内存压力。 +- `pq_on_disk` 聚焦的是**过滤后子集上的向量重排**,即候选集已经被普通谓词显著缩小,Doris 只需要更快地对这些候选行做向量打分。 + +## 为什么它适合多租户检索 + +假设一个 segment 中混合存储了 10,000 个租户的向量。如果在这些数据上构建全局 HNSW 或 IVF 索引,而查询是: + +```sql +SELECT doc_id +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY l2_distance_approximate(embedding, ) +LIMIT 20; +``` + +这个查询只关心某一个租户的数据,但全局 ANN 结构的训练、聚类或图连接都基于所有租户的混合向量。对于“全局召回”有效的图路径、邻接关系或 IVF 分桶,并不一定适合“租户过滤之后”的局部召回,因此很容易出现指定租户后召回率下降的问题。 + +`pq_on_disk` 的处理方式不同: + +1. Doris 先执行 `tenant_id = 10001` 这样的标量过滤。 +2. 得到该租户对应的候选集。 +3. 不再依赖全局 ANN 结构在这个子集内做召回,而是使用 PQ 编码向量更快地计算这些候选行的距离。 +4. 
PQ code 按 rowid 顺序存储在磁盘,并通过专用 chunk cache 做复用。 + +因此,当满足以下条件时,`pq_on_disk` 往往比全局 ANN 结构更合适: + +- 过滤条件具有高选择性; +- 租户过滤后的召回稳定性比全局 ANN 更重要; +- 原始 float32 向量上的暴力距离计算仍然代价较高。 + +## 快速开始 + +### 建表 + +下面的例子使用 `tenant_id` 作为主过滤列: + +```sql +CREATE TABLE tenant_embeddings ( + tenant_id BIGINT NOT NULL, + doc_id BIGINT NOT NULL, + embedding ARRAY NOT NULL, + INDEX idx_embedding (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(tenant_id, doc_id) +DISTRIBUTED BY HASH(tenant_id) BUCKETS 8 +PROPERTIES ( + "replication_num" = "1" +); +``` + +### 基础查询 + +```sql +SELECT doc_id, + l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY score ASC +LIMIT 20; +``` + +这正是 `pq_on_disk` 的核心目标场景:先过滤,再在过滤后的结果里执行高效的 Top-N 向量排序。 + +## 工作原理 + +从实现角度看,`pq_on_disk` 的执行过程大致如下: + +1. Doris 为 segment 训练 PQ codebook。 +2. 原始向量被编码为紧凑的 PQ codes。 +3. PQ codes 按 rowid 顺序写入磁盘。 +4. 查询时,Doris 先计算普通谓词过滤。 +5. 
对于通过过滤的行,再加载对应的 PQ chunk,并基于 PQ code 计算近似距离,而不是直接对原始 float32 向量做全量暴力计算。 + +因此,`pq_on_disk` 更适合被理解为**基于 PQ 的过滤后暴力计算加速**,而不是像 HNSW / IVF 那样的全局召回结构。 + +## 用户接口 + +### 1)索引 DDL + +通过 `index_type="pq_on_disk"` 创建索引: + +```sql +CREATE TABLE image_pool ( + user_id BIGINT NOT NULL, + photo_id BIGINT NOT NULL, + embedding ARRAY NOT NULL, + INDEX idx_emb (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(user_id, photo_id) +DISTRIBUTED BY HASH(user_id) BUCKETS 8 +PROPERTIES ("replication_num" = "1"); +``` + +### 2)典型查询模式 + +过滤后的 Top-N: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC +LIMIT 20; +``` + +Prepared Statement 风格查询: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = ? +ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY)) ASC +LIMIT 20; +``` + +如果使用内积,则按降序排序: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 +ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC +LIMIT 20; +``` + +也支持 range search: + +```sql +SELECT photo_id +FROM image_pool +WHERE user_id = 10001 + AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0 +ORDER BY photo_id; +``` + +## 参数与约束 + +### 索引参数 + +| 属性 | 是否必填 | 默认值 | 说明 | +|---|---|---|---| +| `index_type` | 是 | - | 必须为 `pq_on_disk`。 | +| `metric_type` | 是 | - | `l2_distance` 或 `inner_product`。 | +| `dim` | 是 | - | 向量维度。 | +| `pq_m` | 是 | - | PQ 子量化器数量,必须能整除 `dim`。 | +| `pq_nbits` | 否 | `8` | 每个子量化编码的 bit 数。 | + +### 训练要求 + +`pq_on_disk` 需要足够的行数来训练 PQ codebook。最小训练行数为: + +```text +(1 << pq_nbits) * 100 +``` + +例如: + +- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据; +- `pq_nbits = 4` 时,至少需要 `1600` 行训练数据。 + +如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退为暴力搜索。 + +## BE 缓存配置 + +`pq_on_disk` 使用专用的 PQ chunk 
cache: + +- `ann_index_pq_chunk_cache_limit`(默认:`60%`) +- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`) + +其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。 + +## 可观测性 + +`pq_on_disk` 引入了专用 BE 缓存 `AnnIndexPqChunkCache`。 + +排查问题时,建议优先关注: + +- 查询是否真的足够高选择性; +- 过滤后的行是否具备较好的物理局部性; +- PQ chunk cache 是否足够大,是否频繁发生重复磁盘读取; +- 某些 segment 是否因为训练数据不足而回退为暴力搜索。 + +## 使用说明 + +- `pq_on_disk` 面向的是**过滤优先**的向量检索,而不是对整个 segment 做全局 ANN 召回。 +- 它尤其适合**多租户向量检索**,即多个租户的数据混合存储在同一个 segment 中的场景。 +- 它同时支持 `l2_distance` 和 `inner_product`,也支持 Top-N 与 range search 风格的查询。 +- 查询时排序方向要和度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。 +- 数据局部性非常重要。如果同一过滤键对应的数据在物理上更连续,PQ chunk 读取就更容易形成顺序 I/O。 +- 对于非常小的 segment 或训练样本不足的 segment,Doris 可能不会真正构建 PQ 索引,而是回退为暴力搜索。 + +## 最佳实践 + +1. 当主要查询模式是**先过滤,后重排**时,优先考虑 `pq_on_disk`。 +2. 对于 `WHERE tenant_id = ? ORDER BY ... LIMIT N` 这类**租户级检索**,优先评估 `pq_on_disk`。 +3. 让过滤列尽可能保持高选择性。过滤后候选集越小,`pq_on_disk` 越能发挥优势。 +4. 除非明确要用更小 code size 换取更低存储,否则建议从 `pq_nbits = 8` 开始。 +5. 选择 `pq_m` 时,要结合向量维度、模型特征以及实际召回目标综合评估。 +6. 对于 768 维及以上查询向量,建议使用 prepared statement,减少 SQL 解析开销。 +7. 
在上线前务必基于真实业务分布进行验证,尤其是在不同租户数据量差异较大时。 + +## 如何在 `hnsw`、`ivf_on_disk` 与 `pq_on_disk` 之间选择 + +以下场景更适合 `hnsw`: + +- 需要高召回的全局 ANN 搜索; +- 查询延迟最优先,且内存资源足够。 + +以下场景更适合 `ivf_on_disk`: + +- 仍然需要基于 IVF 的全局 ANN 召回模型; +- 内存有限,但查询仍然面向大规模全局向量集合。 + +以下场景更适合 `pq_on_disk`: + +- 查询本身已经带有高选择性的标量过滤条件; +- 不同租户或不同用户的数据混合存储在同一个 segment 中; +- 指定租户或用户过滤后,全局 ANN 的召回效果不理想; +- 希望通过压缩向量来加速过滤后候选集上的暴力距离计算。 + +可以简单理解为:`pq_on_disk` 并不是替代所有 ANN 结构的统一方案,而是当主要问题变成**如何在过滤后的子集内高效完成向量重排**时,尤其是在多租户场景下,更合适的选择。 \ No newline at end of file diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md index d04e99932cd05..8dbec97274113 100644 --- a/versioned_docs/version-4.x/ai/vector-search/overview.md +++ b/versioned_docs/version-4.x/ai/vector-search/overview.md @@ -58,22 +58,22 @@ PROPERTIES ( ); ``` -- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), or `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache) +- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (for filter-first reranking accelerated by PQ-encoded vectors stored on disk) - metric_type: `l2_distance` means using L2 distance as the distance function - dim: `128` means the vector dimension is 128 - quantizer: `flat` means each vector dimension is stored as original float32 | Parameter | Required | Supported/Options | Default | Description | |-----------|----------|-------------------|---------|-------------| -| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, and IVF On-Disk. | +| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. 
Supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for selective filter-first reranking. | | `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 = Euclidean; inner_product can approximate cosine if vectors are normalized. | | `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All imported vectors must match or an error is raised. | | `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values may improve recall/speed trade-offs but increase build overhead. | | `max_degree` | No | Positive integer | `32` | HNSW M (max neighbors per node). Affects index memory and search performance. | | `ef_construction` | No | Positive integer | `40` | HNSW efConstruction (candidate queue size during build). Larger gives better quality but slower build. | | `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw; `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization to reduce memory. | -| `pq_m` | Required when 'quantizer=pq' | Positive integer | (none) | Specifies how many subvectors are used (vector dimension dim must be divisible by pq_m). | -| `pq_nbits` | Required when 'quantizer=pq' | Positive integer | (none) | The number of bits used to represent each subvector, in faiss pq_nbits is generally required to be no greater than 24. | +| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors. The vector dimension `dim` must be divisible by `pq_m`. | +| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits used to represent each subvector. In Faiss, `pq_nbits` is generally required to be no greater than 24. 
| ## If You Need Cosine Similarity @@ -313,6 +313,8 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size Quantization introduces extra build-time overhead because each distance computation must decode quantized values. For 128-D vectors, build time increases with row count; SQ vs. FLAT can be up to ~10× slower to build. +For workloads dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Unlike global ANN structures such as HNSW or IVF, `pq_on_disk` is designed to accelerate vector reranking inside the filtered subset by using PQ-encoded vectors stored on disk. This makes it especially useful for multi-tenant vector search, where global ANN structures built on mixed-tenant segments may suffer recall degradation after tenant filtering. + Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided: - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m). diff --git a/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md b/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md new file mode 100644 index 0000000000000..1c6cd204a6379 --- /dev/null +++ b/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md @@ -0,0 +1,286 @@ +--- +{ + "title": "PQ On-Disk", + "language": "en", + "description": "PQ On-Disk is a disk-backed vector reranking mode in Apache Doris. It is designed for selective filter-first workloads such as multi-tenant vector search, and uses PQ-encoded vectors to accelerate brute-force distance evaluation on filtered rows." +} +--- + + + +# PQ On-Disk in Apache Doris + +`pq_on_disk` is a vector index mode in Apache Doris for **filter-first vector search**. 
It stores Product Quantization (PQ) codes on disk, keeps only the PQ codebook and hot chunks in memory, and uses the compressed vectors to accelerate brute-force-style distance evaluation on rows that have already passed scalar filtering. + +This feature is especially useful in **multi-tenant vector search**. In many SaaS-style workloads, vectors from many tenants are stored together in the same segment. If you build a global `hnsw` or `ivf` index on that mixed data and then query with predicates such as `WHERE tenant_id = ?`, the ANN recall can degrade significantly because the global recall structure was built across all tenants rather than for one tenant's local subset. `pq_on_disk` avoids this problem by not depending on a global cross-tenant recall structure. Instead, Doris first applies the tenant filter, then uses PQ codes to accelerate vector scoring inside the filtered subset. + +## When to Use PQ On-Disk + +Use `pq_on_disk` when your query pattern is usually: + +```sql +WHERE <highly selective filter> +ORDER BY l2_distance_approximate(...) LIMIT N +``` + +Typical examples include: + +- `WHERE tenant_id = ?` +- `WHERE user_id = ?` +- `WHERE category_id = ? AND status = 'active'` +- `WHERE tag MATCH_ANY '...' + ORDER BY l2_distance_approximate(...) LIMIT N` + +This is a different operating point from global ANN search: + +- `hnsw` and `ivf` are designed for **global ANN recall** across a large vector collection. +- `ivf_on_disk` keeps the IVF recall model but moves the main IVF data to disk to reduce memory pressure. +- `pq_on_disk` is designed for **filtered-subset reranking**, where the candidate set is already narrowed down by ordinary predicates and Doris needs a faster way to score those rows. + +## Why It Helps in Multi-Tenant Search + +Suppose a segment contains vectors from 10,000 tenants. A global HNSW or IVF index is built over all rows in the segment. 
If the query is: + +```sql +SELECT doc_id +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY l2_distance_approximate(embedding, <query_vector>) +LIMIT 20; +``` + +The query only cares about one tenant's rows, but the global ANN structure was trained or connected using vectors from all tenants. The nearest paths, graph edges, or IVF partitions that are good for global recall are not necessarily good for recall **after tenant filtering**. + +`pq_on_disk` addresses this case differently: + +1. Doris first applies the scalar predicate such as `tenant_id = 10001`. +2. It obtains a filtered candidate set for that tenant. +3. Instead of computing full float32 brute-force distances on every filtered row, Doris uses PQ-encoded vectors to evaluate distances much faster. +4. PQ code data is read from disk in rowid order and reused through a dedicated chunk cache. + +As a result, `pq_on_disk` is often a better fit than global ANN structures when: + +- the filter is highly selective, +- recall under post-filter/global ANN is unstable, +- and full brute-force over raw vectors is still too expensive. + +## Quick Start + +### Create a table + +The following example uses `tenant_id` as the main filter column: + +```sql +CREATE TABLE tenant_embeddings ( + tenant_id BIGINT NOT NULL, + doc_id BIGINT NOT NULL, + embedding ARRAY<FLOAT> NOT NULL, + INDEX idx_embedding (embedding) USING ANN PROPERTIES ( + "index_type" = "pq_on_disk", + "metric_type" = "l2_distance", + "dim" = "768", + "pq_m" = "96", + "pq_nbits" = "8" + ) +) ENGINE=OLAP +DUPLICATE KEY(tenant_id, doc_id) +DISTRIBUTED BY HASH(tenant_id) BUCKETS 8 +PROPERTIES ( + "replication_num" = "1" +); +``` + +### Basic query + +```sql +SELECT doc_id, + l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score +FROM tenant_embeddings +WHERE tenant_id = 10001 +ORDER BY score ASC +LIMIT 20; +``` + +This query pattern is the primary target of `pq_on_disk`: filter first, then do fast vector Top-N inside the filtered rows. 
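The reranking idea behind the query above can be illustrated with a small, self-contained Python sketch of generic product quantization with asymmetric distance computation (ADC). This is a toy illustration of the general technique only, not Doris's actual implementation; the codebook here is random, whereas a real index would train it (e.g. with k-means):

```python
import random

random.seed(0)
DIM, M, NBITS = 8, 2, 2          # toy sizes: 2 subvectors of 4 dims, 4 centroids each
DSUB, K = DIM // M, 1 << NBITS

# Toy codebook: K centroids per subspace (a real index trains these on data).
codebook = [[[random.random() for _ in range(DSUB)] for _ in range(K)]
            for _ in range(M)]

def encode(vec):
    """Encode a vector as M centroid ids -- the compact PQ code stored on disk."""
    code = []
    for m in range(M):
        sub = vec[m * DSUB:(m + 1) * DSUB]
        code.append(min(range(K), key=lambda k: sum(
            (a - b) ** 2 for a, b in zip(sub, codebook[m][k]))))
    return code

def adc_tables(query):
    """Per-subspace lookup tables: squared distance from the query to each centroid."""
    return [[sum((a - b) ** 2
                 for a, b in zip(query[m * DSUB:(m + 1) * DSUB], codebook[m][k]))
             for k in range(K)] for m in range(M)]

def approx_l2_sq(code, tables):
    # Approximate squared L2 distance: one table lookup per subvector.
    return sum(tables[m][code[m]] for m in range(M))

# "Filtered candidate set": pretend these rows already passed the scalar filter.
rows = [[random.random() for _ in range(DIM)] for _ in range(100)]
codes = [encode(v) for v in rows]

query = [random.random() for _ in range(DIM)]
tables = adc_tables(query)
top5 = sorted(range(len(rows)), key=lambda i: approx_l2_sq(codes[i], tables))[:5]
```

Only the small distance tables are built per query; scoring each candidate row then costs M table lookups instead of a full float comparison over all dimensions, which is the core of the speed-up on filtered rows.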

## How PQ On-Disk Works

At a high level:

1. Doris trains a PQ codebook for the segment.
2. Raw vectors are encoded into compact PQ codes.
3. PQ codes are stored on disk in rowid order.
4. At query time, Doris first evaluates ordinary predicates.
5. For rows that survive filtering, Doris loads the corresponding PQ chunks and computes approximate distances using PQ codes instead of full raw vectors.

So `pq_on_disk` is best understood as **PQ-accelerated filtered brute force**, rather than a global ANN recall structure like HNSW or IVF.

## User-Facing Interfaces

### 1) Index DDL

Use `index_type="pq_on_disk"` in the ANN index properties.

```sql
CREATE TABLE image_pool (
    user_id BIGINT NOT NULL,
    photo_id BIGINT NOT NULL,
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_emb (embedding) USING ANN PROPERTIES (
        "index_type" = "pq_on_disk",
        "metric_type" = "l2_distance",
        "dim" = "768",
        "pq_m" = "96",
        "pq_nbits" = "8"
    )
) ENGINE=OLAP
DUPLICATE KEY(user_id, photo_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
```

### 2) Typical query patterns

Top-N after filtering:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
LIMIT 20;
```

Prepared-statement style query:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = ?
ORDER BY l2_distance_approximate(embedding, CAST(? 
AS ARRAY<FLOAT>)) ASC
LIMIT 20;
```

For inner-product search, sort in descending order:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
LIMIT 20;
```

Range search is also supported:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
  AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
ORDER BY photo_id;
```

## Parameters and Constraints

### Index parameters

| Property | Required | Default | Description |
|---|---|---|---|
| `index_type` | Yes | - | Must be `pq_on_disk`. |
| `metric_type` | Yes | - | `l2_distance` or `inner_product`. |
| `dim` | Yes | - | Vector dimension. |
| `pq_m` | Yes | - | Number of PQ subquantizers. Must divide `dim`. |
| `pq_nbits` | No | `8` | Number of bits per subquantizer code. |

### Training behavior

`pq_on_disk` needs enough rows to train the PQ codebook. The minimum training row count is:

```text
(1 << pq_nbits) * 100
```

Examples:

- `pq_nbits = 8` requires at least `25600` training rows (`256 * 100`).
- `pq_nbits = 4` requires at least `1600` training rows (`16 * 100`).

If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment.

## BE Cache Configuration

`pq_on_disk` uses a dedicated chunk cache for PQ code data:

- `ann_index_pq_chunk_cache_limit` (default: `60%`)
- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`)

The percentage value of `ann_index_pq_chunk_cache_limit` is interpreted against process-available memory (`mem_limit`), not total machine memory.

## Observability

`pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`.

When troubleshooting, check the following first:

- Whether the query is actually selective enough.
- Whether the filtered rows have good locality.
- Whether the PQ chunk cache is large enough to avoid repeated disk reads.
- Whether some segments are falling back to brute force because they do not have enough rows for PQ training.

## Usage Notes

- `pq_on_disk` is intended for **filter-first** workloads, not for global ANN recall across the whole segment.
- It is particularly suitable for **multi-tenant vector search** where rows from many tenants are mixed in the same segment.
- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style queries.
- Query result ordering must match metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order.
- Data locality matters. It works best when rows for the same filter key are physically close so PQ chunk reads are more sequential.
- For very small segments or insufficient training data, Doris may not build the PQ index and can fall back to brute force.

## Best Practices

1. Choose `pq_on_disk` when the query pattern is usually **filter first, rerank second**.
2. Prefer it for **tenant-aware retrieval** such as `WHERE tenant_id = ? ORDER BY ... LIMIT N`.
3. Keep the filter column selective. The smaller the filtered candidate set, the more suitable `pq_on_disk` becomes.
4. Start with `pq_nbits = 8` unless you intentionally want a smaller code size at the cost of recall.
5. Choose `pq_m` so that `dim / pq_m` is reasonable for your model dimension and business recall target.
6. Use prepared statements for 768-D and higher query vectors to reduce SQL parsing overhead.
7. Validate on real business distributions, especially when tenant sizes are very uneven.

## How to Choose Between `hnsw`, `ivf_on_disk`, and `pq_on_disk`

Use `hnsw` when:

- You need high-recall global ANN search.
- Query latency is the top priority and enough memory is available.

Use `ivf_on_disk` when:

- You still need a global IVF-style ANN recall model.
- Memory is limited, but the query still searches a large global vector collection.

Use `pq_on_disk` when:

- The query already has a highly selective scalar filter.
- Rows from different tenants or users are mixed in the same segment.
- Global ANN recall under tenant/user filtering is poor.
- You want to accelerate filtered brute-force scoring with compressed vectors.

In short, `pq_on_disk` is not a replacement for all ANN structures. It is the right choice when the main problem is **efficient vector reranking inside a filtered subset**, especially in multi-tenant workloads.
\ No newline at end of file
diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json
index cf0ed80b49825..1cec52e08e7b8 100644
--- a/versioned_sidebars/version-4.x-sidebars.json
+++ b/versioned_sidebars/version-4.x-sidebars.json
@@ -359,6 +359,8 @@
         "ai/vector-search/practical-guide",
         "ai/vector-search/hnsw",
         "ai/vector-search/ivf",
+        "ai/vector-search/ivf-on-disk",
+        "ai/vector-search/pq-on-disk",
         "ai/vector-search/index-management",
         "ai/vector-search/resource-estimation",
         "ai/vector-search/quantization-survey",