diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md index d04e99932cd05..8dbec97274113 100644 --- a/docs/ai/vector-search/overview.md +++ b/docs/ai/vector-search/overview.md @@ -58,22 +58,22 @@ PROPERTIES ( ); ``` -- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), or `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache) +- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (for filter-first reranking accelerated by PQ-encoded vectors stored on disk) - metric_type: `l2_distance` means using L2 distance as the distance function - dim: `128` means the vector dimension is 128 - quantizer: `flat` means each vector dimension is stored as original float32 | Parameter | Required | Supported/Options | Default | Description | |-----------|----------|-------------------|---------|-------------| -| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, and IVF On-Disk. | +| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for selective filter-first reranking. | | `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 = Euclidean; inner_product can approximate cosine if vectors are normalized. | | `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All imported vectors must match or an error is raised. | | `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values may improve recall/speed trade-offs but increase build overhead. | | `max_degree` | No | Positive integer | `32` | HNSW M (max neighbors per node). Affects index memory and search performance. | | `ef_construction` | No | Positive integer | `40` | HNSW efConstruction (candidate queue size during build). Larger gives better quality but slower build. | | `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw; `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization to reduce memory. | -| `pq_m` | Required when 'quantizer=pq' | Positive integer | (none) | Specifies how many subvectors are used (vector dimension dim must be divisible by pq_m). | -| `pq_nbits` | Required when 'quantizer=pq' | Positive integer | (none) | The number of bits used to represent each subvector, in faiss pq_nbits is generally required to be no greater than 24. | +| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors. The vector dimension `dim` must be divisible by `pq_m`. | +| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits used to represent each subvector. In Faiss, `pq_nbits` is generally required to be no greater than 24. | ## If You Need Cosine Similarity @@ -313,6 +313,8 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size Quantization introduces extra build-time overhead because each distance computation must decode quantized values. 
For 128-D vectors, build time increases with row count; SQ vs. FLAT can be up to ~10× slower to build.
 
+For workloads dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Unlike global ANN structures such as HNSW or IVF, `pq_on_disk` is designed to accelerate vector reranking inside the filtered subset by using PQ-encoded vectors stored on disk. This makes it especially useful for multi-tenant vector search, where global ANN structures built on mixed-tenant segments may suffer recall degradation after tenant filtering.
+
 Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided:
 
 - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
diff --git a/docs/ai/vector-search/pq-on-disk.md b/docs/ai/vector-search/pq-on-disk.md
new file mode 100644
index 0000000000000..1c6cd204a6379
--- /dev/null
+++ b/docs/ai/vector-search/pq-on-disk.md
@@ -0,0 +1,286 @@
+---
+{
+    "title": "PQ On-Disk",
+    "language": "en",
+    "description": "PQ On-Disk is a disk-backed vector reranking mode in Apache Doris. It is designed for selective filter-first workloads such as multi-tenant vector search, and uses PQ-encoded vectors to accelerate brute-force distance evaluation on filtered rows."
+}
+---
+
+
+# PQ On-Disk in Apache Doris
+
+`pq_on_disk` is a vector index mode in Apache Doris for **filter-first vector search**. It stores Product Quantization (PQ) codes on disk, keeps only the PQ codebook and hot chunks in memory, and uses the compressed vectors to accelerate brute-force-style distance evaluation on rows that have already passed scalar filtering.
+
+This feature is especially useful in **multi-tenant vector search**. In many SaaS-style workloads, vectors from many tenants are stored together in the same segment. If you build a global `hnsw` or `ivf` index on that mixed data and then query with predicates such as `WHERE tenant_id = ?`, the ANN recall can degrade significantly because the global recall structure was built across all tenants rather than for one tenant's local subset. `pq_on_disk` avoids this problem by not depending on a global cross-tenant recall structure. Instead, Doris first applies the tenant filter, then uses PQ codes to accelerate vector scoring inside the filtered subset.
+
+## When to Use PQ On-Disk
+
+Use `pq_on_disk` when your query pattern is usually:
+
+```sql
+WHERE <highly selective filter>
+ORDER BY l2_distance_approximate(...) LIMIT N
+```
+
+Typical examples include:
+
+- `WHERE tenant_id = ?`
+- `WHERE user_id = ?`
+- `WHERE category_id = ? AND status = 'active'`
+- `WHERE tag MATCH_ANY '...'
+  ORDER BY l2_distance_approximate(...) LIMIT N`
+
+This is a different operating point from global ANN search:
+
+- `hnsw` and `ivf` are designed for **global ANN recall** across a large vector collection.
+- `ivf_on_disk` keeps the IVF recall model but moves the main IVF data to disk to reduce memory pressure.
+- `pq_on_disk` is designed for **filtered-subset reranking**, where the candidate set is already narrowed down by ordinary predicates and Doris needs a faster way to score those rows.
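+
+To get intuition for why PQ codes make the reranking step cheap, a back-of-envelope sketch helps. The numbers below assume the example parameters used later on this page (`dim = 768`, `pq_m = 96`, `pq_nbits = 8`) and are illustrative only, not a statement about Doris internals:
+
+```python
+# Rough, illustrative cost model for scoring one filtered row.
+dim = 768                                    # vector dimension
+pq_m = 96                                    # number of PQ subquantizers
+pq_nbits = 8                                 # bits per subquantizer code
+
+raw_bytes_per_row = dim * 4                  # float32 vector: 3072 bytes
+pq_bytes_per_row = pq_m * pq_nbits // 8      # PQ code: 96 bytes
+
+print(raw_bytes_per_row / pq_bytes_per_row)  # 32.0x less data to read per row
+```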
+
+## Why It Helps in Multi-Tenant Search
+
+Suppose a segment contains vectors from 10,000 tenants. A global HNSW or IVF index is built over all rows in the segment. If the query is:
+
+```sql
+SELECT doc_id
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY l2_distance_approximate(embedding, <query_vector>)
+LIMIT 20;
+```
+
+The query only cares about one tenant's rows, but the global ANN structure was trained or connected using vectors from all tenants. The nearest paths, graph edges, or IVF partitions that are good for global recall are not necessarily good for recall **after tenant filtering**.
+
+`pq_on_disk` addresses this case differently:
+
+1. Doris first applies the scalar predicate such as `tenant_id = 10001`.
+2. It obtains a filtered candidate set for that tenant.
+3. Instead of computing full float32 brute-force distances on every filtered row, Doris uses PQ-encoded vectors to evaluate distances much faster.
+4. PQ code data is read from disk in rowid order and reused through a dedicated chunk cache.
+
+As a result, `pq_on_disk` is often a better fit than global ANN structures when:
+
+- the filter is highly selective,
+- recall under post-filter/global ANN is unstable,
+- and full brute-force over raw vectors is still too expensive.
+
+## Quick Start
+
+### Create a table
+
+The following example uses `tenant_id` as the main filter column:
+
+```sql
+CREATE TABLE tenant_embeddings (
+    tenant_id BIGINT NOT NULL,
+    doc_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(tenant_id, doc_id)
+DISTRIBUTED BY HASH(tenant_id) BUCKETS 8
+PROPERTIES (
+    "replication_num" = "1"
+);
+```
+
+### Basic query
+
+```sql
+SELECT doc_id,
+       l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY score ASC
+LIMIT 20;
+```
+
+This query pattern is the primary target of `pq_on_disk`: filter first, then do fast vector Top-N inside the filtered rows.
+
+## How PQ On-Disk Works
+
+At a high level:
+
+1. Doris trains a PQ codebook for the segment.
+2. Raw vectors are encoded into compact PQ codes.
+3. PQ codes are stored on disk in rowid order.
+4. At query time, Doris first evaluates ordinary predicates.
+5. For rows that survive filtering, Doris loads the corresponding PQ chunks and computes approximate distances using PQ codes instead of full raw vectors.
+
+So `pq_on_disk` is best understood as **PQ-accelerated filtered brute-force**, rather than a global ANN recall structure like HNSW or IVF.
+
+## User-Facing Interfaces
+
+### 1) Index DDL
+
+Use `index_type="pq_on_disk"` in ANN index properties.
+
+```sql
+CREATE TABLE image_pool (
+    user_id BIGINT NOT NULL,
+    photo_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_emb (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(user_id, photo_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 8
+PROPERTIES ("replication_num" = "1");
+```
+
+### 2) Typical query patterns
+
+Top-N after filtering:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
+LIMIT 20;
+```
+
+Prepared-statement style query:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = ?
+ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY<FLOAT>)) ASC
+LIMIT 20;
+```
+For inner-product search, sort in descending order:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
+LIMIT 20;
+```
+
+Range search is also supported:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+  AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
+ORDER BY photo_id;
+```
+
+## Parameters and Constraints
+
+### Index parameters
+
+| Property | Required | Default | Description |
+|---|---|---|---|
+| `index_type` | Yes | - | Must be `pq_on_disk`. |
+| `metric_type` | Yes | - | `l2_distance` or `inner_product`. |
+| `dim` | Yes | - | Vector dimension. |
+| `pq_m` | Yes | - | Number of PQ subquantizers. Must divide `dim`. |
+| `pq_nbits` | No | `8` | Number of bits per subquantizer code. |
+
+### Training behavior
+
+`pq_on_disk` needs enough rows to train the PQ codebook. The minimum training row count is:
+
+```text
+(1 << pq_nbits) * 100
+```
+
+Examples:
+
+- `pq_nbits = 8` requires at least `25600` training rows.
+- `pq_nbits = 4` requires at least `1600` training rows.
+
+If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment.
+
+## BE Cache Configuration
+
+`pq_on_disk` uses a dedicated chunk cache for PQ code data:
+
+- `ann_index_pq_chunk_cache_limit` (default: `60%`)
+- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`)
+
+The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory.
+
+## Observability
+
+`pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`.
+
+When troubleshooting, check the following first:
+
+- Whether the query is actually selective enough.
+- Whether the filtered rows have good locality.
+- Whether the PQ chunk cache is large enough to avoid repeated disk reads.
+- Whether some segments are falling back to brute force because they do not have enough rows for PQ training.
+
+## Usage Notes
+
+- `pq_on_disk` is intended for **filter-first** workloads, not for global ANN recall across the whole segment.
+- It is particularly suitable for **multi-tenant vector search** where rows from many tenants are mixed in the same segment.
+- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style queries.
+- Query result ordering must match metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order.
+- Data locality matters. It works best when rows for the same filter key are physically close so PQ chunk reads are more sequential.
+- For very small segments or insufficient training data, Doris may not build the PQ index and can fall back to brute force.
+
+## Best Practices
+
+1. Choose `pq_on_disk` when the query pattern is usually **filter first, rerank second**.
+2. Prefer it for **tenant-aware retrieval** such as `WHERE tenant_id = ? ORDER BY ... LIMIT N`.
+3. Keep the filter column selective. The smaller the filtered candidate set, the more suitable `pq_on_disk` becomes.
+4. Start with `pq_nbits = 8` unless you intentionally want a smaller code size at the cost of recall.
+5. Choose `pq_m` so that `dim / pq_m` is reasonable for your model dimension and business recall target (see the sketch after this list).
+6. Use prepared statements for 768-D and higher query vectors to reduce SQL parsing overhead.
+7. Validate on real business distributions, especially when tenant sizes are very uneven.
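+
+As a quick way to tie items 3 to 5 to the constraints above, here is a small sanity-check sketch. `check_pq_params` is a hypothetical helper written for this page, not a Doris API; it only restates the documented rules (`dim` divisible by `pq_m`, minimum training rows of `(1 << pq_nbits) * 100`):
+
+```python
+# Hypothetical helper (not part of Doris) that checks pq_on_disk parameters
+# against the constraints documented on this page.
+def check_pq_params(dim: int, pq_m: int, pq_nbits: int = 8) -> dict:
+    if dim % pq_m != 0:
+        raise ValueError(f"dim={dim} must be divisible by pq_m={pq_m}")
+    return {
+        "subvector_dim": dim // pq_m,                # dimensions per subquantizer
+        "code_bytes_per_row": pq_m * pq_nbits // 8,  # on-disk PQ code size
+        "min_training_rows": (1 << pq_nbits) * 100,  # documented training minimum
+    }
+
+print(check_pq_params(dim=768, pq_m=96, pq_nbits=8))
+# {'subvector_dim': 8, 'code_bytes_per_row': 96, 'min_training_rows': 25600}
+```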
+
+## How to Choose Between `hnsw`, `ivf_on_disk`, and `pq_on_disk`
+
+Use `hnsw` when:
+
+- You need high-recall global ANN search.
+- Query latency is the top priority and enough memory is available.
+
+Use `ivf_on_disk` when:
+
+- You still need a global IVF-style ANN recall model.
+- Memory is limited, but the query still searches a large global vector collection.
+
+Use `pq_on_disk` when:
+
+- The query already has a highly selective scalar filter.
+- Rows from different tenants or users are mixed in the same segment.
+- Global ANN recall under tenant/user filtering is poor.
+- You want to accelerate filtered brute-force scoring with compressed vectors.
+
+In short, `pq_on_disk` is not a replacement for all ANN structures. It is the right choice when the main problem is **efficient vector reranking inside a filtered subset**, especially in multi-tenant workloads.
\ No newline at end of file
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index de02c0eae3323..604e91aed1c60 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -49,7 +49,7 @@ PROPERTIES (
   "replication_num" = "1"
 );
 ```
-- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)或 `ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF)
+- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)、`ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF)或 `pq_on_disk`(将 PQ 编码向量落盘、用于过滤后向量重排加速)
 - metric_type: l2_distance 表示使用 L2 距离作为距离函数
 - dim: 128 表示向量维度为 128
 - quantizer: flat 表示按原始 float32 存储各维度
@@ -57,15 +57,15 @@ PROPERTIES (
 | 参数 | 是否必填 | 支持/可选值 | 默认值 | 说明 |
 |------|----------|-------------|--------|------|
-| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF 和 IVF On-Disk。 |
+| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk`、`pq_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF、IVF On-Disk,以及面向高选择性过滤后重排的 PQ On-Disk。 |
 | `metric_type` | 是 | `l2_distance`,`inner_product` | (无) | 指定向量相似度/距离度量方式。L2 为欧氏距离,inner_product 可用于余弦相似时需先归一化向量。 |
 | `dim` | 是 | 正整数 (> 0) | (无) | 指定向量维度,后续导入的所有向量的维度必须与此一致,否则报错。 |
 | `nlist` | 否 | 正整数 | `1024` | IVF 的倒排桶数量。在 `index_type=ivf` 或 `index_type=ivf_on_disk` 时生效;取值越大通常有助于召回率/速度权衡,但会增加构建开销。 |
 | `max_degree` | 否 | 正整数 | `32` | HNSW 图中单个节点的最大邻居数(M),影响索引内存与搜索性能。 |
 | `ef_construction` | 否 | 正整数 | `40` | HNSW 构建阶段的候选队列大小(efConstruction),越大构图质量越好但构建更慢。 |
 | `quantizer` | 否 | `flat`,`sq8`,`sq4`, `pq` | `flat` | 指定向量编码/量化方式:`flat` 为原始存储,`sq8`/`sq4` 为标量量化(8/4 bit), `pq` 为乘积量化。 |
-| `pq_m` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 |
-| `pq_nbits` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。 |
+| `pq_m` | `quantizer=pq` 或 `index_type=pq_on_disk` 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量,向量维度 `dim` 必须能被 `pq_m` 整除。 |
+| `pq_nbits` | `quantizer=pq` 时需要指定;`index_type=pq_on_disk` 时可选 | 正整数 | `pq_on_disk` 默认 `8` | 指定每个子向量量化的比特数。它决定了每个子空间码本的大小(k = 2 ^ pq_nbits),在 Faiss 中一般要求不大于 24。 |
 
 ## 如果业务需要使用 Cosine 相似度
 
@@ -293,6 +293,8 @@ PROPERTIES (
 
 量化会带来额外构建开销,原因是构建阶段需要大量距离计算,且每次计算需对量化值解码。以 128 维向量为例,随行数增长构建时间上升,SQ 相比 FLAT 可能引入约 10 倍构建成本。
+
+对于以 `tenant_id = ?`、`user_id = ?` 等高选择性过滤为主的查询,Doris 还提供了 [`pq_on_disk`](./pq-on-disk.md)。它不像 HNSW / IVF 那样构建面向全局召回的结构,而是通过磁盘上的 PQ 编码向量,加速过滤后候选集上的向量重排。这使它在多租户向量检索场景下尤其有价值:当一个 segment 中混合了多个租户的数据时,全局 ANN 结构在指定租户后可能召回下降,而 `pq_on_disk` 更适合这种“先过滤、后重排”的模式。
+
 类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数:
 
 - `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md
new file mode 100644
index 0000000000000..86f31062b43d9
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/pq-on-disk.md
@@ -0,0 +1,286 @@
+---
+{
+    "title": "PQ On-Disk",
+    "language": "zh-CN",
+    "description": "PQ On-Disk 是 Apache Doris 面向过滤优先向量检索场景提供的向量索引形态,特别适用于多租户检索:通过在过滤后的候选集上使用 PQ 编码向量加速暴力距离计算,在较低内存占用下获得更稳定的效果。"
+}
+---
+
+
+# Apache Doris 中的 PQ On-Disk
+
+`pq_on_disk` 是 Apache Doris 面向**过滤优先(filter-first)向量检索**场景提供的一种向量索引模式。它将 Product Quantization(PQ)编码后的向量存储在磁盘上,仅将 PQ codebook 和热点 chunk 保留在内存中,并在标量过滤完成后,利用压缩后的向量加速过滤结果上的暴力距离计算。
+
+这个特性尤其适合**多租户向量检索**。在很多 SaaS 类业务中,不同租户的向量会被写入同一个 segment。如果直接在这些混合数据上构建全局 `hnsw` 或 `ivf` 索引,再执行 `WHERE tenant_id = ?` 这类查询,召回率往往会明显下降,因为全局召回结构是基于所有租户的混合数据构建的,而不是针对某一个租户的局部子集。`pq_on_disk` 不依赖这样的全局跨租户召回结构,而是先按租户过滤,再在过滤后的子集上通过 PQ 编码向量加速排序,因此更适合多租户场景。
+
+## 适用场景
+
+当查询模式通常是下面这样时,优先考虑 `pq_on_disk`:
+
+```sql
+WHERE <高选择性过滤条件>
+ORDER BY l2_distance_approximate(...) LIMIT N
+```
+
+常见例子包括:
+
+- `WHERE tenant_id = ?`
+- `WHERE user_id = ?`
+- `WHERE category_id = ? AND status = 'active'`
+- `WHERE tag MATCH_ANY '...'
+  ORDER BY l2_distance_approximate(...) LIMIT N`
+
+这和全局 ANN 的目标不同:
+
+- `hnsw` 和 `ivf` 更适合在大规模向量集合上做**全局 ANN 召回**。
+- `ivf_on_disk` 仍然保留 IVF 的全局召回模型,只是将主要索引数据落盘以降低内存压力。
+- `pq_on_disk` 聚焦的是**过滤后子集上的向量重排**,即候选集已经被普通谓词显著缩小,Doris 只需要更快地对这些候选行做向量打分。
+
+## 为什么它适合多租户检索
+
+假设一个 segment 中混合存储了 10,000 个租户的向量。如果在这些数据上构建全局 HNSW 或 IVF 索引,而查询是:
+
+```sql
+SELECT doc_id
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY l2_distance_approximate(embedding, <查询向量>)
+LIMIT 20;
+```
+
+这个查询只关心某一个租户的数据,但全局 ANN 结构的训练、聚类或图连接都基于所有租户的混合向量。对于“全局召回”有效的图路径、邻接关系或 IVF 分桶,并不一定适合“租户过滤之后”的局部召回,因此很容易出现指定租户后召回率下降的问题。
+
+`pq_on_disk` 的处理方式不同:
+
+1. Doris 先执行 `tenant_id = 10001` 这样的标量过滤。
+2. 得到该租户对应的候选集。
+3. 不再依赖全局 ANN 结构在这个子集内做召回,而是使用 PQ 编码向量更快地计算这些候选行的距离。
+4. PQ code 按 rowid 顺序存储在磁盘,并通过专用 chunk cache 做复用。
+
+因此,当满足以下条件时,`pq_on_disk` 往往比全局 ANN 结构更合适:
+
+- 过滤条件具有高选择性;
+- 租户过滤后的召回稳定性比全局 ANN 更重要;
+- 原始 float32 向量上的暴力距离计算仍然代价较高。
+
+## 快速开始
+
+### 建表
+
+下面的例子使用 `tenant_id` 作为主过滤列:
+
+```sql
+CREATE TABLE tenant_embeddings (
+    tenant_id BIGINT NOT NULL,
+    doc_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(tenant_id, doc_id)
+DISTRIBUTED BY HASH(tenant_id) BUCKETS 8
+PROPERTIES (
+    "replication_num" = "1"
+);
+```
+
+### 基础查询
+
+```sql
+SELECT doc_id,
+       l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY score ASC
+LIMIT 20;
+```
+
+这正是 `pq_on_disk` 的核心目标场景:先过滤,再在过滤后的结果里执行高效的 Top-N 向量排序。
+
+## 工作原理
+
+从实现角度看,`pq_on_disk` 的执行过程大致如下:
+
+1. Doris 为 segment 训练 PQ codebook。
+2. 原始向量被编码为紧凑的 PQ codes。
+3. PQ codes 按 rowid 顺序写入磁盘。
+4. 查询时,Doris 先计算普通谓词过滤。
+5. 对于通过过滤的行,再加载对应的 PQ chunk,并基于 PQ code 计算近似距离,而不是直接对原始 float32 向量做全量暴力计算。
+
+因此,`pq_on_disk` 更适合被理解为**基于 PQ 的过滤后暴力计算加速**,而不是像 HNSW / IVF 那样的全局召回结构。
+
+## 用户接口
+
+### 1)索引 DDL
+
+通过 `index_type="pq_on_disk"` 创建索引:
+
+```sql
+CREATE TABLE image_pool (
+    user_id BIGINT NOT NULL,
+    photo_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_emb (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(user_id, photo_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 8
+PROPERTIES ("replication_num" = "1");
+```
+
+### 2)典型查询模式
+
+过滤后的 Top-N:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
+LIMIT 20;
+```
+
+Prepared Statement 风格查询:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = ?
+ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY<FLOAT>)) ASC
+LIMIT 20;
+```
+
+如果使用内积,则按降序排序:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
+LIMIT 20;
+```
+
+也支持 range search:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+  AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
+ORDER BY photo_id;
+```
+
+## 参数与约束
+
+### 索引参数
+
+| 属性 | 是否必填 | 默认值 | 说明 |
+|---|---|---|---|
+| `index_type` | 是 | - | 必须为 `pq_on_disk`。 |
+| `metric_type` | 是 | - | `l2_distance` 或 `inner_product`。 |
+| `dim` | 是 | - | 向量维度。 |
+| `pq_m` | 是 | - | PQ 子量化器数量,必须能整除 `dim`。 |
+| `pq_nbits` | 否 | `8` | 每个子量化编码的 bit 数。 |
+
+### 训练要求
+
+`pq_on_disk` 需要足够的行数来训练 PQ codebook。最小训练行数为:
+
+```text
+(1 << pq_nbits) * 100
+```
+
+例如:
+
+- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据;
+- `pq_nbits = 4` 时,至少需要 `1600` 行训练数据。
+
+如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退为暴力搜索。
+
+## BE 缓存配置
+
+`pq_on_disk` 使用专用的 PQ chunk cache:
+
+- `ann_index_pq_chunk_cache_limit`(默认:`60%`)
+- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`)
+
+其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。
+
+## 可观测性
+
+`pq_on_disk` 引入了专用 BE 缓存 `AnnIndexPqChunkCache`。
+
+排查问题时,建议优先关注:
+
+- 查询是否真的足够高选择性;
+- 过滤后的行是否具备较好的物理局部性;
+- PQ chunk cache 是否足够大,是否频繁发生重复磁盘读取;
+- 某些 segment 是否因为训练数据不足而回退为暴力搜索。
+
+## 使用说明
+
+- `pq_on_disk` 面向的是**过滤优先**的向量检索,而不是对整个 segment 做全局 ANN 召回。
+- 它尤其适合**多租户向量检索**,即多个租户的数据混合存储在同一个 segment 中的场景。
+- 它同时支持 `l2_distance` 和 `inner_product`,也支持 Top-N 与 range search 风格的查询。
+- 查询时排序方向要和度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。
+- 数据局部性非常重要。如果同一过滤键对应的数据在物理上更连续,PQ chunk 读取就更容易形成顺序 I/O。
+- 对于非常小的 segment 或训练样本不足的 segment,Doris 可能不会真正构建 PQ 索引,而是回退为暴力搜索。
+
+## 最佳实践
+
+1. 当主要查询模式是**先过滤,后重排**时,优先考虑 `pq_on_disk`。
+2. 对于 `WHERE tenant_id = ? ORDER BY ... LIMIT N` 这类**租户级检索**,优先评估 `pq_on_disk`。
+3. 让过滤列尽可能保持高选择性。过滤后候选集越小,`pq_on_disk` 越能发挥优势。
+4. 除非明确要用更小 code size 换取更低存储,否则建议从 `pq_nbits = 8` 开始。
+5. 选择 `pq_m` 时,要结合向量维度、模型特征以及实际召回目标综合评估。
+6. 对于 768 维及以上查询向量,建议使用 prepared statement,减少 SQL 解析开销。
+7. 
在上线前务必基于真实业务分布进行验证,尤其是在不同租户数据量差异较大时。 + +## 如何在 `hnsw`、`ivf_on_disk` 与 `pq_on_disk` 之间选择 + +以下场景更适合 `hnsw`: + +- 需要高召回的全局 ANN 搜索; +- 查询延迟最优先,且内存资源足够。 + +以下场景更适合 `ivf_on_disk`: + +- 仍然需要基于 IVF 的全局 ANN 召回模型; +- 内存有限,但查询仍然面向大规模全局向量集合。 + +以下场景更适合 `pq_on_disk`: + +- 查询本身已经带有高选择性的标量过滤条件; +- 不同租户或不同用户的数据混合存储在同一个 segment 中; +- 指定租户或用户过滤后,全局 ANN 的召回效果不理想; +- 希望通过压缩向量来加速过滤后候选集上的暴力距离计算。 + +可以简单理解为:`pq_on_disk` 并不是替代所有 ANN 结构的统一方案,而是当主要问题变成**如何在过滤后的子集内高效完成向量重排**时,尤其是在多租户场景下,更合适的选择。 \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md index de02c0eae3323..604e91aed1c60 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md @@ -49,7 +49,7 @@ PROPERTIES ( "replication_num" = "1" ); ``` -- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)或 `ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF) +- index_type: 可选 `hnsw`([Hierarchical Navigable Small World 算法](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world))、`ivf`(倒排文件索引)、`ivf_on_disk`(倒排列表落盘并通过缓存提供查询能力的 IVF)或 `pq_on_disk`(将 PQ 编码向量落盘、用于过滤后向量重排加速) - metric_type: l2_distance 表示使用 L2 距离作为距离函数 - dim: 128 表示向量维度为 128 - quantizer: flat 表示按原始 float32 存储各维度 @@ -57,15 +57,15 @@ PROPERTIES ( | 参数 | 是否必填 | 支持/可选值 | 默认值 | 说明 | |------|----------|-------------|--------|------| -| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF 和 IVF On-Disk。 | +| `index_type` | 是 | 支持:`hnsw`、`ivf`、`ivf_on_disk`、`pq_on_disk` | (无) | 指定所使用的 ANN 索引算法。当前支持 HNSW、内存 IVF、IVF On-Disk,以及面向高选择性过滤后重排的 PQ On-Disk。 | | `metric_type` | 是 | `l2_distance`,`inner_product` | (无) | 指定向量相似度/距离度量方式。L2 为欧氏距离,inner_product 可用于余弦相似时需先归一化向量。 | | `dim` | 是 | 正整数 (> 0) | (无) | 指定向量维度,后续导入的所有向量的维度必须与此一致,否则报错。 | | `nlist` | 否 | 正整数 | `1024` | IVF 的倒排桶数量。在 `index_type=ivf` 或 `index_type=ivf_on_disk` 时生效;取值越大通常有助于召回率/速度权衡,但会增加构建开销。 | | `max_degree` | 否 | 正整数 | `32` | HNSW 图中单个节点的最大邻居数(M),影响索引内存与搜索性能。 | | `ef_construction` | 否 | 正整数 | `40` | HNSW 构建阶段的候选队列大小(efConstruction),越大构图质量越好但构建更慢。 | | `quantizer` | 否 | `flat`,`sq8`,`sq4`, `pq` | `flat` | 指定向量编码/量化方式:`flat` 为原始存储,`sq8`/`sq4` 为标量量化(8/4 bit), `pq` 为乘积量化。 | -| `pq_m` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 | -| `pq_nbits` | 'quantizer=pq' 时需要指定 | 正整数 | (无) | 指定每个子向量量化的比特数, 它决定了每个子空间码本的大小(k = 2 ^ pq_nbits), 在faiss中pq_nbits值一般要求不大于24。 | +| `pq_m` | `quantizer=pq` 或 `index_type=pq_on_disk` 时需要指定 | 正整数 | (无) | 指定将原始的高维向量分割成多少个子向量,向量维度 `dim` 必须能被 `pq_m` 整除。 | +| `pq_nbits` | `quantizer=pq` 时需要指定;`index_type=pq_on_disk` 时可选 | 正整数 | `pq_on_disk` 默认 `8` | 指定每个子向量量化的比特数。它决定了每个子空间码本的大小(k = 2 ^ pq_nbits),在 Faiss 中一般要求不大于 24。 | ## 如果业务需要使用 Cosine 相似度 @@ -293,6 +293,8 @@ PROPERTIES ( 量化会带来额外构建开销,原因是构建阶段需要大量距离计算,且每次计算需对量化值解码。以 128 维向量为例,随行数增长构建时间上升,SQ 相比 FLAT 可能引入约 10 倍构建成本。 +对于以 `tenant_id = ?`、`user_id = ?` 等高选择性过滤为主的查询,Doris 还提供了 [`pq_on_disk`](./pq-on-disk.md)。它不像 HNSW / IVF 那样构建面向全局召回的结构,而是通过磁盘上的 PQ 编码向量,加速过滤后候选集上的向量重排。这使它在多租户向量检索场景下尤其有价值:当一个 segment 中混合了多个租户的数据时,全局 ANN 结构在指定租户后可能召回下降,而 `pq_on_disk` 更适合这种“先过滤、后重排”的模式。 + 类似的, Doris也支持乘积量化, 不过需要注意的是在使用PQ时需要提供额外的参数: - `pq_m`: 表示将原始的高维向量分割成多少个子向量(向量维度 dim 必须能被 pq_m 整除)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md
new file mode 100644
index 0000000000000..86f31062b43d9
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/pq-on-disk.md
@@ -0,0 +1,286 @@
+---
+{
+    "title": "PQ On-Disk",
+    "language": "zh-CN",
+    "description": "PQ On-Disk 是 Apache Doris 面向过滤优先向量检索场景提供的向量索引形态,特别适用于多租户检索:通过在过滤后的候选集上使用 PQ 编码向量加速暴力距离计算,在较低内存占用下获得更稳定的效果。"
+}
+---
+
+
+# Apache Doris 中的 PQ On-Disk
+
+`pq_on_disk` 是 Apache Doris 面向**过滤优先(filter-first)向量检索**场景提供的一种向量索引模式。它将 Product Quantization(PQ)编码后的向量存储在磁盘上,仅将 PQ codebook 和热点 chunk 保留在内存中,并在标量过滤完成后,利用压缩后的向量加速过滤结果上的暴力距离计算。
+
+这个特性尤其适合**多租户向量检索**。在很多 SaaS 类业务中,不同租户的向量会被写入同一个 segment。如果直接在这些混合数据上构建全局 `hnsw` 或 `ivf` 索引,再执行 `WHERE tenant_id = ?` 这类查询,召回率往往会明显下降,因为全局召回结构是基于所有租户的混合数据构建的,而不是针对某一个租户的局部子集。`pq_on_disk` 不依赖这样的全局跨租户召回结构,而是先按租户过滤,再在过滤后的子集上通过 PQ 编码向量加速排序,因此更适合多租户场景。
+
+## 适用场景
+
+当查询模式通常是下面这样时,优先考虑 `pq_on_disk`:
+
+```sql
+WHERE <高选择性过滤条件>
+ORDER BY l2_distance_approximate(...) LIMIT N
+```
+
+常见例子包括:
+
+- `WHERE tenant_id = ?`
+- `WHERE user_id = ?`
+- `WHERE category_id = ? AND status = 'active'`
+- `WHERE tag MATCH_ANY '...'
+  ORDER BY l2_distance_approximate(...) LIMIT N`
+
+这和全局 ANN 的目标不同:
+
+- `hnsw` 和 `ivf` 更适合在大规模向量集合上做**全局 ANN 召回**。
+- `ivf_on_disk` 仍然保留 IVF 的全局召回模型,只是将主要索引数据落盘以降低内存压力。
+- `pq_on_disk` 聚焦的是**过滤后子集上的向量重排**,即候选集已经被普通谓词显著缩小,Doris 只需要更快地对这些候选行做向量打分。
+
+## 为什么它适合多租户检索
+
+假设一个 segment 中混合存储了 10,000 个租户的向量。如果在这些数据上构建全局 HNSW 或 IVF 索引,而查询是:
+
+```sql
+SELECT doc_id
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY l2_distance_approximate(embedding, <查询向量>)
+LIMIT 20;
+```
+
+这个查询只关心某一个租户的数据,但全局 ANN 结构的训练、聚类或图连接都基于所有租户的混合向量。对于“全局召回”有效的图路径、邻接关系或 IVF 分桶,并不一定适合“租户过滤之后”的局部召回,因此很容易出现指定租户后召回率下降的问题。
+
+`pq_on_disk` 的处理方式不同:
+
+1. Doris 先执行 `tenant_id = 10001` 这样的标量过滤。
+2. 得到该租户对应的候选集。
+3. 不再依赖全局 ANN 结构在这个子集内做召回,而是使用 PQ 编码向量更快地计算这些候选行的距离。
+4. PQ code 按 rowid 顺序存储在磁盘,并通过专用 chunk cache 做复用。
+
+因此,当满足以下条件时,`pq_on_disk` 往往比全局 ANN 结构更合适:
+
+- 过滤条件具有高选择性;
+- 租户过滤后的召回稳定性比全局 ANN 更重要;
+- 原始 float32 向量上的暴力距离计算仍然代价较高。
+
+## 快速开始
+
+### 建表
+
+下面的例子使用 `tenant_id` 作为主过滤列:
+
+```sql
+CREATE TABLE tenant_embeddings (
+    tenant_id BIGINT NOT NULL,
+    doc_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(tenant_id, doc_id)
+DISTRIBUTED BY HASH(tenant_id) BUCKETS 8
+PROPERTIES (
+    "replication_num" = "1"
+);
+```
+
+### 基础查询
+
+```sql
+SELECT doc_id,
+       l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY score ASC
+LIMIT 20;
+```
+
+这正是 `pq_on_disk` 的核心目标场景:先过滤,再在过滤后的结果里执行高效的 Top-N 向量排序。
+
+## 工作原理
+
+从实现角度看,`pq_on_disk` 的执行过程大致如下:
+
+1. Doris 为 segment 训练 PQ codebook。
+2. 原始向量被编码为紧凑的 PQ codes。
+3. PQ codes 按 rowid 顺序写入磁盘。
+4. 查询时,Doris 先计算普通谓词过滤。
+5. 对于通过过滤的行,再加载对应的 PQ chunk,并基于 PQ code 计算近似距离,而不是直接对原始 float32 向量做全量暴力计算。
+
+因此,`pq_on_disk` 更适合被理解为**基于 PQ 的过滤后暴力计算加速**,而不是像 HNSW / IVF 那样的全局召回结构。
+
+## 用户接口
+
+### 1)索引 DDL
+
+通过 `index_type="pq_on_disk"` 创建索引:
+
+```sql
+CREATE TABLE image_pool (
+    user_id BIGINT NOT NULL,
+    photo_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_emb (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(user_id, photo_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 8
+PROPERTIES ("replication_num" = "1");
+```
+
+### 2)典型查询模式
+
+过滤后的 Top-N:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
+LIMIT 20;
+```
+
+Prepared Statement 风格查询:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = ?
+ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY<FLOAT>)) ASC
+LIMIT 20;
+```
+
+如果使用内积,则按降序排序:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
+LIMIT 20;
+```
+
+也支持 range search:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+  AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
+ORDER BY photo_id;
+```
+
+## 参数与约束
+
+### 索引参数
+
+| 属性 | 是否必填 | 默认值 | 说明 |
+|---|---|---|---|
+| `index_type` | 是 | - | 必须为 `pq_on_disk`。 |
+| `metric_type` | 是 | - | `l2_distance` 或 `inner_product`。 |
+| `dim` | 是 | - | 向量维度。 |
+| `pq_m` | 是 | - | PQ 子量化器数量,必须能整除 `dim`。 |
+| `pq_nbits` | 否 | `8` | 每个子量化编码的 bit 数。 |
+
+### 训练要求
+
+`pq_on_disk` 需要足够的行数来训练 PQ codebook。最小训练行数为:
+
+```text
+(1 << pq_nbits) * 100
+```
+
+例如:
+
+- `pq_nbits = 8` 时,至少需要 `25600` 行训练数据;
+- `pq_nbits = 4` 时,至少需要 `1600` 行训练数据。
+
+如果某个 segment 的数据量不足以训练 PQ 索引,Doris 可能会对该 segment 回退为暴力搜索。
+
+## BE 缓存配置
+
+`pq_on_disk` 使用专用的 PQ chunk cache:
+
+- `ann_index_pq_chunk_cache_limit`(默认:`60%`)
+- `ann_index_pq_chunk_cache_stale_sweep_time_sec`(默认:`1800`)
+
+其中 `ann_index_pq_chunk_cache_limit` 的百分比基准是 BE 进程可用内存(受 `mem_limit` 约束),不是整机物理内存。
+
+## 可观测性
+
+`pq_on_disk` 引入了专用 BE 缓存 `AnnIndexPqChunkCache`。
+
+排查问题时,建议优先关注:
+
+- 查询是否真的足够高选择性;
+- 过滤后的行是否具备较好的物理局部性;
+- PQ chunk cache 是否足够大,是否频繁发生重复磁盘读取;
+- 某些 segment 是否因为训练数据不足而回退为暴力搜索。
+
+## 使用说明
+
+- `pq_on_disk` 面向的是**过滤优先**的向量检索,而不是对整个 segment 做全局 ANN 召回。
+- 它尤其适合**多租户向量检索**,即多个租户的数据混合存储在同一个 segment 中的场景。
+- 它同时支持 `l2_distance` 和 `inner_product`,也支持 Top-N 与 range search 风格的查询。
+- 查询时排序方向要和度量语义一致:`l2_distance_approximate` 用升序,`inner_product_approximate` 用降序。
+- 数据局部性非常重要。如果同一过滤键对应的数据在物理上更连续,PQ chunk 读取就更容易形成顺序 I/O。
+- 对于非常小的 segment 或训练样本不足的 segment,Doris 可能不会真正构建 PQ 索引,而是回退为暴力搜索。
+
+## 最佳实践
+
+1. 当主要查询模式是**先过滤,后重排**时,优先考虑 `pq_on_disk`。
+2. 对于 `WHERE tenant_id = ? ORDER BY ... LIMIT N` 这类**租户级检索**,优先评估 `pq_on_disk`。
+3. 让过滤列尽可能保持高选择性。过滤后候选集越小,`pq_on_disk` 越能发挥优势。
+4. 除非明确要用更小 code size 换取更低存储,否则建议从 `pq_nbits = 8` 开始。
+5. 选择 `pq_m` 时,要结合向量维度、模型特征以及实际召回目标综合评估。
+6. 对于 768 维及以上查询向量,建议使用 prepared statement,减少 SQL 解析开销。
+7. 
在上线前务必基于真实业务分布进行验证,尤其是在不同租户数据量差异较大时。 + +## 如何在 `hnsw`、`ivf_on_disk` 与 `pq_on_disk` 之间选择 + +以下场景更适合 `hnsw`: + +- 需要高召回的全局 ANN 搜索; +- 查询延迟最优先,且内存资源足够。 + +以下场景更适合 `ivf_on_disk`: + +- 仍然需要基于 IVF 的全局 ANN 召回模型; +- 内存有限,但查询仍然面向大规模全局向量集合。 + +以下场景更适合 `pq_on_disk`: + +- 查询本身已经带有高选择性的标量过滤条件; +- 不同租户或不同用户的数据混合存储在同一个 segment 中; +- 指定租户或用户过滤后,全局 ANN 的召回效果不理想; +- 希望通过压缩向量来加速过滤后候选集上的暴力距离计算。 + +可以简单理解为:`pq_on_disk` 并不是替代所有 ANN 结构的统一方案,而是当主要问题变成**如何在过滤后的子集内高效完成向量重排**时,尤其是在多租户场景下,更合适的选择。 \ No newline at end of file diff --git a/sidebars.ts b/sidebars.ts index 5f05731a615d4..239a0b7562e2a 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -359,6 +359,7 @@ const sidebars: SidebarsConfig = { 'ai/vector-search/hnsw', 'ai/vector-search/ivf', 'ai/vector-search/ivf-on-disk', + 'ai/vector-search/pq-on-disk', 'ai/vector-search/index-management', 'ai/vector-search/resource-estimation', 'ai/vector-search/quantization-survey', diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md index d04e99932cd05..8dbec97274113 100644 --- a/versioned_docs/version-4.x/ai/vector-search/overview.md +++ b/versioned_docs/version-4.x/ai/vector-search/overview.md @@ -58,22 +58,22 @@ PROPERTIES ( ); ``` -- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), or `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache) +- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (for filter-first reranking accelerated by PQ-encoded vectors stored on disk) - metric_type: `l2_distance` means using L2 distance as the distance function - dim: `128` means the vector dimension is 128 - quantizer: `flat` means each vector dimension is stored as original float32 | Parameter | Required | Supported/Options | Default | Description | |-----------|----------|-------------------|---------|-------------| -| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, and IVF On-Disk. | +| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for selective filter-first reranking. | | `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 = Euclidean; inner_product can approximate cosine if vectors are normalized. | | `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All imported vectors must match or an error is raised. | | `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values may improve recall/speed trade-offs but increase build overhead. | | `max_degree` | No | Positive integer | `32` | HNSW M (max neighbors per node). Affects index memory and search performance. | | `ef_construction` | No | Positive integer | `40` | HNSW efConstruction (candidate queue size during build). Larger gives better quality but slower build. | | `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw; `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization to reduce memory. 
| -| `pq_m` | Required when 'quantizer=pq' | Positive integer | (none) | Specifies how many subvectors are used (vector dimension dim must be divisible by pq_m). | -| `pq_nbits` | Required when 'quantizer=pq' | Positive integer | (none) | The number of bits used to represent each subvector, in faiss pq_nbits is generally required to be no greater than 24. | +| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors. The vector dimension `dim` must be divisible by `pq_m`. | +| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits used to represent each subvector. In Faiss, `pq_nbits` is generally required to be no greater than 24. | ## If You Need Cosine Similarity @@ -313,6 +313,8 @@ On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size Quantization introduces extra build-time overhead because each distance computation must decode quantized values. For 128-D vectors, build time increases with row count; SQ vs. FLAT can be up to ~10× slower to build. +For workloads dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Unlike global ANN structures such as HNSW or IVF, `pq_on_disk` is designed to accelerate vector reranking inside the filtered subset by using PQ-encoded vectors stored on disk. This makes it especially useful for multi-tenant vector search, where global ANN structures built on mixed-tenant segments may suffer recall degradation after tenant filtering. + Similarly, Doris also supports product quantization, but note that when using PQ, additional parameters need to be provided: - `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m). diff --git a/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md b/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md new file mode 100644 index 0000000000000..1c6cd204a6379 --- /dev/null +++ b/versioned_docs/version-4.x/ai/vector-search/pq-on-disk.md @@ -0,0 +1,286 @@ +--- +{ + "title": "PQ On-Disk", + "language": "en", + "description": "PQ On-Disk is a disk-backed vector reranking mode in Apache Doris. It is designed for selective filter-first workloads such as multi-tenant vector search, and uses PQ-encoded vectors to accelerate brute-force distance evaluation on filtered rows." +} +--- + + + +# PQ On-Disk in Apache Doris + +`pq_on_disk` is a vector index mode in Apache Doris for **filter-first vector search**. It stores Product Quantization (PQ) codes on disk, keeps only the PQ codebook and hot chunks in memory, and uses the compressed vectors to accelerate brute-force-style distance evaluation on rows that have already passed scalar filtering. + +This feature is especially useful in **multi-tenant vector search**. In many SaaS-style workloads, vectors from many tenants are stored together in the same segment. If you build a global `hnsw` or `ivf` index on that mixed data and then query with predicates such as `WHERE tenant_id = ?`, the ANN recall can degrade significantly because the global recall structure was built across all tenants rather than for one tenant's local subset. `pq_on_disk` avoids this problem by not depending on a global cross-tenant recall structure. Instead, Doris first applies the tenant filter, then uses PQ codes to accelerate vector scoring inside the filtered subset. 
+
+## When to Use PQ On-Disk
+
+Use `pq_on_disk` when your query pattern is usually:
+
+```sql
+WHERE <highly selective filter>
+ORDER BY l2_distance_approximate(...) LIMIT N
+```
+
+Typical examples include:
+
+- `WHERE tenant_id = ?`
+- `WHERE user_id = ?`
+- `WHERE category_id = ? AND status = 'active'`
+- `WHERE tag MATCH_ANY '...'
+  ORDER BY l2_distance_approximate(...) LIMIT N`
+
+This is a different operating point from global ANN search:
+
+- `hnsw` and `ivf` are designed for **global ANN recall** across a large vector collection.
+- `ivf_on_disk` keeps the IVF recall model but moves the main IVF data to disk to reduce memory pressure.
+- `pq_on_disk` is designed for **filtered-subset reranking**, where the candidate set is already narrowed down by ordinary predicates and Doris needs a faster way to score those rows.
+
+## Why It Helps in Multi-Tenant Search
+
+Suppose a segment contains vectors from 10,000 tenants. A global HNSW or IVF index is built over all rows in the segment. If the query is:
+
+```sql
+SELECT doc_id
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY l2_distance_approximate(embedding, <query_vector>)
+LIMIT 20;
+```
+
+The query only cares about one tenant's rows, but the global ANN structure was trained or connected using vectors from all tenants. The nearest paths, graph edges, or IVF partitions that are good for global recall are not necessarily good for recall **after tenant filtering**.
+
+`pq_on_disk` addresses this case differently:
+
+1. Doris first applies the scalar predicate such as `tenant_id = 10001`.
+2. It obtains a filtered candidate set for that tenant.
+3. Instead of computing full float32 brute-force distances on every filtered row, Doris uses PQ-encoded vectors to evaluate distances much faster.
+4. PQ code data is read from disk in rowid order and reused through a dedicated chunk cache.
+
+As a result, `pq_on_disk` is often a better fit than global ANN structures when:
+
+- the filter is highly selective,
+- recall under post-filter/global ANN is unstable,
+- and full brute-force over raw vectors is still too expensive.
+
+## Quick Start
+
+### Create a table
+
+The following example uses `tenant_id` as the main filter column:
+
+```sql
+CREATE TABLE tenant_embeddings (
+    tenant_id BIGINT NOT NULL,
+    doc_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(tenant_id, doc_id)
+DISTRIBUTED BY HASH(tenant_id) BUCKETS 8
+PROPERTIES (
+    "replication_num" = "1"
+);
+```
+
+### Basic query
+
+```sql
+SELECT doc_id,
+       l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score
+FROM tenant_embeddings
+WHERE tenant_id = 10001
+ORDER BY score ASC
+LIMIT 20;
+```
+
+This query pattern is the primary target of `pq_on_disk`: filter first, then do fast vector Top-N inside the filtered rows.
+
+## How PQ On-Disk Works
+
+At a high level:
+
+1. Doris trains a PQ codebook for the segment.
+2. Raw vectors are encoded into compact PQ codes.
+3. PQ codes are stored on disk in rowid order.
+4. At query time, Doris first evaluates ordinary predicates.
+5. For rows that survive filtering, Doris loads the corresponding PQ chunks and computes approximate distances using PQ codes instead of full raw vectors.
+
+So `pq_on_disk` is best understood as **PQ-accelerated filtered brute-force**, rather than a global ANN recall structure like HNSW or IVF.
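+
+For intuition on step 5, the following is a minimal sketch of asymmetric distance computation (ADC), the standard way PQ codes are scored. It is illustrative only (random data, plain NumPy, parameters mirroring the example DDL) and does not describe Doris internals:
+
+```python
+import numpy as np
+
+# Minimal ADC sketch: score PQ-encoded rows against one query vector.
+dim, pq_m, pq_nbits = 768, 96, 8
+sub_dim, k = dim // pq_m, 1 << pq_nbits            # 8 dims per subvector, 256 centroids
+
+rng = np.random.default_rng(0)
+codebook = rng.random((pq_m, k, sub_dim), dtype=np.float32)    # stand-in for a trained codebook
+codes = rng.integers(0, k, size=(1000, pq_m), dtype=np.uint8)  # PQ codes of 1000 filtered rows
+query = rng.random(dim, dtype=np.float32)
+
+# Precompute one lookup table per subquantizer: ||query_sub - centroid||^2.
+q_subs = query.reshape(pq_m, sub_dim)
+tables = ((codebook - q_subs[:, None, :]) ** 2).sum(axis=2)    # shape (pq_m, k)
+
+# Approximate squared L2 distance of a row = sum of pq_m table lookups.
+dists = tables[np.arange(pq_m), codes].sum(axis=1)             # shape (1000,)
+top20 = np.argsort(dists)[:20]                                 # ascending, matching L2 Top-N
+```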
+
+## User-Facing Interfaces
+
+### 1) Index DDL
+
+Use `index_type="pq_on_disk"` in ANN index properties.
+
+```sql
+CREATE TABLE image_pool (
+    user_id BIGINT NOT NULL,
+    photo_id BIGINT NOT NULL,
+    embedding ARRAY<FLOAT> NOT NULL,
+    INDEX idx_emb (embedding) USING ANN PROPERTIES (
+        "index_type" = "pq_on_disk",
+        "metric_type" = "l2_distance",
+        "dim" = "768",
+        "pq_m" = "96",
+        "pq_nbits" = "8"
+    )
+) ENGINE=OLAP
+DUPLICATE KEY(user_id, photo_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 8
+PROPERTIES ("replication_num" = "1");
+```
+
+### 2) Typical query patterns
+
+Top-N after filtering:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
+LIMIT 20;
+```
+
+Prepared-statement style query:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = ?
+ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY<FLOAT>)) ASC
+LIMIT 20;
+```
+
+For inner-product search, sort in descending order:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
+LIMIT 20;
+```
+
+Range search is also supported:
+
+```sql
+SELECT photo_id
+FROM image_pool
+WHERE user_id = 10001
+  AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
+ORDER BY photo_id;
+```
+
+## Parameters and Constraints
+
+### Index parameters
+
+| Property | Required | Default | Description |
+|---|---|---|---|
+| `index_type` | Yes | - | Must be `pq_on_disk`. |
+| `metric_type` | Yes | - | `l2_distance` or `inner_product`. |
+| `dim` | Yes | - | Vector dimension. |
+| `pq_m` | Yes | - | Number of PQ subquantizers. Must divide `dim`. |
+| `pq_nbits` | No | `8` | Number of bits per subquantizer code. |
+
+### Training behavior
+
+`pq_on_disk` needs enough rows to train the PQ codebook. The minimum training row count is:
+
+```text
+(1 << pq_nbits) * 100
+```
+
+Examples:
+
+- `pq_nbits = 8` requires at least `25600` training rows.
+- `pq_nbits = 4` requires at least `1600` training rows.
+
+If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment.
+
+## BE Cache Configuration
+
+`pq_on_disk` uses a dedicated chunk cache for PQ code data:
+
+- `ann_index_pq_chunk_cache_limit` (default: `60%`)
+- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`)
+
+The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory.
+
+## Observability
+
+`pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`.
+
+When troubleshooting, check the following first:
+
+- Whether the query is actually selective enough.
+- Whether the filtered rows have good locality.
+- Whether the PQ chunk cache is large enough to avoid repeated disk reads.
+- Whether some segments are falling back to brute force because they do not have enough rows for PQ training.
+
+## Usage Notes
+
+- `pq_on_disk` is intended for **filter-first** workloads, not for global ANN recall across the whole segment.
+- It is particularly suitable for **multi-tenant vector search** where rows from many tenants are mixed in the same segment.
+- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style queries.
+- Query result ordering must match metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order.
+- Data locality matters. It works best when rows for the same filter key are physically close so PQ chunk reads are more sequential. +- For very small segments or insufficient training data, Doris may not build the PQ index and can fall back to brute force. + +## Best Practices + +1. Choose `pq_on_disk` when the query pattern is usually **filter first, rerank second**. +2. Prefer it for **tenant-aware retrieval** such as `WHERE tenant_id = ? ORDER BY ... LIMIT N`. +3. Keep the filter column selective. The smaller the filtered candidate set, the more suitable `pq_on_disk` becomes. +4. Start with `pq_nbits = 8` unless you intentionally want a smaller code size at the cost of recall. +5. Choose `pq_m` so that `dim / pq_m` is reasonable for your model dimension and business recall target. +6. Use prepared statements for 768-D and higher query vectors to reduce SQL parsing overhead. +7. Validate on real business distributions, especially when tenant sizes are very uneven. + +## How to Choose Between `hnsw`, `ivf_on_disk`, and `pq_on_disk` + +Use `hnsw` when: + +- You need high-recall global ANN search. +- Query latency is the top priority and enough memory is available. + +Use `ivf_on_disk` when: + +- You still need a global IVF-style ANN recall model. +- Memory is limited, but the query still searches a large global vector collection. + +Use `pq_on_disk` when: + +- The query already has a highly selective scalar filter. +- Rows from different tenants or users are mixed in the same segment. +- Global ANN recall under tenant/user filtering is poor. +- You want to accelerate filtered brute-force scoring with compressed vectors. + +In short, `pq_on_disk` is not a replacement for all ANN structures. It is the right choice when the main problem is **efficient vector reranking inside a filtered subset**, especially in multi-tenant workloads. \ No newline at end of file diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json index cf0ed80b49825..1cec52e08e7b8 100644 --- a/versioned_sidebars/version-4.x-sidebars.json +++ b/versioned_sidebars/version-4.x-sidebars.json @@ -359,6 +359,8 @@ "ai/vector-search/practical-guide", "ai/vector-search/hnsw", "ai/vector-search/ivf", + "ai/vector-search/ivf-on-disk", + "ai/vector-search/pq-on-disk", "ai/vector-search/index-management", "ai/vector-search/resource-estimation", "ai/vector-search/quantization-survey",