`docs/ai/vector-search/overview.md`

- index_type: `hnsw` (for [Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (for inverted file), `ivf_on_disk` (for IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (for filter-first reranking accelerated by PQ-encoded vectors stored on disk)
- metric_type: `l2_distance` means using L2 distance as the distance function
- dim: `128` means the vector dimension is 128
- quantizer: `flat` means each vector dimension is stored as original float32

| Parameter | Required | Supported/Options | Default | Description |
|-----------|----------|-------------------|---------|-------------|
| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. Supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for selective filter-first reranking. |
| `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 = Euclidean; inner_product can approximate cosine if vectors are normalized. |
| `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All imported vectors must match or an error is raised. |
| `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values may improve recall/speed trade-offs but increase build overhead. |
| `max_degree` | No | Positive integer | `32` | HNSW M (max neighbors per node). Affects index memory and search performance. |
| `ef_construction` | No | Positive integer | `40` | HNSW efConstruction (candidate queue size during build). Larger gives better quality but slower build. |
| `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw; `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization to reduce memory. |
| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors. The vector dimension `dim` must be divisible by `pq_m`. |
| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits used to represent each subvector. In Faiss, `pq_nbits` is generally required to be no greater than 24. |
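To make the `pq_m`/`pq_nbits` trade-off concrete: a PQ code costs roughly `pq_m * pq_nbits / 8` bytes per vector, versus `dim * 4` bytes for a raw float32 vector. A back-of-envelope sketch (helper names are illustrative, not Doris APIs):

```python
import math

def pq_code_bytes(dim: int, pq_m: int, pq_nbits: int) -> int:
    # Each vector is split into pq_m subvectors; each subvector is stored as
    # a pq_nbits-bit centroid id, so the code costs pq_m * pq_nbits bits.
    if dim % pq_m != 0:
        raise ValueError("dim must be divisible by pq_m")
    return math.ceil(pq_m * pq_nbits / 8)

def raw_float32_bytes(dim: int) -> int:
    return dim * 4  # 4 bytes per float32 dimension

code = pq_code_bytes(768, 96, 8)   # 96 bytes per vector
raw = raw_float32_bytes(768)       # 3072 bytes per vector
print(code, raw, raw / code)       # 32x smaller
```

With `pq_nbits = 8` and `pq_m = 96` on 768-D vectors, each code is 96 bytes, about 32× smaller than the 3072-byte raw vector.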

## If You Need Cosine Similarity

On 768-D Cohere-MEDIUM-1M and Cohere-LARGE-10M datasets, SQ8 reduces index size

Quantization introduces extra build-time overhead because each distance computation during build must decode quantized values. For 128-D vectors, build time grows with row count; building with SQ can be up to ~10× slower than with FLAT.

For workloads dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Unlike global ANN structures such as HNSW or IVF, `pq_on_disk` is designed to accelerate vector reranking inside the filtered subset by using PQ-encoded vectors stored on disk. This makes it especially useful for multi-tenant vector search, where global ANN structures built on mixed-tenant segments may suffer recall degradation after tenant filtering.

Similarly, Doris supports product quantization; note that using PQ requires additional parameters:

- `pq_m`: Indicates how many sub-vectors to split the original high-dimensional vector into (vector dimension dim must be divisible by pq_m).
`docs/ai/vector-search/pq-on-disk.md` (new file)
---
{
"title": "PQ On-Disk",
"language": "en",
"description": "PQ On-Disk is a disk-backed vector reranking mode in Apache Doris. It is designed for selective filter-first workloads such as multi-tenant vector search, and uses PQ-encoded vectors to accelerate brute-force distance evaluation on filtered rows."
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# PQ On-Disk in Apache Doris

`pq_on_disk` is a vector index mode in Apache Doris for **filter-first vector search**. It stores Product Quantization (PQ) codes on disk, keeps only the PQ codebook and hot chunks in memory, and uses the compressed vectors to accelerate brute-force-style distance evaluation on rows that have already passed scalar filtering.

This feature is especially useful in **multi-tenant vector search**. In many SaaS-style workloads, vectors from many tenants are stored together in the same segment. If you build a global `hnsw` or `ivf` index on that mixed data and then query with predicates such as `WHERE tenant_id = ?`, the ANN recall can degrade significantly because the global recall structure was built across all tenants rather than for one tenant's local subset. `pq_on_disk` avoids this problem by not depending on a global cross-tenant recall structure. Instead, Doris first applies the tenant filter, then uses PQ codes to accelerate vector scoring inside the filtered subset.

## When to Use PQ On-Disk

Use `pq_on_disk` when your query pattern is usually:

```sql
WHERE <highly_selective_filter>
ORDER BY l2_distance_approximate(...) LIMIT N
```

Typical examples include:

- `WHERE tenant_id = ?`
- `WHERE user_id = ?`
- `WHERE category_id = ? AND status = 'active'`
- `WHERE tag MATCH_ANY '...' ORDER BY l2_distance_approximate(...) LIMIT N`

This is a different operating point from global ANN search:

- `hnsw` and `ivf` are designed for **global ANN recall** across a large vector collection.
- `ivf_on_disk` keeps the IVF recall model but moves the main IVF data to disk to reduce memory pressure.
- `pq_on_disk` is designed for **filtered-subset reranking**, where the candidate set is already narrowed down by ordinary predicates and Doris needs a faster way to score those rows.

## Why It Helps in Multi-Tenant Search

Suppose a segment contains vectors from 10,000 tenants. A global HNSW or IVF index is built over all rows in the segment. If the query is:

```sql
SELECT doc_id
FROM tenant_embeddings
WHERE tenant_id = 10001
ORDER BY l2_distance_approximate(embedding, <query_vector>)
LIMIT 20;
```

The query only cares about one tenant's rows, but the global ANN structure was trained or connected using vectors from all tenants. The nearest paths, graph edges, or IVF partitions that are good for global recall are not necessarily good for recall **after tenant filtering**.

`pq_on_disk` addresses this case differently:

1. Doris first applies the scalar predicate such as `tenant_id = 10001`.
2. It obtains a filtered candidate set for that tenant.
3. Instead of computing full float32 brute-force distances on every filtered row, Doris uses PQ-encoded vectors to evaluate distances much faster.
4. PQ code data is read from disk in rowid order and reused through a dedicated chunk cache.

As a result, `pq_on_disk` is often a better fit than global ANN structures when:

- the filter is highly selective,
- recall under post-filter/global ANN is unstable,
- and full brute-force over raw vectors is still too expensive.

## Quick Start

### Create a table

The following example uses `tenant_id` as the main filter column:

```sql
CREATE TABLE tenant_embeddings (
tenant_id BIGINT NOT NULL,
doc_id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_embedding (embedding) USING ANN PROPERTIES (
"index_type" = "pq_on_disk",
"metric_type" = "l2_distance",
"dim" = "768",
"pq_m" = "96",
"pq_nbits" = "8"
)
) ENGINE=OLAP
DUPLICATE KEY(tenant_id, doc_id)
DISTRIBUTED BY HASH(tenant_id) BUCKETS 8
PROPERTIES (
"replication_num" = "1"
);
```

### Basic query

```sql
SELECT doc_id,
l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) AS score
FROM tenant_embeddings
WHERE tenant_id = 10001
ORDER BY score ASC
LIMIT 20;
```

This query pattern is the primary target of `pq_on_disk`: filter first, then do fast vector Top-N inside the filtered rows.

## How PQ On-Disk Works

At a high level:

1. Doris trains a PQ codebook for the segment.
2. Raw vectors are encoded into compact PQ codes.
3. PQ codes are stored on disk in rowid order.
4. At query time, Doris first evaluates ordinary predicates.
5. For rows that survive filtering, Doris loads the corresponding PQ chunks and computes approximate distances using PQ codes instead of full raw vectors.

So `pq_on_disk` is best understood as **PQ-accelerated filtered brute-force**, rather than a global ANN recall structure like HNSW or IVF.
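The scoring step can be sketched with a toy example of asymmetric distance computation (ADC), the standard PQ trick of replacing per-row distance math with table lookups. This is a simplified illustration with a hand-written codebook, not Doris's actual implementation:

```python
# Toy PQ scoring sketch: dim = 4, pq_m = 2 subvectors of length 2,
# pq_nbits = 1 -> 2 centroids per subspace. Illustration only; Doris's
# on-disk layout, training, and scoring are more involved.
codebook = [
    [[0.0, 0.0], [1.0, 1.0]],  # centroids for subspace 0
    [[0.0, 1.0], [1.0, 0.0]],  # centroids for subspace 1
]

def encode(vec, book):
    """Replace each subvector by the id of its nearest centroid."""
    sub = len(vec) // len(book)
    codes = []
    for i, cents in enumerate(book):
        part = vec[i * sub:(i + 1) * sub]
        codes.append(min(range(len(cents)),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(part, cents[c]))))
    return codes

def adc_tables(query, book):
    """Precompute, per subspace, squared L2 from the query to every centroid."""
    sub = len(query) // len(book)
    return [[sum((a - b) ** 2 for a, b in zip(query[i * sub:(i + 1) * sub], c))
             for c in cents]
            for i, cents in enumerate(book)]

def adc_distance(codes, tables):
    """Approximate distance = one table lookup per subspace, summed."""
    return sum(t[c] for c, t in zip(codes, tables))

codes = encode([1.0, 1.0, 0.9, 0.1], codebook)       # -> [1, 1]
tables = adc_tables([1.0, 1.0, 1.0, 0.0], codebook)
print(adc_distance(codes, tables))                   # 0.0 (exact squared L2 is 0.02)
```

The tables are computed once per query; each filtered row then costs only `pq_m` lookups instead of a full float32 distance over `dim` dimensions, which is what makes PQ-accelerated filtered brute-force cheap.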

## User-Facing Interfaces

### 1) Index DDL

Use `index_type="pq_on_disk"` in ANN index properties.

```sql
CREATE TABLE image_pool (
user_id BIGINT NOT NULL,
photo_id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_emb (embedding) USING ANN PROPERTIES (
"index_type" = "pq_on_disk",
"metric_type" = "l2_distance",
"dim" = "768",
"pq_m" = "96",
"pq_nbits" = "8"
)
) ENGINE=OLAP
DUPLICATE KEY(user_id, photo_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
```

### 2) Typical query patterns

Top-N after filtering:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
ORDER BY l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) ASC
LIMIT 20;
```

Prepared-statement style query:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = ?
ORDER BY l2_distance_approximate(embedding, CAST(? AS ARRAY<FLOAT>)) ASC
LIMIT 20;
```

For inner-product search, sort in descending order:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
ORDER BY inner_product_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) DESC
LIMIT 20;
```

Range search is also supported:

```sql
SELECT photo_id
FROM image_pool
WHERE user_id = 10001
AND l2_distance_approximate(embedding, array_repeat(CAST(0.01 AS FLOAT), 768)) < 500.0
ORDER BY photo_id;
```

## Parameters and Constraints

### Index parameters

| Property | Required | Default | Description |
|---|---|---|---|
| `index_type` | Yes | - | Must be `pq_on_disk`. |
| `metric_type` | Yes | - | `l2_distance` or `inner_product`. |
| `dim` | Yes | - | Vector dimension. |
| `pq_m` | Yes | - | Number of PQ subquantizers. Must divide `dim`. |
| `pq_nbits` | No | `8` | Number of bits per subquantizer code. |

### Training behavior

`pq_on_disk` needs enough rows to train the PQ codebook. The minimum training row count is:

```text
(1 << pq_nbits) * 100
```

Examples:

- `pq_nbits = 8` requires at least `25600` training rows.
- `pq_nbits = 4` requires at least `1600` training rows.

If a segment does not have enough rows to train the PQ index, Doris can fall back to brute-force search for that segment.
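The threshold follows directly from the rule above and can be computed for any `pq_nbits`:

```python
def min_training_rows(pq_nbits: int) -> int:
    # Each subspace codebook has 2**pq_nbits centroids, and training
    # wants roughly 100 points per centroid.
    return (1 << pq_nbits) * 100

print(min_training_rows(8))  # 25600
print(min_training_rows(4))  # 1600
```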

## BE Cache Configuration

`pq_on_disk` uses a dedicated chunk cache for PQ code data:

- `ann_index_pq_chunk_cache_limit` (default: `60%`)
- `ann_index_pq_chunk_cache_stale_sweep_time_sec` (default: `1800`)

The percentage value of `ann_index_pq_chunk_cache_limit` is based on process-available memory (`mem_limit`), not total machine memory.
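For capacity planning, the cache budget implied by these settings can be estimated from the BE's `mem_limit`; a sketch (the exact internal accounting in Doris may differ):

```python
def pq_chunk_cache_bytes(mem_limit_bytes: int, limit_percent: int = 60) -> int:
    # ann_index_pq_chunk_cache_limit is a percentage of process-available
    # memory (mem_limit), not of total machine memory.
    return mem_limit_bytes * limit_percent // 100

mem_limit = 32 * 1024**3          # e.g. a BE configured with mem_limit = 32 GiB
print(pq_chunk_cache_bytes(mem_limit))  # ~19.2 GiB
```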

## Observability

`pq_on_disk` introduces a dedicated BE cache named `AnnIndexPqChunkCache`.

When troubleshooting, check the following first:

- Whether the query is actually selective enough.
- Whether the filtered rows have good locality.
- Whether the PQ chunk cache is large enough to avoid repeated disk reads.
- Whether some segments are falling back to brute force because they do not have enough rows for PQ training.

## Usage Notes

- `pq_on_disk` is intended for **filter-first** workloads, not for global ANN recall across the whole segment.
- It is particularly suitable for **multi-tenant vector search** where rows from many tenants are mixed in the same segment.
- It supports both `l2_distance` and `inner_product`, including Top-N and range-search style queries.
- Query result ordering must match metric semantics: `l2_distance_approximate` uses ascending order, while `inner_product_approximate` uses descending order.
- Data locality matters. It works best when rows for the same filter key are physically close so PQ chunk reads are more sequential.
- For very small segments or insufficient training data, Doris may not build the PQ index and can fall back to brute force.

## Best Practices

1. Choose `pq_on_disk` when the query pattern is usually **filter first, rerank second**.
2. Prefer it for **tenant-aware retrieval** such as `WHERE tenant_id = ? ORDER BY ... LIMIT N`.
3. Keep the filter column selective. The smaller the filtered candidate set, the more suitable `pq_on_disk` becomes.
4. Start with `pq_nbits = 8` unless you intentionally want a smaller code size at the cost of recall.
5. Choose `pq_m` so that `dim / pq_m` is reasonable for your model dimension and business recall target.
6. Use prepared statements for 768-D and higher query vectors to reduce SQL parsing overhead.
7. Validate on real business distributions, especially when tenant sizes are very uneven.
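Point 5 can be checked mechanically: the candidate `pq_m` values are exactly the divisors of `dim`. A small planning helper (hypothetical, not a Doris API):

```python
def valid_pq_m_values(dim: int, lo: int = 8, hi: int = 256):
    # pq_m must divide dim; each subvector then spans dim // pq_m dimensions.
    return [m for m in range(lo, hi + 1) if dim % m == 0]

print(valid_pq_m_values(768))
# e.g. 96 is valid (768 / 96 = 8 dims per subvector); 100 is not.
```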

## How to Choose Between `hnsw`, `ivf_on_disk`, and `pq_on_disk`

Use `hnsw` when:

- You need high-recall global ANN search.
- Query latency is the top priority and enough memory is available.

Use `ivf_on_disk` when:

- You still need a global IVF-style ANN recall model.
- Memory is limited, but the query still searches a large global vector collection.

Use `pq_on_disk` when:

- The query already has a highly selective scalar filter.
- Rows from different tenants or users are mixed in the same segment.
- Global ANN recall under tenant/user filtering is poor.
- You want to accelerate filtered brute-force scoring with compressed vectors.

In short, `pq_on_disk` is not a replacement for all ANN structures. It is the right choice when the main problem is **efficient vector reranking inside a filtered subset**, especially in multi-tenant workloads.
- index_type: `hnsw` ([Hierarchical Navigable Small World](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world)), `ivf` (inverted file index), `ivf_on_disk` (IVF with inverted lists stored on disk and served through cache), or `pq_on_disk` (PQ-encoded vectors stored on disk to accelerate post-filter vector reranking)
- metric_type: l2_distance means L2 distance is used as the distance function
- dim: 128 means the vector dimension is 128
- quantizer: flat means each vector dimension is stored as raw float32


| Parameter | Required | Supported/Options | Default | Description |
|------|----------|-------------|--------|------|
| `index_type` | Yes | `hnsw`, `ivf`, `ivf_on_disk`, `pq_on_disk` | (none) | ANN index algorithm. Currently supports HNSW, in-memory IVF, IVF On-Disk, and PQ On-Disk for reranking after highly selective filtering. |
| `metric_type` | Yes | `l2_distance`, `inner_product` | (none) | Vector similarity/distance metric. L2 is Euclidean distance; for cosine similarity, normalize vectors first and use inner_product. |
| `dim` | Yes | Positive integer (> 0) | (none) | Vector dimension. All subsequently imported vectors must match this dimension or an error is raised. |
| `nlist` | No | Positive integer | `1024` | IVF inverted-list count. Effective when `index_type=ivf` or `index_type=ivf_on_disk`; larger values generally help the recall/speed trade-off but increase build overhead. |
| `max_degree` | No | Positive integer | `32` | Maximum neighbors per node in the HNSW graph (M); affects index memory and search performance. |
| `ef_construction` | No | Positive integer | `40` | Candidate queue size during HNSW build (efConstruction); larger values give better graph quality but slower builds. |
| `quantizer` | No | `flat`, `sq8`, `sq4`, `pq` | `flat` | Vector encoding/quantization: `flat` = raw storage, `sq8`/`sq4` = scalar quantization (8/4 bit), `pq` = product quantization. |
| `pq_m` | Required when `quantizer=pq` or `index_type=pq_on_disk` | Positive integer | (none) | Number of subvectors the original high-dimensional vector is split into (the vector dimension `dim` must be divisible by `pq_m`). |
| `pq_nbits` | Required when `quantizer=pq`; optional when `index_type=pq_on_disk` | Positive integer | `8` for `pq_on_disk` | Number of bits per subvector code; it determines the codebook size per subspace (k = 2 ^ pq_nbits). In Faiss, `pq_nbits` is generally required to be no greater than 24. |

## If Your Business Needs Cosine Similarity


Quantization adds extra build overhead because the build phase performs many distance computations, each of which must decode quantized values. For 128-dimensional vectors, build time rises with row count; SQ can introduce roughly 10× the build cost of FLAT.

For queries dominated by highly selective filters such as `tenant_id = ?` or `user_id = ?`, Doris also provides [`pq_on_disk`](./pq-on-disk.md). Instead of building a global-recall structure like HNSW/IVF, it uses PQ-encoded vectors on disk to accelerate vector reranking over the filtered candidate set. This is especially valuable for multi-tenant vector search: when a segment mixes data from many tenants, a global ANN structure may lose recall once a specific tenant is filtered, whereas `pq_on_disk` is built for this "filter first, rerank second" pattern.

Similarly, Doris supports product quantization; note that using PQ requires additional parameters:

- `pq_m`: Specifies how many subvectors the original high-dimensional vector is split into (the vector dimension dim must be divisible by pq_m).