diff --git a/docs/sql-manual/basic-element/sql-data-types/data-type-overview.md b/docs/sql-manual/basic-element/sql-data-types/data-type-overview.md index d2cbb092323d5..9102694cafe37 100644 --- a/docs/sql-manual/basic-element/sql-data-types/data-type-overview.md +++ b/docs/sql-manual/basic-element/sql-data-types/data-type-overview.md @@ -102,3 +102,9 @@ IP data types store IP addresses in a binary format, which is faster and more sp - **[IPv4](../sql-data-types/ip/IPV4.md)**: It stores IPv4 addresses as a 4-byte binary value. It is used in conjunction with the `ipv4_*` family of functions. - **[IPv6](../sql-data-types/ip/IPV6.md)**: It stores IPv6 addresses as a 16-byte binary value. It is used in conjunction with the `ipv6_*` family of functions. +## File Type + +The FILE type is a semantic data type that represents object storage file metadata. It stores a fixed-schema struct (URI, file name, content type, size, credentials) in JSON binary format. + +- **[FILE](../sql-data-types/semi-structured/FILE.md)**: Represents a file in object storage (S3, OSS, COS, OBS, HDFS). FILE type can only be used in [Fileset Tables](../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (ENGINE = fileset). It is designed to work with AI functions such as `embed()` for multimodal data processing. 
+ diff --git a/docs/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md b/docs/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md new file mode 100644 index 0000000000000..1bddd2fb84ddc --- /dev/null +++ b/docs/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md @@ -0,0 +1,133 @@ +--- +{ + "title": "FILE | Semi Structured", + "language": "en", + "description": "The FILE type is a semantic data type that represents object storage file metadata, enabling Doris to handle file references with built-in metadata awareness.", + "sidebar_label": "FILE" +} +--- + +# FILE + +## Overview + +The FILE type is a semantic first-class data type that represents object storage file metadata. It stores a fixed-schema struct describing a remote file (URI, name, content type, size, credentials, etc.). + +FILE is designed to work with the [Fileset Table](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) engine and the [`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) function, enabling Doris to manage, query, and process files in object storage systems like S3, OSS, COS, and OBS. 
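+
+As a minimal sketch of that workflow (the bucket name, region, and credentials below are placeholders; the full syntax is described in the sections that follow):
+
+```sql
+-- Declare a Fileset Table over an object storage prefix; the engine
+-- materializes one FILE value per matching object at query time.
+CREATE TABLE demo_files (
+    `file` FILE NULL
+) ENGINE = fileset
+PROPERTIES (
+    'location' = 's3://my-bucket/docs/*',
+    's3.region' = 'us-east-1',
+    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
+    's3.access_key' = 'AKIA...',
+    's3.secret_key' = 'wJa...'
+);
+
+-- Each row is one FILE value carrying that object's metadata.
+SELECT `file` FROM demo_files;
+```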
+ +## Internal Schema + +Each FILE value is a JSONB object with the following fixed fields: + +| Field | Type | Nullable | Description | +|-------|------|----------|-------------| +| `uri` | VARCHAR(4096) | No | Normalized object storage URI (e.g., `s3://bucket/path/file.csv`) | +| `file_name` | VARCHAR(512) | No | File name extracted from the URI | +| `content_type` | VARCHAR(128) | No | MIME type auto-detected from file extension | +| `size` | BIGINT | No | File size in bytes | +| `region` | VARCHAR(64) | Yes | Cloud region (e.g., `us-east-1`) | +| `endpoint` | VARCHAR(256) | Yes | Object storage endpoint URL | +| `ak` | VARCHAR(256) | Yes | Access key for S3-compatible storage | +| `sk` | VARCHAR(256) | Yes | Secret key for S3-compatible storage | +| `role_arn` | VARCHAR(256) | Yes | AWS IAM role ARN for cross-account access | +| `external_id` | VARCHAR(256) | Yes | External ID for role assumption | + +## Type Constraints + +- FILE type can **only** be used in [Fileset Tables](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (tables with `ENGINE = fileset`). It **cannot** be used as a column in regular OLAP tables or other table engines. +- Fileset Tables are **read-only**. `INSERT`, `UPDATE`, and `DELETE` operations are **not supported**. FILE values are automatically materialized by the Fileset engine at query time. +- FILE type columns do **not** support the following operations: + - `ORDER BY` + - `GROUP BY` + - `DISTINCT` + - Aggregate functions (`MIN`, `MAX`, `COUNT`, `SUM`, etc.) + - `JOIN` equality conditions + - Window function `PARTITION BY` / `ORDER BY` + - Index creation +- FILE type must be used with specific functions (e.g., `TO_FILE`, `AI Functions`) or in the context of a Fileset Table. 
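+
+To make the constraints above concrete, here is a sketch of what is and is not accepted, assuming a hypothetical Fileset Table named `my_files` with a FILE column `file` (the error behavior is illustrative):
+
+```sql
+-- OK: project the FILE column, or pass it to supported functions
+SELECT `file` FROM my_files;
+
+-- Rejected: FILE columns cannot be sorted, grouped, or aggregated
+SELECT `file` FROM my_files ORDER BY `file`;  -- error
+SELECT COUNT(`file`) FROM my_files;           -- error
+
+-- Rejected: Fileset Tables are read-only
+INSERT INTO my_files VALUES (NULL);           -- error
+```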
+ +## Constructing FILE Values + +### Using a Fileset Table (Primary Method) + +A [Fileset Table](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) automatically materializes FILE values by listing files in an object storage location. This is the primary way to work with FILE values: + +```sql +CREATE TABLE my_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +SELECT * FROM my_files; +``` + +### Using the TO_FILE function + +Use the [`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) function to construct FILE values in a query expression. This is useful for validating individual file references or inline construction: + +```sql +SELECT to_file( + 's3://my-bucket/data/file.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + 'AKIA...', + 'wJa...' +) AS file_obj; +``` + +:::caution Note +The `to_file` function constructs FILE values for query-time use only. Since Fileset Tables are read-only, you cannot INSERT file values constructed by `to_file` into a Fileset Table. +::: + +## Supported MIME Types + +The FILE type automatically detects the MIME content type from the file extension. 
Supported mappings include: + +| Extension | Content Type | +|-----------|-------------| +| `.csv` | `text/csv` | +| `.json` | `application/json` | +| `.jsonl` | `application/x-ndjson` | +| `.parquet` | `application/x-parquet` | +| `.orc` | `application/x-orc` | +| `.avro` | `application/avro` | +| `.txt`, `.log`, `.tbl` | `text/plain` | +| `.xml` | `application/xml` | +| `.html`, `.htm` | `text/html` | +| `.pdf` | `application/pdf` | +| `.jpg`, `.jpeg` | `image/jpeg` | +| `.png` | `image/png` | +| `.gif` | `image/gif` | +| `.mp3` | `audio/mpeg` | +| `.mp4` | `video/mp4` | +| `.gz` | `application/gzip` | +| `.bz2` | `application/x-bzip2` | +| `.zst` | `application/zstd` | +| `.lz4` | `application/x-lz4` | +| `.zip` | `application/zip` | +| `.tar` | `application/x-tar` | +| Other | `application/octet-stream` | + +## Notes + +1. FILE type values are stored internally as JSONB binary format. The physical storage size per value depends on metadata content (typically 200–400 bytes). + +2. The FILE type supports URI schemes including `s3://`, `oss://`, `cos://`, `obs://`, and `hdfs://`. Non-S3 schemes (`oss://`, `cos://`, `obs://`) are normalized to `s3://` internally for compatibility. + +3. The `to_file` function validates object existence via a HEAD request to the object storage service, ensuring that the referenced file is accessible before constructing the FILE value. + +## Using FILE with AI Functions + +FILE type is designed to integrate with Doris AI functions for multimodal data processing. 
Examples: + +```sql +-- Compute image embeddings +SELECT array_size(embed("qwen_mul_embed", file)) FROM my_fileset_table; + +``` diff --git a/docs/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md b/docs/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md new file mode 100644 index 0000000000000..b7f1fc3b8d03c --- /dev/null +++ b/docs/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md @@ -0,0 +1,174 @@ +--- +{ + "title": "TO_FILE", + "language": "en", + "description": "Constructs a FILE type value from object storage URL and credentials, with automatic metadata extraction and object validation." +} +--- + + + +## Description + +Constructs a [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type value from an object storage URL and authentication credentials. This function is designed for query-time use — for example, to validate file accessibility or construct FILE values as part of query expressions. + +:::caution Note +FILE type can only be used in [Fileset Tables](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (ENGINE = fileset). You cannot INSERT `to_file()` results into regular OLAP tables. For bulk file listing, use a Fileset Table instead. +::: + +For each input, the function: + +1. Extracts metadata from the URL (file name, extension, MIME content type). +2. Validates that the object exists and is accessible via a HEAD request to the object storage service. +3. Retrieves the actual file size from the storage service. +4. Returns a FILE value containing the complete metadata. + +## Syntax + +```sql +TO_FILE(url, region, endpoint, ak, sk) +TO_FILE(url, region, endpoint, ak, sk, role_arn, external_id) +``` + +## Parameters + +| Parameter | Type | Description | +|-----------|------|-------------| +| **url** | VARCHAR | The full object storage URL of the file (e.g., `s3://bucket/path/file.csv`). 
Supported URI schemes: `s3://`, `oss://`, `cos://`, `obs://` |
+| **region** | VARCHAR | The cloud storage region (e.g., `us-east-1`, `cn-beijing`) |
+| **endpoint** | VARCHAR | The object storage service endpoint URL (e.g., `https://s3.us-east-1.amazonaws.com`). The `http://` prefix will be added automatically if missing |
+| **ak** | VARCHAR | The access key for authentication |
+| **sk** | VARCHAR | The secret key for authentication |
+| **role_arn** | VARCHAR | The AWS IAM role ARN to assume, for cross-account access (optional) |
+| **external_id** | VARCHAR | The external ID used when assuming the IAM role (optional) |
+
+## Return Value
+
+Returns a value of [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type containing the following metadata:
+
+- `uri`: Normalized object storage URI
+- `file_name`: File name extracted from the URL
+- `content_type`: MIME type auto-detected from the file extension
+- `size`: Actual file size in bytes (retrieved from the storage service)
+- `region`: Storage region
+- `endpoint`: Normalized endpoint URL
+- `ak`: Access key
+- `sk`: Secret key
+- `role_arn`: IAM role ARN (if provided)
+- `external_id`: External ID for role assumption (if provided)
+
+Returns NULL if any input parameter is NULL (propagates nullability).
+
+## Examples
+
+### Basic usage
+
+```sql
+SELECT to_file(
+    's3://my-bucket/data/report.csv',
+    'us-east-1',
+    'https://s3.us-east-1.amazonaws.com',
+    'AKIA',
+    'wJalrXUtnFE'
+);
+```
+
+```text
++--------------------------------------------------------------+
+| to_file(...)                                                 |
++--------------------------------------------------------------+
+| {"uri":"s3://my-bucket/data/report.csv","file_name":         |
+| "report.csv","content_type":"text/csv","size":1024000,       |
+| "region":"us-east-1","endpoint":"https://s3.us-east-1. 
| +| amazonaws.com","ak":"AKIA...","sk":"wJa...", | +| "role_arn":null,"external_id":null} | ++--------------------------------------------------------------+ +``` + +```sql +SELECT to_file( + 's3://my-bucket/data/report.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + '', + '', + 'arn:aws:iam::447051187841:role/lzq-role-test', + '1001' +); +``` + +```text ++--------------------------------------------------------------+ +| to_file(...) | ++--------------------------------------------------------------+ +| {"uri":"s3://my-bucket/data/report.csv","file_name": | +| "report.csv","content_type":"text/csv","size":1024000, | +| "region":"us-east-1","endpoint":"https://s3.us-east-1. | +| amazonaws.com","ak":null,"sk":null, | +| "role_arn":"arn:aws:iam::447051187841:role/lzq-role-test", | +| "external_id":"1001"} | ++--------------------------------------------------------------+ +``` + +### Using with OSS-compatible storage + +```sql +SELECT to_file( + 'oss://my-bucket/images/photo.jpg', + 'cn-beijing', + 'https://oss-cn-beijing.aliyuncs.com', + 'your_access_key', + 'your_secret_key' +); +``` + +:::tip +Non-S3 URI schemes (`oss://`, `cos://`, `obs://`) are automatically normalized to `s3://` internally for S3 SDK compatibility. +::: + +## Error Handling + +The function returns an error in the following cases: + +- **Object not accessible**: If the HEAD request to the storage service fails (e.g., object does not exist, insufficient permissions), the function returns an `InvalidArgument` error with details about the URL and the storage service error message. + +- **Client creation failure**: If the S3 client cannot be created for the given endpoint (e.g., invalid endpoint URL), the function returns an `InternalError`. + +```sql +-- This will fail if the object does not exist +SELECT to_file( + 's3://non-existent-bucket/file.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + 'AKIA...', + 'wJa...' 
+);
+-- ERROR: to_file: object 's3://non-existent-bucket/file.csv' is not accessible: ...
+```
+
+## Notes
+
+1. The function makes a network request (HEAD) to the object storage service for **each row** processed. When processing large datasets, this may impact performance.
+
+2. The endpoint URL must be accessible from the Doris BE nodes. Ensure network connectivity and firewall rules allow outbound access.
+
+3. The `content_type` is determined by the file extension only. It does not inspect the actual file content.
+
+4. For supported MIME type mappings, see the [FILE type documentation](../../../basic-element/sql-data-types/semi-structured/FILE.md#supported-mime-types).
diff --git a/docs/sql-manual/sql-functions/table-valued-functions/s3 copy.md b/docs/sql-manual/sql-functions/table-valued-functions/s3 copy.md
new file mode 100644
index 0000000000000..72c580485b29d
--- /dev/null
+++ b/docs/sql-manual/sql-functions/table-valued-functions/s3 copy.md
@@ -0,0 +1,1098 @@
+# The Apache Doris Parquet Read Flow in Detail
+
+## Table of Contents
+
+1. [Parquet File Format Overview](#1-parquet-file-format-overview)
+2. [Core Class Hierarchy](#2-core-class-hierarchy)
+3. [Overall Read Flow Overview](#3-overall-read-flow-overview)
+4. [Phase 1: ParquetReader Initialization](#4-phase-1-parquetreader-initialization)
+5. [Phase 2: File Open and Metadata Parsing](#5-phase-2-file-open-and-metadata-parsing)
+6. [Phase 3: Reader Initialization (init_reader)](#6-phase-3-reader-initialization-init_reader)
+7. [Phase 4: Column Fill Setup (set_fill_columns)](#7-phase-4-column-fill-setup-set_fill_columns)
+8. [Phase 5: Data Reading (get_next_block)](#8-phase-5-data-reading-get_next_block)
+9. [Phase 6: Row Group Filtering](#9-phase-6-row-group-filtering)
+10. [Phase 7: Page Index Filtering](#10-phase-7-page-index-filtering)
+11. [Phase 8: RowGroupReader Data Reading](#11-phase-8-rowgroupreader-data-reading)
+12. [Phase 9: Lazy Read](#12-phase-9-lazy-read)
+13. [Phase 10: Column Data Read Chain](#13-phase-10-column-data-read-chain)
+14. [Key Optimizations](#14-key-optimizations)
+15. [Call Flow Diagram](#15-call-flow-diagram)
+
+---
+
+## 1. 
Parquet File Format Overview
+
+Parquet is a columnar storage format. Its file structure is as follows:
+
+```
++---------------------------+
+| Magic Number (PAR1)       | ← 4 bytes
++---------------------------+
+| Row Group 0               |
+| +-----------------------+ |
+| | Column Chunk 0        | |
+| |   +- Data Page 0 -+   | |
+| |   +- Data Page 1 -+   | |
+| |   +- Data Page N -+   | |
+| +-----------------------+ |
+| | Column Chunk 1        | |
+| |   +- Data Page 0 -+   | |
+| |   +- Data Page N -+   | |
+| +-----------------------+ |
+| | ...                   | |
+| +-----------------------+ |
++---------------------------+
+| Row Group 1               |
+| +-----------------------+ |
+| | ...                   | |
+| +-----------------------+ |
++---------------------------+
+| ...                       |
++---------------------------+
+| Footer                    |
+| +- File Metadata ------+  |
+| | - Schema             |  |
+| | - Row Group Meta     |  |
+| |   - Column Meta      |  |
+| | - Column Index       |  |
+| | - Offset Index       |  |
+| +----------------------+  |
+| Footer Length (4B)        |
+| Magic Number (PAR1)       | ← 4 bytes
++---------------------------+
+```
+
+**Key concepts:**
+- **Row Group**: a horizontal partition of the data; each row group contains a certain number of rows.
+- **Column Chunk**: all of the data for one column within a row group.
+- **Data Page**: the unit of data within a column chunk; the smallest unit of encoding and compression.
+- **Column Index / Offset Index**: page-level index information, used for page-level predicate filtering.
+
+---
+
+## 2. 
Core Class Hierarchy
+
+The core classes involved in Parquet reading in Doris:
+
+```
+ParquetReader (vparquet_reader.h)
+  │
+  ├── FileMetaData (vparquet_file_metadata.h)
+  │     └── FieldDescriptor / FieldSchema (schema_desc.h)
+  │
+  ├── PageIndex (vparquet_page_index.h)
+  │     ├── ColumnIndex (predicate filtering: min/max/null_counts)
+  │     └── OffsetIndex (page offset lookup)
+  │
+  ├── ParquetPredicate (parquet_predicate.h)
+  │     ├── ColumnStat (row-group-level statistics)
+  │     └── PageIndexStat (page-level statistics)
+  │
+  └── RowGroupReader (vparquet_group_reader.h)
+        │
+        ├── LazyReadContext (lazy read context)
+        │
+        └── ParquetColumnReader (vparquet_column_reader.h)
+              │
+              ├── ScalarColumnReader (scalar column reading)
+              │
+              └── ColumnChunkReader (vparquet_column_chunk_reader.h)
+                    │
+                    ├── PageReader (vparquet_page_reader.h)
+                    │
+                    ├── LevelDecoder (level_decoder.h)
+                    │     ├── RepetitionLevel Decoder
+                    │     └── DefinitionLevel Decoder
+                    │
+                    └── Decoder (decoder.h)
+                          ├── FixLengthPlainDecoder
+                          ├── ByteArrayPlainDecoder
+                          ├── ByteArrayDictDecoder
+                          ├── BoolPlainDecoder
+                          ├── BoolRLEDecoder
+                          ├── DeltaBitPackDecoder
+                          └── ByteStreamSplitDecoder
+```
+
+---
+
+## 3. Overall Read Flow Overview
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ Upper-level caller (Scanner)                                    │
+│  1. Create the ParquetReader                                    │
+│  2. ParquetReader::init_reader()                                │
+│  3. ParquetReader::set_fill_columns()                           │
+│  4. Loop: ParquetReader::get_next_block() until EOF             │
+│  5. 
ParquetReader::close()                                      │
+└─────────────────┬───────────────────────────────────────────────┘
+                  │
+                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ ParquetReader (file level)                                      │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ _open_file()                                             │   │
+│  │   ├── create the FileReader                              │   │
+│  │   └── parse the footer → FileMetaData                    │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ _next_row_group_reader()  [iterates over each row group] │   │
+│  │   ├── _is_misaligned_range_group() → range alignment     │   │
+│  │   ├── _process_min_max_bloom_filter() → row group filter │   │
+│  │   │     ├── _process_column_stat_filter()                │   │
+│  │   │     │     ├── min/max statistics filtering           │   │
+│  │   │     │     └── bloom filter filtering                 │   │
+│  │   │     └── _process_page_index_filter()                 │   │
+│  │   │           ├── parse the OffsetIndex                  │   │
+│  │   │           ├── parse the ColumnIndex                  │   │
+│  │   │           └── page-level min/max filtering           │   │
+│  │   ├── create MergeRangeFileReader (IO optimization)      │   │
+│  │   └── create RowGroupReader                              │   │
+│  └──────────────────────────────────────────────────────────┘   │
+└─────────────────┬───────────────────────────────────────────────┘
+                  │
+                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ RowGroupReader (row group level)                                │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ init()                                                   │   │
+│  │   ├── create a ParquetColumnReader for each column       │   │
+│  │   ├── dictionary filter (dict filter) initialization     │   │
+│  │   └── predicate rewriting (_rewrite_dict_predicates)     │   │
+│  └──────────────────────────────────────────────────────────┘   │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ next_batch()                                             │   │
+│  │   ├── non lazy read: read all columns → filter → return  │   │
+│  │   └── lazy read:                                         │   │
+│  │         ├── 1. read the predicate columns                │   │
+│  │         ├── 2. run filters to produce a FilterMap        │   │
+│  │         ├── 3. read the remaining columns via the        │   │
+│  │         │      FilterMap (skipping unneeded rows)        │   │
+│  │         └── 4. 
return the results                                │   │
+│  └──────────────────────────────────────────────────────────┘   │
+└─────────────────┬───────────────────────────────────────────────┘
+                  │
+                  ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ ParquetColumnReader → ColumnChunkReader                         │
+│                                                                 │
+│  ┌──────────────────────────────────────────────────────────┐   │
+│  │ read_column_data()                                       │   │
+│  │   ├── PageReader::next_page() → read the page header     │   │
+│  │   ├── load_page_data() → load the page data              │   │
+│  │   ├── LevelDecoder → decode rep/def levels               │   │
+│  │   └── Decoder::decode_values() → decode the values       │   │
+│  └──────────────────────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## 4. Phase 1: ParquetReader Initialization
+
+### Constructor
+
+`ParquetReader` has four constructors for different scenarios (with/without a Profile, with/without an IOContext holder). The core initialization steps:
+
+```cpp
+ParquetReader::ParquetReader(RuntimeProfile* profile, const TFileScanRangeParams& params,
+                             const TFileRangeDesc& range, ...) {
+    // 1. Set the batch size (at least _MIN_BATCH_SIZE)
+    _batch_size = std::max(batch_size, _MIN_BATCH_SIZE);
+
+    // 2. Initialize the scan range
+    _range_start_offset = range.start_offset;
+    _range_size = range.size;
+
+    // 3. Initialize filter switches from the query options
+    _enable_filter_by_min_max = state->query_options().enable_parquet_filter_by_min_max;
+    _enable_filter_by_bloom_filter = state->query_options().enable_parquet_filter_by_bloom_filter;
+
+    // 4. Initialize profile counters
+    _init_profile();
+
+    // 5. Initialize file system properties (HDFS/S3/local, etc.)
+    _init_system_properties();
+
+    // 6. 
Initialize the file description (path, size, etc.)
+    _init_file_description();
+}
+```
+
+### Profile Metrics
+
+`_init_profile()` initializes a large set of performance counters used to monitor each stage of the read:
+
+| Category | Metric | Description |
+|---------|--------|------|
+| **Row group filtering** | `RowGroupsFiltered` | Total number of row groups filtered out |
+| | `RowGroupsFilteredByMinMax` | Row groups filtered by min/max statistics |
+| | `RowGroupsFilteredByBloomFilter` | Row groups filtered by bloom filters |
+| | `RowGroupsReadNum` | Row groups actually read |
+| | `RowGroupsTotalNum` | Total number of row groups |
+| **Row filtering** | `FilteredRowsByGroup` | Rows filtered at the row group level |
+| | `FilteredRowsByPage` | Rows filtered at the page level |
+| | `FilteredRowsByLazyRead` | Rows filtered by lazy read |
+| **IO** | `FilteredBytes` | Bytes filtered out |
+| | `FileNum` | Number of files opened |
+| **Time** | `ColumnReadTime` | Column data read time |
+| | `ParseMetaTime` | Metadata parse time |
+| | `ParseFooterTime` | Footer parse time |
+| | `PageIndexFilterTime` | Page index filter time |
+| | `RowGroupFilterTime` | Row group filter time |
+| **Decoding** | `DecodeValueTime` | Value decode time |
+| | `DecodeDictTime` | Dictionary decode time |
+| | `DecodeLevelTime` | Level decode time |
+| | `DecodeNullMapTime` | Null map decode time |
+| **Page cache** | `PageCacheHitCount` | Page cache hits |
+| | `PageCacheMissingCount` | Page cache misses |
+
+---
+
+## 5. Phase 2: File Open and Metadata Parsing
+
+### `_open_file()` Flow
+
+```
+_open_file()
+  │
+  ├── 1. Create the FileReader
+  │     └── DelegateReader::create_file_reader()
+  │           ├── create the reader matching the file system type (HDFS/S3/local, etc.)
+  │           └── wrap it in a TracingFileReader (for IO statistics)
+  │
+  └── 2. Parse the footer (FileMetaData)
+        │
+        ├── check file size > 4 bytes (magic number)
+        │
+        ├── is a MetaCache available?
+        │     ├── yes → fetch from the cache
+        │     │         └── _meta_cache->lookup()
+        │     └── no  → parse the Thrift footer
+        │               └── parse_thrift_footer()
+        │                     ├── read the footer length from the file tail
+        │                     ├── read the footer bytes
+        │                     ├── deserialize the Thrift structure
+        │                     └── build the FileMetaData object
+        │                           └── init_schema() → parse the schema
+        │
+        └── verify _file_metadata != nullptr
+```
+
+### Footer Parsing Details
+
+The Parquet footer sits at the end of the file and contains:
+1. **Schema**: column names, types, nesting structure, and so on.
+2. **Row Group Metadata**: metadata for each row group (row count, size, per-column statistics).
+3. 
**Column Metadata**: each column's encoding, compression, dictionary page offset, and so on.
+
+The parsed schema is stored in a `FieldDescriptor`, which contains:
+- `FieldSchema`: the definition of each field, including its name, type, and nesting level.
+- `physical_column_index`: the index of the physical column within the Parquet file.
+- `definition_level` / `repetition_level`: level definitions for nested types.
+
+---
+
+## 6. Phase 3: Reader Initialization (init_reader)
+
+### `init_reader()` Flow
+
+```cpp
+Status ParquetReader::init_reader(
+    const std::vector<std::string>& all_column_names,   // all column names needed by the query
+    std::unordered_map<...>* col_name_to_block_idx,     // column name → block index mapping
+    const VExprContextSPtrs& conjuncts,                 // all conjuncts (WHERE predicates)
+    phmap::flat_hash_map<...>& slot_id_to_predicates,   // column predicates
+    const TupleDescriptor* tuple_descriptor,            // table schema descriptor
+    ...
+    std::shared_ptr<...> table_info_node_ptr,           // schema change info
+    bool filter_groups,                                 // whether to filter row groups
+    const std::set<...>& column_ids,                    // IDs of the columns to read
+    const std::set<...>& filter_column_ids              // IDs of the columns to filter on
+) {
+```
+
+### Core Steps
+
+```
+init_reader()
+  │
+  ├── 1. _open_file() → open the file + parse the footer
+  │
+  ├── 2. Get the total number of row groups
+  │     └── _total_groups = _t_metadata->row_groups.size()
+  │
+  ├── 3. Column mapping (table columns ↔ file columns)
+  │     │
+  │     ├── iterate over the query's all_column_names
+  │     │     ├── present in the file → required_file_columns
+  │     │     └── absent from the file → _missing_cols
+  │     │
+  │     └── iterate over the file schema
+  │           └── match against required_file_columns
+  │                 ├── _read_file_columns      (column names in the file)
+  │                 ├── _read_table_columns     (column names in the table)
+  │                 └── _read_table_columns_set (for fast lookup)
+  │
+  └── 4. Save the predicate info
+        ├── _lazy_read_ctx.conjuncts = conjuncts
+        └── _lazy_read_ctx.slot_id_to_predicates = slot_id_to_predicates
+```
+
+### Column Mapping Notes
+
+Because of schema changes (such as column renames), table column names and file column names may differ. `TableSchemaChangeHelper::Node` maintains this mapping:
+
+```
+Table Column Name ──(table_info_node_ptr)──> File Column Name
+  "user_id"     ──────────────────────>  "uid"
+  "user_name"   ──────────────────────>  "user_name"
+  "new_col"     ──────────────────────>  (absent → missing_col)
+```
+
+---
+
+## 7. Phase 4: Column Fill Setup (set_fill_columns)
+
+### `set_fill_columns()` Flow
+
+This method configures how partition columns and missing columns are filled, and decides whether to enable the lazy read optimization.
+
+```
+set_fill_columns()
+  │
+  ├── 1. 
Set the partition and missing columns
+  │     ├── _lazy_read_ctx.fill_partition_columns = partition_columns
+  │     └── _lazy_read_ctx.fill_missing_columns = missing_columns
+  │
+  ├── 2. Collect the predicate columns (predicate_columns)
+  │     └── iterate over all conjuncts
+  │           └── visit_slot() recursively collects VSlotRefs
+  │                 → predicate_columns[col_name] = (col_id, slot_id)
+  │
+  ├── 3. Build the push-down predicates (for row group filtering)
+  │     └── iterate over slot_id_to_predicates
+  │           ├── check the column exists in the file (_exists_in_file)
+  │           ├── check the type matches (_type_matches)
+  │           └── build AndBlockColumnPredicate → _push_down_predicates
+  │
+  ├── 4. Column classification (decides the lazy read strategy)
+  │     │
+  │     ├── for each column in _read_table_columns:
+  │     │     ├── has a predicate → predicate_columns (read first)
+  │     │     └── no predicate   → lazy_read_columns (read later)
+  │     │
+  │     ├── for partition columns:
+  │     │     ├── has a predicate → predicate_partition_columns
+  │     │     └── no predicate   → partition_columns
+  │     │
+  │     └── for missing columns:
+  │           ├── has a predicate → predicate_missing_columns
+  │           └── no predicate   → missing_columns
+  │
+  └── 5. Decide whether lazy read is possible
+        └── can_lazy_read = _enable_lazy_mat
+                         && predicate_columns.size() > 0
+                         && lazy_read_columns.size() > 0
+```
+
+### Lazy Read Column Classification Example
+
+```
+SELECT a, b, c, d FROM table WHERE a > 10 AND b = 'hello';
+
+┌──────────────────────────────────────────────────┐
+│ predicate_columns: [a, b]  ← read first + filter │
+│ lazy_read_columns: [c, d]  ← read later          │
+│ can_lazy_read: true                              │
+└──────────────────────────────────────────────────┘
+```
+
+---
+
+## 8. Phase 5: Data Reading (get_next_block)
+
+### `get_next_block()` Main Loop
+
+```cpp
+Status ParquetReader::get_next_block(Block* block, size_t* read_rows, bool* eof) {
+    // 1. If there is no current RowGroupReader, or the current row group is exhausted
+    if (_current_group_reader == nullptr || _row_group_eof) {
+        // get the reader for the next row group
+        Status st = _next_row_group_reader();
+        // if there are no more row groups → EOF
+    }
+
+    // 2. COUNT aggregate optimization
+    if (_push_down_agg_type == TPushAggOp::type::COUNT) {
+        // return the row count directly without reading the actual data
+        auto rows = std::min(remaining_rows, batch_size);
+        // resize the columns and return
+    }
+
+    // 3. Read data
+    _current_group_reader->next_batch(block, _batch_size, read_rows, &_row_group_eof);
+
+    // 4. 
Collect statistics once the row group is exhausted
+    if (_row_group_eof) {
+        _column_statistics.merge(column_st);
+        // accumulate lazy_read_filtered_rows, predicate_filter_time, etc.
+    }
+}
+```
+
+---
+
+## 9. Phase 6: Row Group Filtering
+
+### `_process_min_max_bloom_filter()` Flow
+
+This is the main entry point for row-group-level filtering. Filters run in the following order:
+
+```
+_process_min_max_bloom_filter()
+  │
+  ├── no filtering needed? (_filter_groups == false)
+  │     └── return the whole row group's row range directly
+  │
+  ├── row-id read mode? (_read_by_rows)
+  │     └── derive the read range from _row_ids
+  │
+  └── normal filtering mode
+        │
+        ├── Step 1: _process_column_stat_filter()
+        │     │
+        │     ├── min/max statistics filtering
+        │     │     ├── read min/max values from the column metadata
+        │     │     └── evaluate predicates to decide whether the row group can be skipped
+        │     │
+        │     └── bloom filter filtering
+        │           ├── read the bloom filter data from the file
+        │           └── check whether the predicate values are present in the bloom filter
+        │
+        └── Step 2: if the row group was not filtered out
+              └── _process_page_index_filter()
+                    └── fine-grained page-level filtering (see the next section)
+```
+
+### Min/Max Statistics Filtering Details
+
+```
+_process_column_stat_filter()
+  │
+  ├── for each push_down_predicate:
+  │     │
+  │     ├── get_stat_func(stat, column_id):
+  │     │     ├── find the matching file column
+  │     │     ├── fetch the statistics from the column metadata
+  │     │     └── ParquetPredicate::read_column_stats()
+  │     │           ├── read min_value, max_value
+  │     │           ├── read null_count, num_values
+  │     │           └── handle sort_order (signed/unsigned/undefined)
+  │     │
+  │     ├── get_bloom_filter_func(stat, column_id):
+  │     │     ├── check that bloom_filter_offset exists
+  │     │     ├── check that the data type supports bloom filters
+  │     │     ├── check the cache (avoid re-reading the same column)
+  │     │     └── ParquetPredicate::read_bloom_filter()
+  │     │           ├── read the data at the file's bloom_filter_offset
+  │     │           └── build a ParquetBlockSplitBloomFilter
+  │     │
+  │     └── predicate->evaluate_and(stat):
+  │           ├── returns false → filter_group = true
+  │           └── record whether min_max or the bloom filter did the filtering
+  │
+  └── return the filter_group result
+```
+
+---
+
+## 10. Phase 7: Page Index Filtering
+
+### `_process_page_index_filter()` Flow
+
+Page index filtering is a finer-grained pass that runs after a row group has survived row-group-level filtering.
+
+```
+_process_page_index_filter()
+  │
+  ├── preconditions
+  │     ├── config::enable_parquet_page_index == false? → read the whole row group
+  │     ├── _colname_to_slot_id == nullptr?             → read the whole row group
+  │     └── no page index range info? 
→ read the whole row group
+  │
+  ├── collect all physical column IDs to read (parquet_col_ids)
+  │     └── recursively handle complex types (Array/Map/Struct)
+  │
+  ├── parse the offset index (always executed)
+  │     │   ★ since PR #55795, the offset index is always parsed
+  │     ├── read the offset index data from the file
+  │     ├── parse a tparquet::OffsetIndex for each column
+  │     └── store it into _col_offsets[parquet_col_id]
+  │
+  ├── check whether min/max filtering is needed
+  │     └── !_enable_filter_by_min_max || push_down_pred.empty()?
+  │           → read the whole row group (the offset index has already been parsed)
+  │
+  ├── read the column index
+  │     └── bulk-read the column index data from the file
+  │
+  └── build a CachedPageIndexStat and run the filter
+        │
+        ├── build a PageIndexStat for each predicate column:
+        │     ├── parse tparquet::ColumnIndex
+        │     ├── for each page, obtain:
+        │     │     ├── min_values / max_values
+        │     │     ├── null_pages  (whether the page is all null)
+        │     │     ├── null_counts (number of null values)
+        │     │     └── the row range (first_row_index → next page's first_row_index)
+        │     └── cache the result to avoid recomputation
+        │
+        └── for each push_down_pred:
+              ├── predicate->evaluate_and(cached_page_index, &tmp_row_range)
+              └── ranges_intersection(candidate, tmp, candidate)
+                    → yields the final page row ranges that must be read
+```
+
+### Page Index Structure Example
+
+```
+Column Index:                 Offset Index:
+┌──────────────────────┐      ┌──────────────────────────────┐
+│ Page 0:              │      │ Page 0:                      │
+│  min = 1, max = 100  │      │  offset = 1000               │
+│  null_count = 0      │      │  compressed_size = 500       │
+│  null_page = false   │      │  first_row_index = 0         │
+├──────────────────────┤      ├──────────────────────────────┤
+│ Page 1:              │      │ Page 1:                      │
+│  min = 101, max=200  │      │  offset = 1500               │
+│  null_count = 5      │      │  compressed_size = 480       │
+│  null_page = false   │      │  first_row_index = 1000      │
+├──────────────────────┤      ├──────────────────────────────┤
+│ Page 2:              │      │ Page 2:                      │
+│  min = N/A, max=N/A  │      │  offset = 1980               │
+│  null_count = 1000   │      │  compressed_size = 100       │
+│  null_page = true    │      │  first_row_index = 2000      │
+└──────────────────────┘      └──────────────────────────────┘
+
+Query: WHERE col > 150
+→ Page 0 (max=100 < 150): skipped
+→ Page 1 (max=200 > 150): must be read
+→ Page 2 (null_page):     skipped
+→ candidate_row_ranges = [{1000, 2000}]
+```
+
+---
+
+## 11. 
Phase 8: RowGroupReader Data Reading
+
+### `_next_row_group_reader()` - Creating the RowGroupReader
+
+```
+_next_row_group_reader()
+  │
+  ├── 1. Iterate over the row groups
+  │     │
+  │     ├── check range alignment (_is_misaligned_range_group)
+  │     │     └── in distributed scans, each scanner only handles row groups in its range
+  │     │
+  │     ├── run the filters (_process_min_max_bloom_filter)
+  │     │     → yields candidate_row_ranges
+  │     │
+  │     ├── compute group_size (the compressed size that must be read)
+  │     │
+  │     └── if candidate_row_ranges.count() > 0
+  │           → a readable row group was found, break
+  │
+  ├── 2. IO optimization - create the file reader
+  │     │
+  │     ├── InMemoryFileReader? → use it directly
+  │     │
+  │     └── otherwise → _generate_random_access_ranges()
+  │           ├── compute each column's read range (chunk_start → chunk_end)
+  │           ├── compute the average IO size
+  │           └── avg_io_size < SMALL_IO?
+  │                 ├── yes → MergeRangeFileReader (merges small IOs)
+  │                 └── no  → use _file_reader directly
+  │
+  └── 3. Create and initialize the RowGroupReader
+        │
+        ├── new RowGroupReader(file_reader, read_columns, row_group, ...)
+        │
+        └── _current_group_reader->init(schema, candidate_row_ranges,
+              │                         col_offsets, tuple_descriptor, ...)
+              │
+              └── inside init():
+                    ├── create a ParquetColumnReader for each read column
+                    ├── check whether dictionary filtering is feasible
+                    └── rewrite dictionary predicates (_rewrite_dict_predicates)
+```
+
+### RowGroupReader::init() Details
+
+```
+RowGroupReader::init()
+  │
+  ├── 1. Save the row ranges
+  │     ├── _read_ranges = row_ranges (from page index filtering)
+  │     └── _remaining_rows = row_ranges.count()
+  │
+  ├── 2. Create the column readers
+  │     │   buffer size = min(MAX_COLUMN_BUF, MAX_GROUP_BUF / number of columns)
+  │     │
+  │     └── for each read_table_col:
+  │           └── ParquetColumnReader::create()
+  │                 ├── get the file column name and FieldSchema
+  │                 ├── create the reader matching the type:
+  │                 │     ├── Scalar → ScalarColumnReader
+  │                 │     ├── Array  → ArrayColumnReader
+  │                 │     ├── Map    → MapColumnReader
+  │                 │     └── Struct → StructColumnReader
+  │                 └── if an offset index exists → pass in offset_index
+  │
+  ├── 3. Dictionary filter initialization
+  │     │
+  │     ├── check whether each predicate column qualifies for dictionary filtering:
+  │     │     ├── the column type is STRING/VARCHAR
+  │     │     ├── the Parquet type is BYTE_ARRAY
+  │     │     ├── the entire column chunk is dictionary-encoded
+  │     │     └── the predicate type is IN_PRED or BINARY_PRED
+  │     │
+  │     └── _dict_filter_cols records the columns eligible for dictionary filtering
+  │
+  └── 4. 
Dictionary predicate rewriting (_rewrite_dict_predicates)
+        │
+        ├── read the dictionary values into a temporary column
+        ├── run the predicates to obtain the matching dictionary codes
+        ├── build a new IN predicate (over the dictionary codes)
+        └── if nothing matches → mark the whole row group to be skipped
+```
+
+---
+
+## 12. Phase 9: Lazy Read
+
+### How Lazy Read Works
+
+Lazy read is an important optimization. The core idea: read the predicate columns first, run the filters, and then read the remaining columns only for the rows that passed.
+
+```
+Traditional read:
+  read col_a (1000 rows) → read col_b (1000 rows) → read col_c (1000 rows) → filter → return 10 rows
+
+Lazy read:
+  read col_a (1000 rows, has a predicate) → filter → 10 rows pass
+  read col_b (only 10 rows) → read col_c (only 10 rows) → return 10 rows
+```
+
+### `_do_lazy_read()` Flow
+
+```
+_do_lazy_read()
+  │
+  ├── loop (until some rows pass the filter, or EOF):
+  │     │
+  │     ├── Step 1: read the predicate columns
+  │     │     └── _read_column_data(predicate_columns, batch_size, ...)
+  │     │
+  │     ├── Step 2: fill the predicate-related partition and missing columns
+  │     │     ├── _fill_partition_columns(predicate_partition_columns)
+  │     │     └── _fill_missing_columns(predicate_missing_columns)
+  │     │
+  │     ├── Step 3: build the position delete filter (Iceberg)
+  │     │     └── _build_pos_delete_filter()
+  │     │
+  │     ├── Step 4: run the predicate filters
+  │     │     ├── resize the first column (a VExprContext optimization trick)
+  │     │     ├── VExprContext::execute_conjuncts() → result_filter
+  │     │     └── build the FilterMap
+  │     │
+  │     └── Step 5: check the filter result
+  │           ├── filter_all == true:
+  │           │     ├── clear the predicate column data
+  │           │     ├── _cached_filtered_rows += pre_read_rows
+  │           │     │     (caching consecutively skipped rows lets whole pages be skipped)
+  │           │     └── continue the loop
+  │           └── some rows passed → break
+  │
+  ├── handle the cached filtered rows
+  │     └── _rebuild_filter_map() → prepend zeros to the FilterMap
+  │
+  ├── Step 6: read the lazy columns (using the FilterMap to skip unneeded rows)
+  │     └── _read_column_data(lazy_read_columns, pre_read_rows, ..., filter_map)
+  │           └── the column readers skip filtered rows according to filter_map
+  │
+  ├── Step 7: filter the predicate column data
+  │     └── Block::filter_block_internal(block, all_predicate_col_ids, result_filter)
+  │
+  └── Step 8: fill the remaining partition and missing columns
+        ├── _fill_partition_columns(partition_columns, column_size)
+        └── _fill_missing_columns(missing_columns, column_size)
+```
+
+### How the FilterMap Works
+
+```
+FilterMap example (batch_size=10):
+
+Predicate column data: [1, 5, 3, 8, 2, 9, 4, 7, 6, 0]
+Predicate: col > 5
+
+result_filter: [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
+                         ↑     ↑     ↑  ↑
+                         8     9     7  6
+
+When the lazy columns are read with the FilterMap:
+  - rows with filter=0: skip_values()   → no data is read
+  - rows with filter=1: decode_values() → the data is read
+
+最终只读取了 4 行的 Lazy 列数据,而不是 10 行 +``` + +--- + +## 13. Phase 10: 列数据读取链路 + +### `_read_column_data()` 流程 + +``` +RowGroupReader::_read_column_data() + │ + ├── 遍历每个需要读取的列: + │ │ + │ ├── 字典过滤列特殊处理: + │ │ └── 替换列类型为 Int32(存储字典码) + │ │ + │ ├── reset_filter_map_index() + │ │ + │ └── 循环读取直到 batch_size 或 EOF: + │ └── _column_readers[col]->read_column_data( + │ column_ptr, type, table_info_node, + │ filter_map, batch_size, &loop_rows, &eof, is_dict_filter) + │ + └── 验证所有列读取的行数一致 +``` + +### ScalarColumnReader::read_column_data() 流程 + +``` +ScalarColumnReader::read_column_data() + │ + ├── 1. 确保当前 Page 数据已加载 + │ └── _chunk_reader->load_page_data_idempotent() + │ + ├── 2. 读取 Definition Levels + │ └── def_level_decoder().decode(batch_size) → _def_levels + │ + ├── 3. 读取 Repetition Levels (嵌套类型) + │ └── rep_level_decoder().decode(batch_size) → _rep_levels + │ + ├── 4. 构建 ColumnSelectVector + │ │ 根据 Definition Level 确定每个值的状态: + │ │ - CONTENT: 非空值,需要从 Decoder 读取 + │ │ - NULL_DATA: 空值 + │ │ - FILTERED_CONTENT: 被 FilterMap 过滤的非空值(跳过) + │ │ - FILTERED_NULL: 被 FilterMap 过滤的空值(跳过) + │ │ + │ └── 根据行范围 (_row_ranges) 裁剪 + │ + ├── 5. 跳过被过滤的值 + │ └── _chunk_reader->skip_values(filtered_content_count) + │ + └── 6. 解码值到 Doris Column + └── _chunk_reader->decode_values(doris_column, type, select_vector, is_dict_filter) + │ + └── Decoder::decode_values() + ├── FixLengthPlainDecoder (INT32/INT64/FLOAT/DOUBLE 等) + ├── ByteArrayPlainDecoder (STRING/BINARY) + ├── ByteArrayDictDecoder (字典编码的 STRING) + ├── BoolPlainDecoder (BOOLEAN) + ├── DeltaBitPackDecoder (INT32/INT64 Delta编码) + └── ByteStreamSplitDecoder (FLOAT/DOUBLE) +``` + +### ColumnChunkReader 内部工作流 + +``` +ColumnChunkReader 状态机: + + ┌──────────┐ + │ HEADER │ ← 初始状态 / next_page() 后 + └────┬─────┘ + │ load_page_data() + ▼ + ┌──────────┐ + │DATA_LOADED│ ← Page 数据已加载到内存 + └────┬─────┘ + │ decode_values() / skip_values() + │ (remaining_num_values == 0 时) + ▼ + ┌──────────┐ + │ PAGE_END │ → has_next_page()? 
→ next_page() → 回到 HEADER + └──────────┘ + +每个 Page 的处理: + 1. next_page() + ├── PageReader 读取 Page Header (Thrift 反序列化) + ├── 获取 compressed_page_size / uncompressed_page_size / num_values + └── 检查 Page Cache + + 2. load_page_data() + ├── 读取压缩数据 + ├── 解压缩 (如果需要) + │ └── BlockCompressionCodec::decompress() + ├── 初始化 Level Decoder (Rep + Def) + └── 初始化 Value Decoder + + 3. decode_values() / skip_values() + ├── 根据 ColumnSelectVector 决定读取/跳过 + └── 更新 remaining_num_values +``` + +### PageReader 与 Page Cache + +``` +PageReader::next_page() + │ + ├── 有 Offset Index? + │ ├── 是 → 根据 Offset Index 直接定位 Page + │ │ └── 支持跳过不需要的 Page (skip_page) + │ └── 否 → 顺序读取所有 Page Header + │ + └── Page Cache 检查 (enable_parquet_file_page_cache) + │ + ├── 检查是否缓存解压后数据 + │ └── should_cache_decompressed() + │ └── 压缩比 <= parquet_page_cache_decompress_threshold? + │ ├── 是 → 缓存解压后数据 + │ └── 否 → 缓存压缩数据 + │ + ├── 缓存命中 → 直接使用缓存数据 + └── 缓存未命中 → 从文件读取 + 写入缓存 +``` + +--- + +## 14. 关键优化机制 + +### 14.1 多级过滤 + +``` +┌───────────────────────────────────────────┐ +│ Level 1: Row Group Min/Max 过滤 │ 粗粒度,效率最高 +│ ← 基于 Column Metadata 的统计信息 │ +├───────────────────────────────────────────┤ +│ Level 2: Row Group Bloom Filter 过滤 │ 精确判断值是否存在 +│ ← 从文件读取 Bloom Filter 数据 │ +├───────────────────────────────────────────┤ +│ Level 3: Page Index Min/Max 过滤 │ 细粒度到 Page 级别 +│ ← 基于 Column Index 的 Page 级别统计 │ +├───────────────────────────────────────────┤ +│ Level 4: Dict Filter 过滤 │ 字典编码列的高效过滤 +│ ← 基于字典值重写谓词 │ +├───────────────────────────────────────────┤ +│ Level 5: Lazy Read 过滤 │ 读取时的行级别过滤 +│ ← 先读谓词列过滤,再读其余列 │ +├───────────────────────────────────────────┤ +│ Level 6: Position Delete (Iceberg/Paimon) │ 行级别删除标记 +│ ← 根据删除文件标记跳过特定行 │ +└───────────────────────────────────────────┘ +``` + +### 14.2 IO 优化 + +| 优化 | 说明 | +|------|------| +| **MergeRangeFileReader** | 当平均 IO 大小较小时,合并多个小 IO 请求为一个大请求 | +| **BufferedStreamReader** | 流式缓冲读取,减少系统调用次数 | +| **TracingFileReader** | IO 追踪统计,不影响性能 | +| **FileMetaCache** | Footer 
元数据缓存,避免重复解析同一文件的 Footer | +| **Page Cache** | 数据页缓存,支持压缩/解压后两种缓存策略 | + +### 14.3 Lazy Materialization + +``` +传统模式: + IO: ████████████████████ (读取所有列的所有行) + CPU: ████████ (解码所有数据后过滤) + +Lazy Read 模式: + IO: ████ (只读谓词列) + ██ (只读通过过滤的行的其余列) + CPU: ██ (解码谓词列) + █ (解码少量通过的行) + 跳过: ░░░░░░░░░░░░░░ (被过滤掉的行不读取) +``` + +### 14.4 Dict Filter + +``` +原始数据: 字典: + "apple" 0 → "apple" + "banana" 1 → "banana" + "cherry" 2 → "cherry" + "apple" 3 → "date" + "date" + "banana" + +存储为字典码: [0, 1, 2, 0, 3, 1] + +查询: WHERE col = 'apple' + 1. 在字典中查找 'apple' → 字典码 0 + 2. 重写谓词: WHERE dict_code = 0 + 3. 整数比较比字符串比较快得多 + 4. 如果字典中不存在 → 整个 Row Group 跳过 +``` + +--- + +## 15. 调用流程图 + +### 完整调用时序图 + +``` +Scanner ParquetReader RowGroupReader ColumnReader + │ │ │ │ + │── new ParquetReader() ──────>│ │ │ + │ │ │ │ + │── init_reader() ────────────>│ │ │ + │ │── _open_file() ─────────>│ │ + │ │ ├── create FileReader │ │ + │ │ └── parse Footer │ │ + │ │── map columns │ │ + │ │── save predicates │ │ + │ │<─────────────────────────│ │ + │<─────────────────────────────│ │ │ + │ │ │ │ + │── set_fill_columns() ──────>│ │ │ + │ │── classify columns │ │ + │ │── build push_down_pred │ │ + │ │── determine lazy_read │ │ + │<─────────────────────────────│ │ │ + │ │ │ │ + │══ LOOP ═════════════════════════════════════════════════════════════════════ │ + │ │ │ │ + │── get_next_block() ────────>│ │ │ + │ │ │ │ + │ │── _next_row_group_reader() │ + │ │ │ │ │ + │ │ ├── check range align │ │ + │ │ │ │ │ + │ │ ├── _process_min_max_bloom_filter() │ + │ │ │ ├── min/max filter │ │ + │ │ │ ├── bloom filter │ │ + │ │ │ └── page index filter │ + │ │ │ ├── parse offset index │ + │ │ │ ├── parse column index │ + │ │ │ └── → candidate_row_ranges │ + │ │ │ │ │ + │ │ ├── create MergeRangeFileReader │ + │ │ │ │ │ + │ │ └── new RowGroupReader ────────────────────> │ + │ │ └── init() ───────>│ │ + │ │ │── create ColumnReaders + │ │ │── dict filter init │ + │ │ │── rewrite dict pred │ + │ │ │ │ + │ │── next_batch() ─────────>│ │ + │ │ 
│ │ + │ │ [Non-Lazy Read] │ │ + │ │ │── read_column_data()│ + │ │ │ (all columns) ───>│── decode levels + │ │ │ │── decode values + │ │ │<───────────────────│ + │ │ │── fill partition │ + │ │ │── fill missing │ + │ │ │── execute filter │ + │ │ │ │ + │ │ [Lazy Read] │ │ + │ │ │── read predicate cols + │ │ │ ──────────────────>│── decode + │ │ │<───────────────────│ + │ │ │── execute filter │ + │ │ │── → FilterMap │ + │ │ │── read lazy cols │ + │ │ │ (with FilterMap) ─>│── skip/decode + │ │ │<───────────────────│ + │ │ │── filter predicate │ + │ │ │ columns │ + │ │ │ │ + │ │<─────────────────────────│ │ + │<─────────────────────────────│ │ │ + │ │ │ │ + │══ END LOOP (EOF) ══════════════════════════════════════════════════════════ │ + │ │ │ │ + │── close() ─────────────────>│ │ │ + │ │── _collect_profile() │ │ + │<─────────────────────────────│ │ │ +``` + +### 关键数据流向 + +``` +File on Storage + │ + ▼ +FileReader (HDFS/S3/Local/...) + │ + ▼ +TracingFileReader (IO 统计包装) + │ + ├───────────────────────┐ + │ │ + ▼ ▼ +MergeRangeFileReader 直接读取 +(合并小IO) (大IO) + │ │ + ├───────────────────────┘ + │ + ▼ +BufferedStreamReader (流式缓冲) + │ + ▼ +PageReader (Page 迭代器) + │ + ├── Page Header (Thrift 反序列化) + ├── Page Data (压缩数据) + │ │ + │ ▼ + │ BlockCompressionCodec::decompress() + │ │ + │ ▼ + │ 解压后数据 + │ │ + │ ├── Level Data → LevelDecoder + │ │ ├── Rep Levels + │ │ └── Def Levels + │ │ + │ └── Value Data → Decoder + │ │ + │ ▼ + │ Doris Column (Block 中的列) + │ + ▼ +Page Cache (可选) +``` + +--- + +## 附录:相关源文件 + +| 文件 | 说明 | +|------|------| +| `vparquet_reader.h/cpp` | Parquet 文件级别读取器,负责整体流程控制 | +| `vparquet_group_reader.h/cpp` | Row Group 级别读取器,管理列读取和 Lazy Read | +| `vparquet_column_reader.h/cpp` | 列级别读取器,处理嵌套类型和 Level 解码 | +| `vparquet_column_chunk_reader.h/cpp` | Column Chunk 读取器,管理 Page 和 Decoder | +| `vparquet_page_reader.h/cpp` | Page 级别读取器,处理 Page Header 和 Page Cache | +| `vparquet_page_index.h/cpp` | Page Index 解析器(Column Index + Offset Index) | +| `vparquet_file_metadata.h/cpp` | 
文件元数据封装 | +| `schema_desc.h/cpp` | Schema 描述和解析 | +| `parquet_predicate.h/cpp` | 谓词评估(Min/Max、Bloom Filter、Page Index) | +| `decoder.h` | 值解码器基类及各种编码实现 | +| `level_decoder.h` | Rep/Def Level 解码器 | +| `parquet_common.h` | 公共类型定义(RowRanges、FilterMap 等) | +| `parquet_column_convert.h` | Parquet 类型到 Doris 类型的转换 | +| `parquet_block_split_bloom_filter.h` | Parquet Split Block Bloom Filter 实现 | diff --git a/docs/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md b/docs/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md new file mode 100644 index 0000000000000..a7aa874854916 --- /dev/null +++ b/docs/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md @@ -0,0 +1,230 @@ +--- +{ + "title": "CREATE FILESET TABLE", + "language": "en", + "description": "Create a Fileset Table that represents a set of files in object storage. Each Fileset Table has exactly one column of FILE type and dynamically lists files from a storage location at query time." +} +--- + + + +## Description + +A Fileset Table is a virtual table that represents a set of files stored in object storage (S3, OSS, COS, OBS, HDFS, etc.). It has exactly **one column of [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type** and dynamically lists files from a storage location at query time. + +When queried, Doris lists the files in the specified storage location, filters them by the given pattern, and materializes the results as FILE type values — each containing the file's URI, name, content type, size, and storage credentials. + +Fileset Tables are **read-only** — `INSERT`, `UPDATE`, and `DELETE` operations are not supported. The file list is dynamically generated on every query. + +## Syntax + +```sql +CREATE TABLE [IF NOT EXISTS] <table_name> +( + <column_name> FILE NULL +) +ENGINE = fileset +PROPERTIES ( + 'location' = '<uri_with_pattern>', + '<storage_property>' = '<value>' + [, ...] 
+); +``` + +## Parameters + +### Column Definition + +A Fileset Table must have exactly **one column** of [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type. No other columns can be defined. + +### PROPERTIES + +#### Required Properties + +| Property | Description | +|----------|-------------| +| `location` | The storage URI including a file pattern. Format: `scheme://bucket/path/pattern`. The pattern is everything after the last `/`. | + +#### Location Pattern + +The file pattern (after the last `/`) supports POSIX fnmatch glob syntax: + +| Pattern | Description | Example | +|---------|-------------|---------| +| `*` | Match all files | `s3://bucket/data/*` | +| `*.ext` | Match by extension | `s3://bucket/images/*.jpg` | +| `prefix*` | Match by prefix | `s3://bucket/logs/2024*` | +| `file.csv` | Exact match | `s3://bucket/data/file.csv` | +| `data_[0-9]*` | Character class | `s3://bucket/data/data_[0-9]*` | + +#### Storage Properties + +Storage properties depend on the URI scheme. Common S3-compatible properties: + +| Property | Description | +|----------|-------------| +| `s3.region` | Storage region (e.g., `us-east-1`) | +| `s3.endpoint` | Service endpoint URL | +| `s3.access_key` | Access key for authentication | +| `s3.secret_key` | Secret key for authentication | + +##### Authentication via IAM Role + +As an alternative to access key / secret key, you can authenticate using IAM Role assumption: + +| Property | Description | +|----------|-------------| +| `s3.region` | Storage region (e.g., `us-east-1`) | +| `s3.endpoint` | Service endpoint URL | +| `s3.role_arn` | IAM role ARN for cross-account access (e.g., `arn:aws:iam::123456789012:role/MyRole`) | +| `s3.external_id` | External ID for role trust policy (optional) | + +For other storage systems (OSS, COS, OBS), use corresponding property prefixes (e.g., `oss.region`, `cos.endpoint`). 
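The location pattern syntax above follows POSIX `fnmatch` semantics, so it can be illustrated with Python's `fnmatch` module (the object names below are made up for the example):

```python
from fnmatch import fnmatchcase

# Hypothetical object names under one storage prefix
names = ["file.csv", "data_1.parquet", "data_x.parquet", "2024-01.log", "notes.txt"]

def select(pattern, names):
    # Keep only the names the location pattern would select
    return [n for n in names if fnmatchcase(n, pattern)]

print(select("*", names))            # every file
print(select("*.parquet", names))    # match by extension
print(select("2024*", names))        # match by prefix
print(select("data_[0-9]*", names))  # character class: data_1.parquet only
```

Note that `data_x.parquet` is rejected by the character-class pattern because `x` is outside `[0-9]`, while `data_1.parquet` matches.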
+ +## Supported Storage Systems + +| URI Scheme | Storage System | +|------------|---------------| +| `s3://` | Amazon S3 / S3-compatible | +| `oss://` | Alibaba Cloud OSS | +| `cos://` | Tencent Cloud COS | +| `obs://` | Huawei Cloud OBS | +| `hdfs://` | Apache HDFS | + +## Examples + +### List all files in an S3 directory + +```sql +CREATE TABLE s3_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +SELECT * FROM s3_files; +``` + +### List only CSV files + +```sql +CREATE TABLE csv_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/exports/*.csv', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +SELECT * FROM csv_files; +``` + +### List files in OSS + +```sql +CREATE TABLE oss_images ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 'oss://my-bucket/images/*.jpg', + 'oss.region' = 'cn-beijing', + 'oss.endpoint' = 'https://oss-cn-beijing.aliyuncs.com', + 'oss.access_key' = 'your_ak', + 'oss.secret_key' = 'your_sk' +); + +SELECT * FROM oss_images; +``` + +### Match a single specific file + +```sql +CREATE TABLE single_file ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/config/settings.json', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' 
+); +``` + +### Authenticate using IAM Role + +```sql +CREATE TABLE role_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://cross-account-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.role_arn' = 'arn:aws:iam::123456789012:role/CrossAccountRole', + 's3.external_id' = 'my-external-id' +); +``` + +### Use with AI functions + +Fileset Tables are particularly powerful when combined with AI functions. For example, you can compute embeddings for images stored in object storage: + +```sql +-- Create a fileset table for images +CREATE TABLE test_jpg ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/images/*.jpg', + 's3.region' = 'cn-beijing', + 's3.endpoint' = 'https://oss-cn-beijing.aliyuncs.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +-- Compute image embeddings using a multimodal embedding model +SELECT array_size(embed("qwen_mul_embed", file)) FROM test_jpg; +``` + +## Execution Model + +When a Fileset Table is queried, Doris lists and filters the files in the specified storage location, and each matching file is materialized as a FILE type value containing its complete metadata. + +## Notes + +1. Fileset Tables are **dynamic**: the file list is refreshed on every query. Adding or removing files from the storage location is reflected automatically. + +2. The `location` property must contain a file pattern after the last `/`. If no pattern is specified (e.g., `s3://bucket/path/`), it defaults to `*` (match all files). + +3. Only **flat directory listing** is performed — subdirectories are not traversed recursively. + +4. Each FILE value includes the storage credentials (`ak`, `sk`) specified in the table properties. Ensure appropriate security measures when sharing query results. + +5. Fileset Tables are internal tables (not external catalogs). They are created and managed within the Doris internal catalog. 
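The listing rules from the notes above (split the `location` at the last `/`, default the pattern to `*`, flat non-recursive listing) can be sketched in Python; `list_fileset` and the sample URIs are hypothetical illustrations, not Doris internals:

```python
from fnmatch import fnmatchcase

def list_fileset(location, object_uris):
    """Sketch of query-time listing: split 'location' at the last '/',
    default the pattern to '*', and list the directory non-recursively."""
    prefix, _, pattern = location.rpartition("/")
    pattern = pattern or "*"           # 's3://bucket/path/' implies '*'
    prefix += "/"
    files = []
    for uri in object_uris:
        if not uri.startswith(prefix):
            continue
        name = uri[len(prefix):]
        if not name or "/" in name:    # skip the prefix itself and subdirectories
            continue
        if fnmatchcase(name, pattern):
            files.append({"uri": uri, "file_name": name})
    return files

uris = [
    "s3://my-bucket/data/a.csv",
    "s3://my-bucket/data/b.json",
    "s3://my-bucket/data/sub/c.csv",   # in a subdirectory: never listed
]
print(list_fileset("s3://my-bucket/data/*.csv", uris))
print(list_fileset("s3://my-bucket/data/", uris))   # no pattern: defaults to '*'
```

Because the listing is flat, `sub/c.csv` never appears, even with the `*` default.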
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/data-type-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/data-type-overview.md index 2a7650e21d743..05e25505d6eab 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/data-type-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/data-type-overview.md @@ -91,3 +91,9 @@ IP 类型以二进制形式存储 IP 地址,比用字符串存储更省空间 1. [IPv4](../../basic-element/sql-data-types/ip/IPV4.md):以 4 字节二进制存储 IPv4 地址,配合 ipv4_* 系列函数使用。 2. [IPv6](../../basic-element/sql-data-types/ip/IPV6.md):以 16 字节二进制存储 IPv6 地址,配合 ipv6_* 系列函数使用。 + +## 文件类型 + +FILE 类型是一种语义化数据类型,用于表示对象存储中的文件元数据。它将固定结构(URI、文件名、内容类型、大小、凭证)以 JSON 二进制格式存储。 + +- **[FILE](../../basic-element/sql-data-types/semi-structured/FILE.md)**:表示对象存储(S3、OSS、COS、OBS、HDFS)中的文件。FILE 类型只能在 [Fileset 表](../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md)(ENGINE = fileset)中使用,设计用于与 `embed()` 等 AI 函数配合,实现多模态数据处理。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md new file mode 100644 index 0000000000000..902cfe763c49e --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md @@ -0,0 +1,133 @@ +--- +{ + "title": "FILE | Semi Structured", + "language": "zh-CN", + "description": "FILE 类型是一种语义化的数据类型,用于表示对象存储中的文件元数据,使 Doris 能够原生感知和管理文件引用。", + "sidebar_label": "FILE" +} +--- + +# FILE + +## 概述 + +FILE 类型是一种语义化的数据类型,用于表示对象存储中的文件元数据。它存储一个描述远程文件的固定结构(URI、文件名、内容类型、大小、凭证等)。 + +FILE 类型设计用于与 [Fileset 表](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md)引擎和 
[`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) 函数配合使用,使 Doris 能够管理、查询和处理 S3、OSS、COS、OBS等对象存储系统中的文件。 + +## 内部结构 + +每个 FILE 值都是一个包含以下固定字段的 JSONB 对象: + +| 字段 | 类型 | 可空 | 描述 | +|------|------|------|------| +| `uri` | VARCHAR(4096) | 否 | 规范化的对象存储 URI(如 `s3://bucket/path/file.csv`) | +| `file_name` | VARCHAR(512) | 否 | 从 URI 中提取的文件名 | +| `content_type` | VARCHAR(128) | 否 | 根据文件扩展名自动检测的 MIME 类型 | +| `size` | BIGINT | 否 | 文件大小(字节) | +| `region` | VARCHAR(64) | 是 | 云存储区域(如 `us-east-1`) | +| `endpoint` | VARCHAR(256) | 是 | 对象存储服务端点 URL | +| `ak` | VARCHAR(256) | 是 | S3 兼容存储的访问密钥 | +| `sk` | VARCHAR(256) | 是 | S3 兼容存储的密钥 | +| `role_arn` | VARCHAR(256) | 是 | AWS IAM 角色 ARN,用于跨账户访问 | +| `external_id` | VARCHAR(256) | 是 | 角色信任关系中的外部 ID | + +## 类型约束 + +- FILE 类型**只能**在 [Fileset 表](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md)(`ENGINE = fileset` 的表)中使用,**不能**在常规 OLAP 表或其他表引擎中作为列使用。 +- Fileset 表是**只读的**,**不支持** `INSERT`、`UPDATE` 和 `DELETE` 操作。FILE 值在查询时由 Fileset 引擎自动生成。 +- FILE 类型列**不支持**以下操作: + - `ORDER BY` + - `GROUP BY` + - `DISTINCT` + - 聚合函数(`MIN`、`MAX`、`COUNT`、`SUM` 等) + - `JOIN` 等值条件 + - 窗口函数的 `PARTITION BY` / `ORDER BY` + - 创建索引 +- FILE 类型必须与特定函数(如 `TO_FILE` 或 `AI 函数`)配合使用,或在 Fileset 表的场景中使用。 + +## 构造 FILE 值 + +### 使用 Fileset 表(主要方式) + +[Fileset 表](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md)通过列举对象存储位置中的文件来自动生成 FILE 值。这是使用 FILE 值的主要方式: + +```sql +CREATE TABLE my_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' 
+); + +SELECT * FROM my_files; +``` + +### 使用 TO_FILE 函数 + +使用 [`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) 函数在查询表达式中构造 FILE 值。适用于验证单个文件引用或内联构造场景: + +```sql +SELECT to_file( + 's3://my-bucket/data/file.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + 'AKIA...', + 'wJa...' +) AS file_obj; +``` + +:::caution 注意 +`to_file` 函数构造的 FILE 值仅用于查询时使用。由于 Fileset 表是只读的,不能将 `to_file` 构造的 FILE 值通过 INSERT 插入到 Fileset 表中。 +::: + +## 支持的 MIME 类型 + +FILE 类型根据文件扩展名自动检测 MIME 内容类型。支持的映射关系如下: + +| 扩展名 | 内容类型 | +|--------|---------| +| `.csv` | `text/csv` | +| `.json` | `application/json` | +| `.jsonl` | `application/x-ndjson` | +| `.parquet` | `application/x-parquet` | +| `.orc` | `application/x-orc` | +| `.avro` | `application/avro` | +| `.txt`、`.log`、`.tbl` | `text/plain` | +| `.xml` | `application/xml` | +| `.html`、`.htm` | `text/html` | +| `.pdf` | `application/pdf` | +| `.jpg`、`.jpeg` | `image/jpeg` | +| `.png` | `image/png` | +| `.gif` | `image/gif` | +| `.mp3` | `audio/mpeg` | +| `.mp4` | `video/mp4` | +| `.gz` | `application/gzip` | +| `.bz2` | `application/x-bzip2` | +| `.zst` | `application/zstd` | +| `.lz4` | `application/x-lz4` | +| `.zip` | `application/zip` | +| `.tar` | `application/x-tar` | +| 其他 | `application/octet-stream` | + +## 注意事项 + +1. FILE 类型的值在内部以 JSONB 二进制格式存储,每个值的物理存储大小取决于元数据内容(通常为 200–400 字节)。 + +2. FILE 类型支持的 URI 协议包括 `s3://`、`oss://`、`cos://`、`obs://` 和 `hdfs://`。非 S3 协议(`oss://`、`cos://`、`obs://`)在内部会被规范化为 `s3://` 以保证兼容性。 + +3. 
`to_file` 函数通过向对象存储服务发送 HEAD 请求来验证对象是否存在,确保引用的文件在构造 FILE 值之前是可访问的。 + +## 结合 AI 函数使用 + +FILE 类型旨在与 Doris 的 AI 函数集成,实现多模态数据处理。示例: + +```sql +-- 计算图片的向量嵌入 +SELECT array_size(embed("qwen_mul_embed", file)) FROM my_fileset_table; + +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md new file mode 100644 index 0000000000000..f66a259b3e490 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/file-functions/to-file.md @@ -0,0 +1,168 @@ +--- +{ + "title": "TO_FILE", + "language": "zh-CN", + "description": "根据对象存储 URL 和凭证构造 FILE 类型的值,自动提取元数据并验证对象的可访问性。" +} +--- + + + +## 描述 + +根据对象存储 URL 和认证凭证构造一个 [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) 类型的值。对于每条输入数据,该函数会: + +1. 从 URL 中提取元数据(文件名、扩展名、MIME 内容类型)。 +2. 通过向对象存储服务发送 HEAD 请求来验证对象是否存在且可访问。 +3. 从存储服务获取实际文件大小。 +4. 
返回包含完整元数据的 FILE 值。 + +## 语法 + +```sql +TO_FILE(url, region, endpoint, ak, sk) +TO_FILE(url, region, endpoint, ak, sk, role_arn, external_id) +``` + +## 参数 + +| 参数 | 类型 | 描述 | +|------|------|------| +| **url** | VARCHAR | 文件的完整对象存储 URL(如 `s3://bucket/path/file.csv`)。支持的 URI 协议:`s3://`、`oss://`、`cos://`、`obs://` | +| **region** | VARCHAR | 云存储区域(如 `us-east-1`、`cn-beijing`) | +| **endpoint** | VARCHAR | 对象存储服务端点 URL(如 `https://s3.us-east-1.amazonaws.com`)。如果缺少 `http://` 前缀会自动添加 | +| **ak** | VARCHAR | 认证访问密钥 | +| **sk** | VARCHAR | 认证密钥 | +| **role_arn** | VARCHAR | 认证访问role arn | +| **external_id** | VARCHAR | 认证访问external id | + +## 返回值 + +返回一个 [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) 类型的值,包含以下元数据: + +- `uri`:规范化的对象存储 URI +- `file_name`:从 URL 中提取的文件名 +- `content_type`:根据文件扩展名自动检测的 MIME 类型 +- `size`:实际文件大小(字节),从存储服务获取 +- `region`:存储区域 +- `endpoint`:规范化的端点 URL +- `ak`:访问密钥 +- `sk`:密钥 +- `role_arn`: IAM Role +- `external_id`: external id + +如果任何输入参数为 NULL,则返回 NULL。 + +## 示例 + +### 基本用法 + +```sql +SELECT to_file( + 's3://my-bucket/data/report.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + 'AKIA', + 'wJalr' +); +``` + +```text ++--------------------------------------------------------------+ +| to_file(...) | ++--------------------------------------------------------------+ +| {"uri":"s3://my-bucket/data/report.csv","file_name": | +| "report.csv","content_type":"text/csv","size":1024000, | +| "region":"us-east-1","endpoint":"https://s3.us-east-1. | +| amazonaws.com","ak":"AKIA...","sk":"wJa...", | +| "role_arn":null,"external_id":null} | ++--------------------------------------------------------------+ +``` + +```sql +SELECT to_file( + 's3://my-bucket/data/report.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + '', + '', + 'arn:aws:iam::447051187841:role/lzq-role-test', + '1001' +); +``` + +```text ++--------------------------------------------------------------+ +| to_file(...) 
| ++--------------------------------------------------------------+ +| {"uri":"s3://my-bucket/data/report.csv","file_name": | +| "report.csv","content_type":"text/csv","size":1024000, | +| "region":"us-east-1","endpoint":"https://s3.us-east-1. | +| amazonaws.com","ak":null,"sk":null, | +| "role_arn":"arn:aws:iam::447051187841:role/lzq-role-test", | +| "external_id":"1001"} | ++--------------------------------------------------------------+ +``` + +### 使用 OSS 兼容存储 + +```sql +SELECT to_file( + 'oss://my-bucket/images/photo.jpg', + 'cn-beijing', + 'https://oss-cn-beijing.aliyuncs.com', + 'your_access_key', + 'your_secret_key' +); +``` + +:::tip +非 S3 的 URI 协议(`oss://`、`cos://`、`obs://`)在内部会自动规范化为 `s3://`,以兼容 S3 SDK。 +::: + +## 错误处理 + +该函数在以下情况会返回错误: + +- **对象不可访问**:如果向存储服务发送的 HEAD 请求失败(例如对象不存在、权限不足),函数返回 `InvalidArgument` 错误,包含 URL 和存储服务的错误信息。 + +- **客户端创建失败**:如果无法为指定的端点创建 S3 客户端(例如无效的端点 URL),函数返回 `InternalError`。 + +```sql +-- 如果对象不存在,此操作将失败 +SELECT to_file( + 's3://non-existent-bucket/file.csv', + 'us-east-1', + 'https://s3.us-east-1.amazonaws.com', + 'AKIA...', + 'wJa...' +); +-- ERROR: to_file: object 's3://non-existent-bucket/file.csv' is not accessible: ... +``` + +## 注意事项 + +1. 该函数会为**每行**处理的数据向对象存储服务发送一个网络请求(HEAD)。处理大数据集时,这可能会影响性能。 + +2. 端点 URL 必须从 Doris BE 节点可访问。请确保网络连接和防火墙规则允许出站访问。 + +3. `content_type` 仅根据文件扩展名确定,不会检查实际文件内容。 + +4. 
有关支持的 MIME 类型映射,请参阅 [FILE 类型文档](../../../basic-element/sql-data-types/semi-structured/FILE.md#支持的-mime-类型)。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md new file mode 100644 index 0000000000000..1531809981d9a --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md @@ -0,0 +1,230 @@ +--- +{ + "title": "CREATE FILESET TABLE", + "language": "zh-CN", + "description": "创建 Fileset 表,用于表示对象存储中的一组文件。每个 Fileset 表只有一个 FILE 类型的列,在查询时动态列举存储位置中的文件。" +} +--- + + + +## 描述 + +Fileset 表是一种虚拟表,用于表示存储在对象存储(S3、OSS、COS、OBS、HDFS 等)中的一组文件。它只有**一个 [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) 类型的列**,在查询时动态列举存储位置中的文件。 + +当查询 Fileset 表时,Doris 会列举指定存储位置中的文件,按给定模式过滤,并将结果生成为 FILE 类型的值 — 每个值包含文件的 URI、名称、内容类型、大小和存储凭证。 + +Fileset 表是**只读的** — 不支持 `INSERT`、`UPDATE` 和 `DELETE` 操作。文件列表在每次查询时动态生成。 + +## 语法 + +```sql +CREATE TABLE [IF NOT EXISTS] <table_name> +( + <column_name> FILE NULL +) +ENGINE = fileset +PROPERTIES ( + 'location' = '<uri_with_pattern>', + '<storage_property>' = '<value>' + [, ...] 
+); +``` + +## 参数 + +### 列定义 + +Fileset 表必须且只能有**一个** [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) 类型的列,不能定义其他列。 + +### PROPERTIES + +#### 必需属性 + +| 属性 | 描述 | +|------|------| +| `location` | 包含文件匹配模式的存储 URI。格式:`scheme://bucket/path/pattern`。匹配模式为最后一个 `/` 之后的部分。 | + +#### 位置匹配模式 + +文件匹配模式(最后一个 `/` 之后的部分)支持 POSIX fnmatch 通配符语法: + +| 模式 | 描述 | 示例 | +|------|------|------| +| `*` | 匹配所有文件 | `s3://bucket/data/*` | +| `*.ext` | 按扩展名匹配 | `s3://bucket/images/*.jpg` | +| `prefix*` | 按前缀匹配 | `s3://bucket/logs/2024*` | +| `file.csv` | 精确匹配 | `s3://bucket/data/file.csv` | +| `data_[0-9]*` | 字符类匹配 | `s3://bucket/data/data_[0-9]*` | + +#### 存储属性 + +存储属性取决于 URI 协议。常用的 S3 兼容属性如下: + +| 属性 | 描述 | +|------|------| +| `s3.region` | 存储区域(如 `us-east-1`) | +| `s3.endpoint` | 服务端点 URL | +| `s3.access_key` | 认证访问密钥 | +| `s3.secret_key` | 认证密钥 | + +##### 通过 IAM 角色认证 + +除了 access key / secret key 方式外,还可以通过 IAM 角色来认证: + +| 属性 | 描述 | +|------|------| +| `s3.region` | 存储区域(如 `us-east-1`) | +| `s3.endpoint` | 服务端点 URL | +| `s3.role_arn` | 用于跨账户访问的 IAM 角色 ARN(如 `arn:aws:iam::123456789012:role/MyRole`) | +| `s3.external_id` | 角色信任策略中的外部 ID(可选) | + +对于其他存储系统(OSS、COS、OBS),使用对应的属性前缀(如 `oss.region`、`cos.endpoint`)。 + +## 支持的存储系统 + +| URI 协议 | 存储系统 | +|----------|---------| +| `s3://` | Amazon S3 / S3 兼容存储 | +| `oss://` | 阿里云 OSS | +| `cos://` | 腾讯云 COS | +| `obs://` | 华为云 OBS | +| `hdfs://` | Apache HDFS | + +## 示例 + +### 列举 S3 目录中的所有文件 + +```sql +CREATE TABLE s3_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' 
+); + +SELECT * FROM s3_files; +``` + +### 仅列举 CSV 文件 + +```sql +CREATE TABLE csv_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/exports/*.csv', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +SELECT * FROM csv_files; +``` + +### 列举 OSS 中的文件 + +```sql +CREATE TABLE oss_images ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 'oss://my-bucket/images/*.jpg', + 'oss.region' = 'cn-beijing', + 'oss.endpoint' = 'https://oss-cn-beijing.aliyuncs.com', + 'oss.access_key' = 'your_ak', + 'oss.secret_key' = 'your_sk' +); + +SELECT * FROM oss_images; +``` + +### 匹配单个特定文件 + +```sql +CREATE TABLE single_file ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/config/settings.json', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); +``` + +### 通过 IAM 角色认证 + +```sql +CREATE TABLE role_files ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://cross-account-bucket/data/*', + 's3.region' = 'us-east-1', + 's3.endpoint' = 'https://s3.us-east-1.amazonaws.com', + 's3.role_arn' = 'arn:aws:iam::123456789012:role/CrossAccountRole', + 's3.external_id' = 'my-external-id' +); +``` + +### 结合 AI 函数使用 + +Fileset 表与 AI 函数结合使用时非常强大。例如,可以对存储在对象存储中的图片计算向量嵌入: + +```sql +-- 为图片创建 fileset 表 +CREATE TABLE test_jpg ( + `file` FILE NULL +) ENGINE = fileset +PROPERTIES ( + 'location' = 's3://my-bucket/images/*.jpg', + 's3.region' = 'cn-beijing', + 's3.endpoint' = 'https://oss-cn-beijing.aliyuncs.com', + 's3.access_key' = 'AKIA...', + 's3.secret_key' = 'wJa...' +); + +-- 使用多模态嵌入模型计算图片的向量嵌入 +SELECT array_size(embed("qwen_mul_embed", file)) FROM test_jpg; +``` + +## 执行模型 + +当查询 Fileset 表时,Doris 会列举并过滤指定存储位置中的文件,每个匹配的文件被生成为一个包含完整元数据的 FILE 类型值。 + +## 注意事项 + +1. 
Fileset 表是**动态的**:文件列表在每次查询时刷新。添加或删除存储位置中的文件会自动反映在查询结果中。 + +2. `location` 属性必须在最后一个 `/` 之后包含文件匹配模式。如果未指定模式(如 `s3://bucket/path/`),默认使用 `*`(匹配所有文件)。 + +3. 仅执行**平级目录列举** — 不会递归遍历子目录。 + +4. 每个 FILE 值包含表属性中指定的存储凭证(`ak`、`sk`)。共享查询结果时请确保采取适当的安全措施。 + +5. Fileset 表是内部表(不是外部 Catalog),在 Doris Internal Catalog 中创建和管理。
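上文 FILE 类型文档中“按扩展名自动检测 MIME 类型”的行为可以用如下 Python 片段示意(映射条目节选自文档表格;`detect_content_type` 是示意用的假想函数名,并非 Doris 内部实现):

```python
# 扩展名到 MIME 类型的映射(节选自 FILE 类型文档中的表格)
MIME_BY_EXT = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".parquet": "application/x-parquet",
    ".orc": "application/x-orc",
    ".txt": "text/plain",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gz": "application/gzip",
}

def detect_content_type(file_name: str) -> str:
    """按扩展名返回 MIME 类型;未知扩展名回退到 application/octet-stream。"""
    dot = file_name.rfind(".")
    ext = file_name[dot:].lower() if dot != -1 else ""
    return MIME_BY_EXT.get(ext, "application/octet-stream")

print(detect_content_type("photo.JPG"))    # image/jpeg
print(detect_content_type("report.csv"))   # text/csv
print(detect_content_type("unknown.bin"))  # application/octet-stream
```

与文档描述一致,检测只看扩展名,不读取文件内容;扩展名比较不区分大小写。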