@@ -102,3 +102,9 @@ IP data types store IP addresses in a binary format, which is faster and more sp
- **[IPv4](../sql-data-types/ip/IPV4.md)**: It stores IPv4 addresses as a 4-byte binary value. It is used in conjunction with the `ipv4_*` family of functions.
- **[IPv6](../sql-data-types/ip/IPV6.md)**: It stores IPv6 addresses as a 16-byte binary value. It is used in conjunction with the `ipv6_*` family of functions.

## File Type

The FILE type is a semantic data type that represents metadata about a file in object storage. It stores a fixed-schema struct (URI, file name, content type, size, credentials) in JSONB binary format.

- **[FILE](../sql-data-types/semi-structured/FILE.md)**: Represents a file in object storage (S3, OSS, COS, OBS, HDFS). FILE type can only be used in [Fileset Tables](../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (ENGINE = fileset). It is designed to work with AI functions such as `embed()` for multimodal data processing.

133 changes: 133 additions & 0 deletions docs/sql-manual/basic-element/sql-data-types/semi-structured/FILE.md
@@ -0,0 +1,133 @@
---
{
"title": "FILE | Semi Structured",
"language": "en",
"description": "The FILE type is a semantic data type that represents object storage file metadata, enabling Doris to handle file references with built-in metadata awareness.",
"sidebar_label": "FILE"
}
---

# FILE

## Overview

The FILE type is a semantic first-class data type that represents object storage file metadata. It stores a fixed-schema struct describing a remote file (URI, name, content type, size, credentials, etc.).

FILE is designed to work with the [Fileset Table](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) engine and the [`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) function, enabling Doris to manage, query, and process files in object storage systems like S3, OSS, COS, and OBS.

## Internal Schema

Each FILE value is a JSONB object with the following fixed fields:

| Field | Type | Nullable | Description |
|-------|------|----------|-------------|
| `uri` | VARCHAR(4096) | No | Normalized object storage URI (e.g., `s3://bucket/path/file.csv`) |
| `file_name` | VARCHAR(512) | No | File name extracted from the URI |
| `content_type` | VARCHAR(128) | No | MIME type auto-detected from file extension |
| `size` | BIGINT | No | File size in bytes |
| `region` | VARCHAR(64) | Yes | Cloud region (e.g., `us-east-1`) |
| `endpoint` | VARCHAR(256) | Yes | Object storage endpoint URL |
| `ak` | VARCHAR(256) | Yes | Access key for S3-compatible storage |
| `sk` | VARCHAR(256) | Yes | Secret key for S3-compatible storage |
| `role_arn` | VARCHAR(256) | Yes | AWS IAM role ARN for cross-account access |
| `external_id` | VARCHAR(256) | Yes | External ID for role assumption |

## Type Constraints

- FILE type can **only** be used in [Fileset Tables](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (tables with `ENGINE = fileset`). It **cannot** be used as a column in regular OLAP tables or other table engines.
- Fileset Tables are **read-only**. `INSERT`, `UPDATE`, and `DELETE` operations are **not supported**. FILE values are automatically materialized by the Fileset engine at query time.
- FILE type columns do **not** support the following operations:
  - `ORDER BY`
  - `GROUP BY`
  - `DISTINCT`
  - Aggregate functions (`MIN`, `MAX`, `COUNT`, `SUM`, etc.)
  - `JOIN` equality conditions
  - Window function `PARTITION BY` / `ORDER BY`
  - Index creation
- FILE values must be used with specific functions (e.g., `TO_FILE` or AI functions) or in the context of a Fileset Table.
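
As a rough illustration of the constraints above, the sketch below contrasts a plain projection with statements that the list rules out. The table name `my_files` and column name `file` are assumptions chosen to match the Fileset Table example in this document. The rejected statements are left commented out, and no error text is shown because it is not specified here.

```sql
-- Assumed: a Fileset Table named my_files with a single FILE column named `file`.

-- Supported: project FILE values as they are materialized at query time.
SELECT `file` FROM my_files;

-- Not supported: ordering, grouping, or de-duplicating on a FILE column.
-- SELECT `file` FROM my_files ORDER BY `file`;
-- SELECT `file`, COUNT(*) FROM my_files GROUP BY `file`;
-- SELECT DISTINCT `file` FROM my_files;

-- Not supported: aggregate functions over a FILE column.
-- SELECT MAX(`file`) FROM my_files;
```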

## Constructing FILE Values

### Using a Fileset Table (Primary Method)

A [Fileset Table](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) automatically materializes FILE values by listing files in an object storage location. This is the primary way to work with FILE values:

```sql
CREATE TABLE my_files (
    `file` FILE NULL
) ENGINE = fileset
PROPERTIES (
    'location' = 's3://my-bucket/data/*',
    's3.region' = 'us-east-1',
    's3.endpoint' = 'https://s3.us-east-1.amazonaws.com',
    's3.access_key' = 'AKIA...',
    's3.secret_key' = 'wJa...'
);

SELECT * FROM my_files;
```

### Using the TO_FILE Function

Use the [`TO_FILE`](../../../sql-functions/scalar-functions/file-functions/to-file.md) function to construct FILE values in a query expression. This is useful for validating individual file references or inline construction:

```sql
SELECT to_file(
    's3://my-bucket/data/file.csv',
    'us-east-1',
    'https://s3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
) AS file_obj;
```

:::caution Note
The `to_file` function constructs FILE values for query-time use only. Since Fileset Tables are read-only, you cannot INSERT file values constructed by `to_file` into a Fileset Table.
:::

## Supported MIME Types

The FILE type automatically detects the MIME content type from the file extension. Supported mappings include:

| Extension | Content Type |
|-----------|-------------|
| `.csv` | `text/csv` |
| `.json` | `application/json` |
| `.jsonl` | `application/x-ndjson` |
| `.parquet` | `application/x-parquet` |
| `.orc` | `application/x-orc` |
| `.avro` | `application/avro` |
| `.txt`, `.log`, `.tbl` | `text/plain` |
| `.xml` | `application/xml` |
| `.html`, `.htm` | `text/html` |
| `.pdf` | `application/pdf` |
| `.jpg`, `.jpeg` | `image/jpeg` |
| `.png` | `image/png` |
| `.gif` | `image/gif` |
| `.mp3` | `audio/mpeg` |
| `.mp4` | `video/mp4` |
| `.gz` | `application/gzip` |
| `.bz2` | `application/x-bzip2` |
| `.zst` | `application/zstd` |
| `.lz4` | `application/x-lz4` |
| `.zip` | `application/zip` |
| `.tar` | `application/x-tar` |
| Other | `application/octet-stream` |

## Notes

1. FILE type values are stored internally in JSONB binary format. The physical storage size per value depends on the metadata content (typically 200–400 bytes).

2. The FILE type supports the URI schemes `s3://`, `oss://`, `cos://`, `obs://`, and `hdfs://`. The `oss://`, `cos://`, and `obs://` schemes are normalized to `s3://` internally for compatibility.

3. The `to_file` function validates object existence via a HEAD request to the object storage service, ensuring that the referenced file is accessible before constructing the FILE value.

## Using FILE with AI Functions

The FILE type is designed to integrate with Doris AI functions for multimodal data processing. For example:

```sql
-- Compute an embedding for each image file and return its length
SELECT array_size(embed("qwen_mul_embed", file)) FROM my_fileset_table;
```
@@ -0,0 +1,144 @@
---
{
"title": "TO_FILE",
"language": "en",
"description": "Constructs a FILE type value from object storage URL and credentials, with automatic metadata extraction and object validation."
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

## Description

Constructs a [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type value from an object storage URL and authentication credentials. This function is designed for query-time use — for example, to validate file accessibility or construct FILE values as part of query expressions.

:::caution Note
FILE type can only be used in [Fileset Tables](../../../sql-statements/table-and-view/table/CREATE-FILESET-TABLE.md) (ENGINE = fileset). You cannot INSERT `to_file()` results into regular OLAP tables. For bulk file listing, use a Fileset Table instead.
:::

For each input, the function:

1. Extracts metadata from the URL (file name, extension, MIME content type).
2. Validates that the object exists and is accessible via a HEAD request to the object storage service.
3. Retrieves the actual file size from the storage service.
4. Returns a FILE value containing the complete metadata.

## Syntax

```sql
TO_FILE(url, region, endpoint, ak, sk)
```

## Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| **url** | VARCHAR | The full object storage URL of the file (e.g., `s3://bucket/path/file.csv`). Supported URI schemes: `s3://`, `oss://`, `cos://`, `obs://` |
| **region** | VARCHAR | The cloud storage region (e.g., `us-east-1`, `cn-beijing`) |
| **endpoint** | VARCHAR | The object storage service endpoint URL (e.g., `https://s3.us-east-1.amazonaws.com`). The `http://` prefix will be added automatically if missing |
| **ak** | VARCHAR | The access key for authentication |
| **sk** | VARCHAR | The secret key for authentication |
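
To make the endpoint rule in the table concrete, here is a minimal sketch that passes an endpoint without a scheme. The bucket, path, and credential placeholders are illustrative assumptions; per the parameter description, the function is expected to prepend `http://` before contacting the storage service.

```sql
-- Endpoint given without a scheme; `http://` is added automatically
-- according to the parameter table above.
SELECT to_file(
    's3://my-bucket/data/file.csv',
    'us-east-1',
    's3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
) AS file_obj;
```

Pass the full `https://...` form when the service should be reached over HTTPS, since only the `http://` prefix is described as being added automatically.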

## Return Value

Returns a value of [FILE](../../../basic-element/sql-data-types/semi-structured/FILE.md) type containing the following metadata:

- `uri`: Normalized object storage URI
- `file_name`: File name extracted from URL
- `content_type`: MIME type auto-detected from file extension
- `size`: Actual file size in bytes (retrieved from storage service)
- `region`: Storage region
- `endpoint`: Normalized endpoint URL
- `ak`: Access key
- `sk`: Secret key
- `role_arn`, `external_id`: `null`, since `TO_FILE` takes no arguments for them (see the example output below)

Returns NULL if any input parameter is NULL (propagates nullability).
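
A small sketch of the NULL rule stated above, using placeholder values for the remaining arguments (the region, endpoint, and credentials are illustrative assumptions):

```sql
-- The url argument is NULL, so the result is NULL
-- regardless of the other arguments.
SELECT to_file(
    NULL,
    'us-east-1',
    'https://s3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
) AS file_obj;
```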

## Examples

### Basic usage

```sql
SELECT to_file(
    's3://my-bucket/data/report.csv',
    'us-east-1',
    'https://s3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
);
```

```text
+--------------------------------------------------------------+
| to_file(...) |
+--------------------------------------------------------------+
| {"uri":"s3://my-bucket/data/report.csv","file_name": |
| "report.csv","content_type":"text/csv","size":1024000, |
| "region":"us-east-1","endpoint":"https://s3.us-east-1. |
| amazonaws.com","ak":"AKIA...","sk":"wJa...", |
| "role_arn":null,"external_id":null} |
+--------------------------------------------------------------+
```

### Using with OSS-compatible storage

```sql
SELECT to_file(
    'oss://my-bucket/images/photo.jpg',
    'cn-beijing',
    'https://oss-cn-beijing.aliyuncs.com',
    'your_access_key',
    'your_secret_key'
);
```

:::tip
Non-S3 URI schemes (`oss://`, `cos://`, `obs://`) are automatically normalized to `s3://` internally for S3 SDK compatibility.
:::

## Error Handling

The function returns an error in the following cases:

- **Object not accessible**: If the HEAD request to the storage service fails (e.g., object does not exist, insufficient permissions), the function returns an `InvalidArgument` error with details about the URL and the storage service error message.

- **Client creation failure**: If the S3 client cannot be created for the given endpoint (e.g., invalid endpoint URL), the function returns an `InternalError`.

```sql
-- This will fail if the object does not exist
SELECT to_file(
    's3://non-existent-bucket/file.csv',
    'us-east-1',
    'https://s3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
);
-- ERROR: to_file: object 's3://non-existent-bucket/file.csv' is not accessible: ...
```

## Notes

1. The function makes a network request (HEAD) to the object storage service for **each row** processed. When processing large datasets, this may impact performance.

2. The endpoint URL must be accessible from the Doris BE nodes. Ensure network connectivity and firewall rules allow outbound access.

3. The `content_type` is determined by the file extension only. It does not inspect the actual file content.

4. For supported MIME type mappings, see the [FILE type documentation](../../../basic-element/sql-data-types/semi-structured/FILE.md#supported-mime-types).
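
Because each processed row costs one HEAD request (note 1), narrowing the input before calling `to_file` is one way to keep the request count down. The sketch below assumes a hypothetical table `file_urls` with a VARCHAR column `url`; neither name comes from Doris. It also assumes the filter is evaluated before `to_file` is applied, which depends on the query plan.

```sql
-- Hypothetical table: file_urls(url VARCHAR) holding object storage URLs.
-- Only rows that pass the filter should reach to_file, so fewer
-- HEAD requests are issued than for a full-table scan.
SELECT to_file(
    url,
    'us-east-1',
    'https://s3.us-east-1.amazonaws.com',
    'AKIA...',
    'wJa...'
) AS file_obj
FROM file_urls
WHERE url LIKE 's3://my-bucket/data/%';
```

For listing many files under one location, a Fileset Table remains the approach this documentation recommends.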