Subject: Query Planner Fails to Validate Valid ABFSS Path with Wildcard (**)
Component: Storage - Azure
Apache Drill Version: 1.22.0
Summary:
A SELECT query against a specific directory path on Azure Blob Storage (using the ABFSS connector) fails during the validation phase with an "Object not found" error. However, Drill's own file listing tools (SHOW FILES) can see and list the contents of the exact same path, and a global wildcard query can read the data successfully.
The issue appears to be a bug in the query planner's path validation logic. The planner seems to develop a "stuck" or "corrupted" state for certain directory names, refusing to acknowledge them in SELECT statements while other parts of Drill can access them without issue. The bug persists even after restarting the Drillbit and completely deleting/recreating the storage plugin.
Environment:
- Storage Plugin: file
- Connection Type: Azure Blob Storage (abfss://<container>@<account>.dfs.core.windows.net)
- Authentication: SharedKey
Storage Plugin Configuration:
{
  "type": "file",
  "enabled": true,
  "connection": "abfss://<container>@<account>.dfs.core.windows.net",
  "config": {
    "fs.azure.account.auth.type": "SharedKey",
    "fs.azure.account.key.observercondenseddata.dfs.core.windows.net": "...",
    "fs.azure.createRemoteFileSystemDuringInitialization": "false",
    "fs.azure.io.list.recursive": "true"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "allowRecursiveScan": true
    },
    "monthly": {
      "location": "/prod-condenser-logs-1-Month/",
      "writable": false,
      "allowRecursiveScan": true
    },
    "daily": {
      "location": "/prod-condenser-logs-1-day/",
      "writable": false,
      "allowRecursiveScan": true
    },
    "hourly": {
      "location": "/prod-condenser-logs-1-hour/",
      "writable": false,
      "allowRecursiveScan": true
    }
  },
  "formats": {
    "log": {
      "type": "logRegex",
      "extension": "log",
      "regex": "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) - (\\w+) - (.*)|^(.+)",
      "maxErrors": 100000,
      "schema": [
        {"fieldName": "log_timestamp", "fieldType": "TIMESTAMP", "format": "yyyy-MM-dd HH:mm:ss,SSS"},
        {"fieldName": "log_level"},
        {"fieldName": "structured_message"},
        {"fieldName": "unstructured_line"}
      ]
    }
  }
}
Directory Structure on Azure:
/
├── prod-condenser-logs-1-Month/
│   └── 2025/
│       └── 07/
├── prod-condenser-logs-1-day/
│   └── 2025/
│       ├── 07/
│       └── 08/
└── prod-condenser-logs-1-hour/
    └── 2025/
        └── ...
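For anyone checking whether the log format itself could be involved: the sketch below applies the regex from the plugin configuration above to two made-up sample lines, to show how the four capture groups line up with the four schema fields. Python's re module is used as a stand-in for the Java regex engine Drill runs on. I am assuming the two behave the same for this particular pattern. The sample lines and the split_line helper are hypothetical and not taken from the real logs.

```python
import re

# Same pattern as the "regex" value in the plugin config, with the JSON
# backslash escaping removed.
LOG_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (\w+) - (.*)|^(.+)"
)

# Field names in the order they appear in the "schema" array.
FIELDS = ("log_timestamp", "log_level", "structured_message", "unstructured_line")


def split_line(line: str) -> dict:
    """Map the four capture groups onto the schema field names."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # A line with no characters matches neither alternative.
        return dict.fromkeys(FIELDS)
    return dict(zip(FIELDS, match.groups()))


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    print(split_line("2025-07-01 12:00:00,123 - INFO - started"))
    print(split_line("Traceback (most recent call last):"))
```

A line in the timestamp/level/message shape fills the first three fields and leaves unstructured_line empty. Any other non-empty line falls through to the second alternative and fills only unstructured_line.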
Steps to Reproduce:
- A query on a sibling directory works correctly: The following query against the ...-1-Month directory executes successfully every time.
  SELECT * FROM az.root.`prod-condenser-logs-1-Month/2025/**` LIMIT 10;
- An identical query on the target directory fails: The following query against the ...-1-day directory consistently fails.
  SELECT * FROM az.root.`prod-condenser-logs-1-day/2025/**` LIMIT 10;
- Drill's listing tools prove the path is visible: Contradicting the query failure, the SHOW FILES command can see and list the contents of the failing directory, proving the path is valid and accessible to Drill.
  -- This command SUCCEEDS and shows the '2025' directory within
  SHOW FILES FROM az.root.`prod-condenser-logs-1-day`;
Expected Behavior:
The SELECT query against az.root.`prod-condenser-logs-1-day/2025/**` should execute successfully, just as the query against the sibling ...-1-Month directory does.
Actual Behavior:
The query fails during the validation phase with the error:
VALIDATION ERROR: ... Object 'prod-condenser-logs-1-day/2025/**' not found within 'az.root'
Troubleshooting Steps Attempted (All Failed to Resolve the Issue):
- Restarting the Drillbit: The issue persists immediately after a full restart.
- Deleting and Recreating the Storage Plugin: The exact same behavior occurs after completely removing the az plugin and recreating it from the saved configuration.
- Renaming/Duplicating the Source Directory: Renaming the directory in Azure to a new name (e.g., prod-condenser-logs-daily-new) and querying it results in the same "Object not found" error.
- Using Defined Workspaces: Querying via the az.daily workspace (e.g., FROM az.daily.`2025/**`) also fails with the same error, even though SHOW FILES IN az.daily correctly lists the contents.
- REFRESH TABLE METADATA: This command fails because Drill does not recognize the paths as tables.
Final Workaround Discovered:
The only reliable method to query the data in the affected directories is to use a global wildcard from the root (FROM az.root.`**`) and then filter the desired path using a WHERE clause. This proves the data is readable and the bug is specific to the planner's path validation.
-- This query WORKS and returns data from the '...-1-day' directory
SELECT *
FROM az.root.`**`
WHERE filepath LIKE '%/prod-condenser-logs-1-day/%'
LIMIT 10;
This workaround suggests the core data reading engine is functional, but the upfront query validation is failing on specific path strings.
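If the workaround has to be applied to several directories, the query can be generated rather than hand-edited. The sketch below is one way to do that, under stated assumptions: build_workaround_query and build_rest_payload are hypothetical helper names, the az.root default comes from the plugin configuration above, and the JSON body follows the queryType/query shape used by Drill's REST query endpoint. The trailing semicolon is left off; add it back if your client expects one.

```python
import json


def build_workaround_query(directory: str, limit: int = 10,
                           workspace: str = "az.root") -> str:
    """Return the root-wildcard SELECT from this report, filtered to one directory."""
    # Reject characters that would end the SQL string literal or act as LIKE
    # wildcards. This is a sketch, not a general-purpose SQL escaper.
    if any(ch in directory for ch in "'%_\\"):
        raise ValueError(f"unsupported character in directory name: {directory!r}")
    if limit <= 0:
        raise ValueError("limit must be positive")
    name = directory.strip("/")
    return (
        "SELECT * "
        f"FROM {workspace}.`**` "
        f"WHERE filepath LIKE '%/{name}/%' "
        f"LIMIT {limit}"
    )


def build_rest_payload(sql: str) -> str:
    """Serialize the query as a JSON body for Drill's REST query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql})


if __name__ == "__main__":
    sql = build_workaround_query("prod-condenser-logs-1-day")
    print(sql)
    print(build_rest_payload(sql))
```

The helper only builds strings and does not contact a Drillbit. Sending the payload (for example with an HTTP POST to the Drillbit's /query.json endpoint) is left to whatever client is already in use.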