Summary
The _table_catalog source_url field currently stores a human-readable
page URL (e.g. https://ourworldindata.org/grapher/population). There is
no structured field for the actual data file URL, which means an automated
"refresh this table from its source" operation cannot be driven purely from
catalog metadata — it requires prose parsing from notes or manual
re-entry of the download URL.
Proposed change
Add a new data_url field to _table_catalog and expose it as a
parameter in set_table_metadata. This is additive and non-breaking:
source_url continues to hold the human-readable page/reference URL,
source_description is unchanged, and data_url carries the machine-
actionable download URL.
Example for an OWID table:
| Field | Value |
| --- | --- |
| source_url | https://ourworldindata.org/grapher/population (chart page, unchanged) |
| source_description | "OWID historical population, 10 000 BCE – 2023…" (unchanged) |
| data_url | https://ourworldindata.org/grapher/population.csv?v=1&csvType=full&useColumnShortNames=true ← new |
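To illustrate why the change is additive, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the real catalog store. The three-column catalog layout is an assumption made for the example; the actual _table_catalog schema and storage engine may differ.

```python
import sqlite3

# SQLite is only a stand-in here; the real catalog lives elsewhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE _table_catalog ("
    "table_name TEXT PRIMARY KEY, source_url TEXT, source_description TEXT)"
)
conn.execute(
    "INSERT INTO _table_catalog VALUES (?, ?, ?)",
    (
        "owid_population",
        "https://ourworldindata.org/grapher/population",
        "OWID historical population, 10 000 BCE – 2023…",
    ),
)

# Additive change: one new nullable column, no existing column is touched.
conn.execute("ALTER TABLE _table_catalog ADD COLUMN data_url TEXT")

query = (
    "SELECT source_url, data_url FROM _table_catalog "
    "WHERE table_name = 'owid_population'"
)
# Existing rows keep source_url and simply have no data_url yet.
before = conn.execute(query).fetchone()

# Setting data_url later leaves source_url as it was.
conn.execute(
    "UPDATE _table_catalog SET data_url = ? WHERE table_name = ?",
    (
        "https://ourworldindata.org/grapher/population.csv"
        "?v=1&csvType=full&useColumnShortNames=true",
        "owid_population",
    ),
)
after = conn.execute(query).fetchone()
```

Rows written before the change read back with data_url unset, so existing readers of source_url and source_description are unaffected.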
Why this matters — mechanical refresh
With data_url and the existing load_params (which already records
mode, schema, merge_key, database, etc.), a refresh becomes fully
mechanical with no prose parsing:
1. SELECT data_url, load_params
   FROM _table_catalog
   WHERE table_name = 'owid_population'
2. Infer format from data_url extension (.csv / .parquet / .json / etc.)
3. Download data_url → temp file
4. load_file(path=<temp_file>, **load_params)
This would enable:
- A refresh_table tool that re-ingests any table from its source in one
  call.
- Bulk refresh of all tables that have a data_url set:
  SELECT table_name FROM _table_catalog WHERE data_url IS NOT NULL.
- Scheduled refresh without any human-maintained refresh scripts.
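The refresh flow described above could look roughly like the following sketch. It is not the project's implementation: the catalog is modelled as a plain mapping from table name to a row with data_url and load_params, load_file is passed in as a callable because its real signature is only known from the call shown above, and download_to_temp and refresh_all are hypothetical helper names.

```python
import os
import tempfile
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Mapping


def download_to_temp(data_url: str) -> str:
    """Download data_url to a temp file that keeps the URL path's extension."""
    url_path = urllib.parse.urlsplit(data_url).path
    suffix = os.path.splitext(url_path)[1]  # query string is already excluded
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    urllib.request.urlretrieve(data_url, temp_path)
    return temp_path


def refresh_table(
    table_name: str,
    catalog: Mapping[str, Mapping[str, Any]],
    load_file: Callable[..., Any],
    download: Callable[[str], str] = download_to_temp,
) -> Any:
    """Re-ingest one table using only its catalog row."""
    row = catalog[table_name]
    data_url = row.get("data_url")
    if not data_url:
        raise ValueError(f"{table_name} has no data_url; cannot refresh")
    temp_path = download(data_url)
    try:
        # The temp file keeps the extension, so load_file can infer the format.
        return load_file(path=temp_path, **row["load_params"])
    finally:
        if os.path.exists(temp_path):
            os.remove(temp_path)


def refresh_all(
    catalog: Mapping[str, Mapping[str, Any]],
    load_file: Callable[..., Any],
    download: Callable[[str], str] = download_to_temp,
) -> Dict[str, Any]:
    """Bulk refresh: every table whose data_url is set."""
    return {
        name: refresh_table(name, catalog, load_file, download)
        for name, row in catalog.items()
        if row.get("data_url")
    }
```

Passing load_file and download as arguments keeps the sketch independent of the real tool and makes it easy to exercise without network access.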
Format inference rule
Infer from the last path segment's extension before any query string
— the same rule load_file already uses for local paths:
| URL ending | Format |
| --- | --- |
| .csv | CSV |
| .parquet | Parquet |
| .json / .jsonl | JSON / NDJSON |
| .arrow / .ipc | Arrow IPC |
For URLs where the extension is absent or ambiguous (e.g. a presigned S3
URL with a UUID key), a source_format hint in load_params or an
additional data_format field could carry the override.
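A minimal sketch of this inference rule with an explicit override might look like the following. The function name infer_format and the returned format labels ("csv", "ndjson", and so on) are assumptions for illustration; source_format is the hint suggested above, not an existing parameter.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Extension-to-format mapping; the label strings are illustrative only.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "ndjson",
    ".arrow": "arrow_ipc",
    ".ipc": "arrow_ipc",
}


def infer_format(data_url: str, source_format: Optional[str] = None) -> str:
    """Return the format for data_url, preferring an explicit override."""
    if source_format:
        return source_format.lower()
    # urlsplit separates the path from the query string, so
    # "population.csv?v=1" yields the path ".../population.csv".
    url_path = urlsplit(data_url).path
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(
            f"cannot infer format from {data_url!r}; supply source_format"
        )
    return EXTENSION_FORMATS[suffix]
```

With this shape, the OWID URL from the example resolves to CSV from its path alone, while an extension-less presigned URL raises an error unless the override is supplied.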
Related issue
This pairs with the rename-drops-metadata issue: both are about making
_table_catalog fully machine-actionable rather than just human-readable.
A refresh_table tool (future) would depend on both fixes — metadata
surviving renames, and a structured data_url to fetch from.
Environment
- hyper-rust-api version: 0.2.3.re93d08d2
- Observed while loading 17 OWID datasets into the persistent database
and noting that refresh requires prose parsing of notes.