Summary
Design and implement a Resource Info Fetcher (RIF), a sidecar service that supplies OPA with metadata about resources (datasets, tables, topics, charts, …). It is meant to be the resource-side counterpart of the User Info Fetcher (UIF).
- UIF: "which groups is this user in?"
- RIF: "what do we know about this resource?"
Initially we want to implement DataHub as the backend and may add more as we progress. The design should allow further backends to be added in a non-breaking way, and leave room for a more generic approach once we have experience with different backends.
Background
- OPA policies become more useful when they can draw on attributes attached to the resource being accessed (owner, sensitivity tags, glossary terms, data product, …).
- Uniform Resource Name (URN):
urn:<NAMESPACE-IDENTIFIER>:<NAMESPACE-SPECIFIC-STRING>
- This metadata commonly lives in catalog systems (DataHub, OpenMetadata, …).
- The UIF pattern has proven that a sidecar with a stable API + swappable backends is a good fit for this problem shape.
- Spike by @soenkeliebau:
Goals
- HTTP API served by RIF, consumed by OPA (via http.send from Rego, analogous to UIF).
- Backend adapters for DataHub (at first).
- Cache layer between RIF and the backend.
- Ship as a separate binary from UIF (working assumption — revisit later).
Non-goals
- Write-back or metadata authoring — RIF is read-only.
Out of scope
- Come up with an abstraction layer shared by all RIF backends. For v1 every backend returns its own "proprietary" response.
- OpenMetadata, Synabi D-QUANTUM, Collibra
Data model (starting point from the spike, can be used or discarded)
Two structs live in the prototype repo:
pub struct ResourceInfo {
    #[serde(default)] pub tags: Vec<String>,
    #[serde(default)] pub glossary_terms: Vec<String>,
    #[serde(default)] pub owners: Vec<String>,
    #[serde(default)] pub domain: Option<String>,
    #[serde(default)] pub data_products: Vec<String>,
    #[serde(default)] pub custom_properties: BTreeMap<String, serde_json::Value>,
    #[serde(default)] pub custom_attributes: BTreeMap<String, serde_json::Value>,
    #[serde(default)] pub fields: BTreeMap<String, FieldInfo>,
}

pub struct FieldInfo {
    #[serde(rename = "type")] pub type_: String,
    #[serde(default)] pub tags: Vec<String>,
    #[serde(default)] pub glossary_terms: Vec<String>,
}
For comparison, UIF's user model:
struct UserInfo {
    id: Option<String>,
    username: Option<String>,
    groups: Vec<String>,
    custom_attributes: HashMap<String, serde_json::Value>,
}
Supported UIF backends today: ActiveDirectory, Entra, Keycloak, OpenLDAP, XFSC AAS.
Design questions (open)
1. API shape
We identified three approaches:
Option A — URN-based (Sebastian):
GET rif.com/metadata?urn=<urn>
Client (OPA / operator) constructs a URN string; RIF parses it. Terse and future-proof, but it requires someone to know the URN scheme for every backend/product combination. Product-specific knowledge is consolidated in Rego.
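To illustrate the "RIF parses it" step, here is a minimal sketch that splits a URN into the two parts of the generic urn:<NAMESPACE-IDENTIFIER>:<NAMESPACE-SPECIFIC-STRING> layout. The names `Urn` and `parse_urn` are hypothetical and not taken from the spike; backend-specific interpretation of the namespace-specific string is deliberately left out.

```rust
/// A URN split into its two top-level parts, following the generic
/// `urn:<NAMESPACE-IDENTIFIER>:<NAMESPACE-SPECIFIC-STRING>` layout.
/// (Hypothetical type, not part of the prototype.)
#[derive(Debug, PartialEq)]
pub struct Urn<'a> {
    pub namespace_id: &'a str,
    pub namespace_specific: &'a str,
}

/// Splits at the first two colons only, so colons nested inside the
/// namespace-specific string (as in DataHub URNs) are preserved.
/// Returns `None` if the scheme is not `urn` or a part is missing/empty.
pub fn parse_urn(input: &str) -> Option<Urn<'_>> {
    let mut parts = input.splitn(3, ':');
    let scheme = parts.next()?;
    let namespace_id = parts.next()?;
    let namespace_specific = parts.next()?;
    if !scheme.eq_ignore_ascii_case("urn")
        || namespace_id.is_empty()
        || namespace_specific.is_empty()
    {
        return None;
    }
    Some(Urn { namespace_id, namespace_specific })
}
```

A backend adapter would then dispatch on `namespace_id` (for DataHub, `li`) and interpret the remainder itself, which is where the per-backend URN knowledge mentioned above comes in.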
Option B — Typed endpoints per resource kind
GET rif.com/metadata/trinoTable?instance=…&catalog=…&schema=…&table=…
GET rif.com/metadata/trinoSchema?…
GET rif.com/metadata/kafkaTopic?instance=…&topic=…
…
Explicit and self-documenting, but the API surface grows with every product. Backends still have to translate to URNs internally. Consolidated in Rego.
Concerns: This requires the policy author to have explicit knowledge of both the platform and the metadata backend, instead of only the metadata backend. It is explicit, but might turn out cumbersome and impractical.
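For a feel of how Option B grows with every product, the following sketch models two of the endpoints above as enum variants. `ResourceRef` and `path_and_query` are hypothetical names, and the parameter values are inserted verbatim; a real client would need to percent-encode them.

```rust
/// Hypothetical typed reference to a resource, one variant per endpoint
/// listed under Option B. Every new product adds a variant here.
pub enum ResourceRef {
    TrinoTable {
        instance: String,
        catalog: String,
        schema: String,
        table: String,
    },
    KafkaTopic {
        instance: String,
        topic: String,
    },
}

impl ResourceRef {
    /// Renders the request path and query string for this resource.
    /// Assumption: values contain no characters that need URL encoding.
    pub fn path_and_query(&self) -> String {
        match self {
            ResourceRef::TrinoTable { instance, catalog, schema, table } => format!(
                "/metadata/trinoTable?instance={instance}&catalog={catalog}&schema={schema}&table={table}"
            ),
            ResourceRef::KafkaTopic { instance, topic } => {
                format!("/metadata/kafkaTopic?instance={instance}&topic={topic}")
            }
        }
    }
}
```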
Option C — Flat query with kind/type params (Malte):
GET rif.com/metadata/dataset?dataPlatform=<trino|iceberg|postgres>&value=<name>
GET rif.com/metadata/chart?id=<id>
Reflects the resource types of the backend in use, e.g. DataHub's dataset and chart. SDP-specific knowledge is expressed in params. Consolidated in Rego.
2. Catalog identity mapping
DataHub's URN format includes a platform_instance field. To construct e.g.:
urn:li:dataset:(urn:li:dataPlatform:trino,<database>.<schema>.<table>,<ENVIRONMENT>)
OPA needs to know the platform_instance value for the Trino cluster being queried, which is derived from metadata.name of the TrinoCluster CR.
Background Knowledge: Trino 480 adds an "additional context file" mechanism ([OPA] Add additional context file to Trino OPA plugin). We plan to have trino-operator inject TrinoCluster.metadata.name and .metadata.namespace into this file so it reaches OPA input on every request.
Then, two configuration paths in the OpaCluster CR:
- If .spec.backend.datahub.resourceMapping is present, metadata.name is translated to a configured platform_instance value.
- If not, metadata.name is used verbatim as the platform_instance.
The same pattern is presumably needed for Kafka, Iceberg, etc.
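The two configuration paths can be sketched as a single lookup with a fallback. This assumes resourceMapping is a plain map from cluster name to platform_instance, which is not decided yet. The function name `resolve_platform_instance` is hypothetical, and falling back to the verbatim name when the mapping exists but has no entry for the cluster is also an assumption.

```rust
use std::collections::BTreeMap;

/// Resolves the DataHub `platform_instance` for a cluster.
///
/// `resource_mapping` stands in for the optional
/// `.spec.backend.datahub.resourceMapping` section (assumed shape:
/// cluster `metadata.name` -> `platform_instance`).
pub fn resolve_platform_instance(
    cluster_name: &str,
    resource_mapping: Option<&BTreeMap<String, String>>,
) -> String {
    resource_mapping
        // Path 1: a mapping is configured, translate the name if listed.
        .and_then(|mapping| mapping.get(cluster_name))
        .cloned()
        // Path 2: no mapping (or no entry), use the name verbatim.
        .unwrap_or_else(|| cluster_name.to_string())
}
```

Whether a missing entry in a configured mapping should fall back silently or be treated as a configuration error is a design choice this sketch does not settle.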
3. Table-name to URN mapping in DataHub
Example URNs:
urn:li:dataset:(urn:li:dataPlatform:trino,<schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:trino,<other-schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:iceberg,.<yet-another-schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:postgres,<database>.<schema>.<table>)
urn:li:chart:(superset,<chart-id>)
We could not agree on the exact table name format. This remains an open point.
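Whatever naming is chosen, the URN assembly itself is mechanical. The sketch below follows the three-part dataset form shown in section 2 (platform, name, environment) and joins the name parts with dots. Which parts go into the name is exactly the open point, so the caller passes them in. `dataset_urn` is a hypothetical helper, not DataHub's API.

```rust
/// Builds a DataHub-style dataset URN of the form
/// `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<environment>)`.
///
/// `name_parts` are joined with '.'; which parts belong there
/// (database, schema, table, platform instance, ...) is still undecided.
pub fn dataset_urn(platform: &str, name_parts: &[&str], environment: &str) -> String {
    let name = name_parts.join(".");
    format!("urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{environment})")
}
```

For example, `dataset_urn("trino", &["db", "schema", "table"], "PROD")` yields `urn:li:dataset:(urn:li:dataPlatform:trino,db.schema.table,PROD)`.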
References:
4. Field coverage
The customer's current DataHub script uses a single GraphQL call against dataset(urn: …) and pulls:
- Dataset-level tags (names only, URN prefix stripped)
- Owners: emails of CorpUser owners only (CorpGroup owners are silently dropped; is that intentional?)
- Columns: fieldPath + type for every column in schemaMetadata.fields
- Per-column tags from editableSchemaMetadata (user-added), NOT from schemaMetadata.fields[].globalTags (ingestion-time tags are missed)
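If RIF should cover both tag sources, the prefix stripping and merging could look roughly like this. The sketch assumes tag URNs use the `urn:li:tag:` prefix and that both sources arrive as lists of URN strings. `tag_name` and `merge_column_tags` are hypothetical helpers, and merging is only one possible answer to the gap noted above.

```rust
use std::collections::BTreeSet;

/// Assumed prefix of DataHub tag URNs.
const TAG_URN_PREFIX: &str = "urn:li:tag:";

/// Returns the bare tag name; input without the prefix is passed through.
pub fn tag_name(urn: &str) -> &str {
    urn.strip_prefix(TAG_URN_PREFIX).unwrap_or(urn)
}

/// Merges user-added tags (editableSchemaMetadata) with ingestion-time
/// tags (schemaMetadata.fields[].globalTags) into a sorted,
/// de-duplicated list of tag names.
pub fn merge_column_tags(editable: &[&str], ingested: &[&str]) -> Vec<String> {
    let names: BTreeSet<&str> = editable
        .iter()
        .chain(ingested)
        .copied()
        .map(tag_name)
        .collect();
    names.into_iter().map(str::to_string).collect()
}
```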
TODO: Go through DataHub's dataset entity docs and decide which additional fields belong in the initial ResourceInfo schema.
5. Separate binary vs. merge with UIF
Working assumption: separate binary.
Rationale: independent lifecycle, different auth surface to different backends, easier to keep the UIF simple.
Decision to be confirmed once RIF's dependencies are clearer.
6. Metadata standards worth considering
https://egeria-project.org/ came up as a possible open metadata standard to design against.
We should evaluate whether it is worth leaning on this "product" (this also depends on whether we can find a useful abstraction across multiple backends).