Resource Info Fetcher (RIF) #848

## Summary
Design and implement a **Resource Info Fetcher** (RIF), a sidecar service that supplies OPA with metadata about *resources* (datasets, tables, topics, charts, …). It is meant to be the resource-side counterpart to the **User Info Fetcher** (UIF).

- UIF: *"which groups is this user in?"*
- RIF: *"what do we know about this resource?"*

Initially we want to implement DataHub as the backend and may add more as we progress. The design should allow further backends to be added in a non-breaking way, and leave room for a more generic approach once we have experience with different backends.

## Background
- OPA policies become more useful when they can draw on attributes attached to the resource being accessed (owner, sensitivity tags, glossary terms, data product, …).
- Uniform Resource Name ([URN](https://datatracker.ietf.org/doc/html/rfc8141)): `urn:<NAMESPACE-IDENTIFIER>:<NAMESPACE-SPECIFIC-STRING>`
- Metadata commonly lives in catalog systems (DataHub, OpenMetadata, …).
- The UIF pattern has proven that a sidecar with a stable API + swappable backends is a good fit for this problem shape.
- Spike by @soenkeliebau:
  - branch comparison: <https://github.com/stackabletech/opa-operator/compare/main...feat/resource-info-fetcher>
  - prototype repo: <https://github.com/stackabletech/resourceinfofetcher>

## Goals
- HTTP API served by RIF, consumed by OPA (via `http.send` from Rego, analogous to UIF).
- Backend adapters for DataHub (at first).
- Cache layer between RIF and the backend.
- Ship as a **separate binary** from UIF (working assumption — revisit later).
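
To make the cache-layer goal more concrete, here is a minimal sketch of a time-to-live cache that could sit between RIF and a backend. The type `TtlCache`, the method `get_or_fetch`, and the choice of a fixed TTL are illustrative assumptions and are not taken from the spike; a real implementation would likely also need eviction and concurrent access.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory cache keyed by a resource identifier (e.g. a URN).
pub struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value if it is younger than the TTL. Otherwise calls
    /// `fetch` (standing in for the backend request), stores the result and
    /// returns it. Errors from `fetch` are passed through and never cached.
    pub fn get_or_fetch<E>(
        &mut self,
        key: &str,
        fetch: impl FnOnce() -> Result<V, E>,
    ) -> Result<V, E> {
        let now = Instant::now();
        if let Some((stored_at, value)) = self.entries.get(key) {
            if now.duration_since(*stored_at) < self.ttl {
                return Ok(value.clone());
            }
        }
        let value = fetch()?;
        self.entries.insert(key.to_owned(), (now, value.clone()));
        Ok(value)
    }
}
```

Passing the backend call as a closure keeps the cache independent of any particular backend, which fits the goal of adding backends later without touching the caching logic.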


## Non-goals

- Write-back or metadata authoring — RIF is read-only.

## Out of scope
- Come up with an abstraction layer shared by all RIF backends. For v1 every backend returns its own "proprietary" response.
- Other backends: OpenMetadata, Synabi D-QUANTUM, Collibra.


## Data model (starting point from the spike, can be used or discarded)

Two structs live in the prototype repo:

```rust
pub struct ResourceInfo {
    #[serde(default)] pub tags: Vec<String>,
    #[serde(default)] pub glossary_terms: Vec<String>,
    #[serde(default)] pub owners: Vec<String>,
    #[serde(default)] pub domain: Option<String>,
    #[serde(default)] pub data_products: Vec<String>,
    #[serde(default)] pub custom_properties: BTreeMap<String, serde_json::Value>,
    #[serde(default)] pub custom_attributes: BTreeMap<String, serde_json::Value>,
    #[serde(default)] pub fields: BTreeMap<String, FieldInfo>,
}

pub struct FieldInfo {
    #[serde(rename = "type")] pub type_: String,
    #[serde(default)] pub tags: Vec<String>,
    #[serde(default)] pub glossary_terms: Vec<String>,
}
```

For comparison, UIF's user model:

```rust
struct UserInfo {
    id: Option<String>,
    username: Option<String>,
    groups: Vec<String>,
    custom_attributes: HashMap<String, serde_json::Value>,
}
```

Supported UIF backends today: ActiveDirectory, Entra, Keycloak, OpenLDAP, XFSC AAS.

## Design questions (open)

### 1. API shape

We identified three approaches:
 
**Option A — URN-based (Sebastian):**

```
GET rif.com/metadata?urn=<urn>
```

Client (OPA / operator) constructs a URN string; RIF parses it. Terse and future-proof, but requires *someone* to know the URN scheme for every backend/product combination. Product-specific knowledge is consolidated in Rego.
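
As a rough illustration of the parsing step in Option A, the sketch below splits a URN into its namespace identifier and namespace-specific string, following the `urn:<NID>:<NSS>` shape from RFC 8141. The names `Urn` and `parse_urn` are hypothetical, and the function deliberately skips the full RFC 8141 validation (allowed characters, r-, q- and f-components).

```rust
/// A URN split into namespace identifier (NID) and namespace-specific
/// string (NSS). Hypothetical type, not part of the prototype.
#[derive(Debug, PartialEq)]
pub struct Urn<'a> {
    pub nid: &'a str,
    pub nss: &'a str,
}

/// Splits `urn:<NID>:<NSS>` at the first two colons only, so an NSS that
/// itself contains colons (as DataHub URNs do) stays intact.
pub fn parse_urn(input: &str) -> Result<Urn<'_>, String> {
    let (scheme, rest) = input
        .split_once(':')
        .ok_or_else(|| format!("not a URN: {}", input))?;
    // The "urn" scheme is case-insensitive according to RFC 8141.
    if !scheme.eq_ignore_ascii_case("urn") {
        return Err(format!("unexpected scheme `{}`", scheme));
    }
    let (nid, nss) = rest
        .split_once(':')
        .ok_or_else(|| format!("missing namespace-specific string: {}", input))?;
    if nid.is_empty() || nss.is_empty() {
        return Err(format!("empty NID or NSS: {}", input));
    }
    Ok(Urn { nid, nss })
}
```

A backend adapter would still have to interpret the NSS itself, which is where the knowledge of each backend's URN scheme ends up.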

**Option B — Typed endpoints per resource kind**

```
GET rif.com/metadata/trinoTable?instance=…&catalog=…&schema=…&table=…
GET rif.com/metadata/trinoSchema?…
GET rif.com/metadata/kafkaTopic?instance=…&topic=…
…
```

Explicit and self-documenting, but the API surface grows with every product. Backends still have to translate to URNs internally. Consolidated in Rego.

*Concerns*: This requires the policy author to have explicit knowledge of the platform and the metadata backend instead of *only* the metadata backend. It is explicit, but might prove cumbersome and impractical.
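
To show how Option B's API surface grows with each product, here is a small sketch that turns an endpoint name plus query parameters into a typed value. Only the two endpoints whose parameters are spelled out above are covered; the names `ResourceRef`, `parse_request` and `required` are assumptions made for this example.

```rust
use std::collections::HashMap;

/// One variant per typed endpoint; every new product adds a variant here.
#[derive(Debug, PartialEq)]
pub enum ResourceRef {
    TrinoTable { instance: String, catalog: String, schema: String, table: String },
    KafkaTopic { instance: String, topic: String },
}

/// Fetches a non-empty query parameter or reports which one is missing.
fn required(params: &HashMap<String, String>, name: &str) -> Result<String, String> {
    params
        .get(name)
        .filter(|value| !value.is_empty())
        .cloned()
        .ok_or_else(|| format!("missing query parameter `{}`", name))
}

/// Maps the last path segment (e.g. `trinoTable`) and its query parameters
/// to a typed resource reference.
pub fn parse_request(
    kind: &str,
    params: &HashMap<String, String>,
) -> Result<ResourceRef, String> {
    match kind {
        "trinoTable" => Ok(ResourceRef::TrinoTable {
            instance: required(params, "instance")?,
            catalog: required(params, "catalog")?,
            schema: required(params, "schema")?,
            table: required(params, "table")?,
        }),
        "kafkaTopic" => Ok(ResourceRef::KafkaTopic {
            instance: required(params, "instance")?,
            topic: required(params, "topic")?,
        }),
        other => Err(format!("unsupported resource kind `{}`", other)),
    }
}
```

The upside is that missing parameters are rejected at the API boundary; the cost is one more variant and match arm per resource kind.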

**Option C — Flat query with kind/type params (Malte):**

```
GET rif.com/metadata/dataset?dataPlatform=<trino|iceberg|postgres>&value=<name>
GET rif.com/metadata/chart?id=<id>
```

Reflects the resource types of the backend in use, e.g. DataHub's datasets and charts. SDP-specific knowledge is expressed in params. Consolidated in Rego.

### 2. Catalog identity mapping

DataHub's URN format includes a `platform_instance` field. To construct e.g.:

```
urn:li:dataset:(urn:li:dataPlatform:trino,<database>.<schema>.<table>,<ENVIRONMENT>)
```

OPA needs to know the `platform_instance` value for the Trino cluster being queried, which is reflected by `metadata.name` of the `TrinoCluster` CR.

**Background Knowledge**: Trino 480 adds an "additional context file" mechanism ([[OPA] Add additional context file to Trino OPA plugin](https://github.com/trinodb/trino/pull/25993)). We plan to have `trino-operator` inject `TrinoCluster.metadata.name` and `.metadata.namespace` into this file so it reaches OPA input on every request.

Then, two configuration paths in the `OpaCluster` CR:

- If `.spec.backend.datahub.resourceMapping` is present, `metadata.name` is translated to a configured `platform_instance` value.
- If not, `metadata.name` is used verbatim as the `platform_instance`.

The same pattern is presumably needed for Kafka, Iceberg, etc.
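
The two configuration paths can be sketched as a single lookup with a fallback. The function name `resolve_platform_instance` and the representation of `resourceMapping` as a string-to-string map are assumptions. The description above also does not say what happens when a mapping exists but has no entry for the cluster; this sketch assumes the name is then used verbatim.

```rust
use std::collections::HashMap;

/// Resolves the DataHub `platform_instance` for a cluster.
///
/// - mapping present and containing the cluster name: use the mapped value
/// - no mapping at all: use the cluster name verbatim
/// - mapping present but without an entry: also falls back to the cluster
///   name (an assumption of this sketch, not a decided behaviour)
pub fn resolve_platform_instance(
    resource_mapping: Option<&HashMap<String, String>>,
    cluster_name: &str,
) -> String {
    resource_mapping
        .and_then(|mapping| mapping.get(cluster_name))
        .cloned()
        .unwrap_or_else(|| cluster_name.to_owned())
}
```

Returning an error for an unmapped cluster would be the stricter alternative and might be preferable if silent mismatches with the catalog are a concern.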

### 3. Table-name to URN mapping in DataHub

Example URNs:

```
urn:li:dataset:(urn:li:dataPlatform:trino,<schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:trino,<other-schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:iceberg,.<yet-another-schema>.<table>)
urn:li:dataset:(urn:li:dataPlatform:postgres,<database>.<schema>.<table>)
urn:li:chart:(superset,<chart-id>)
```

We could not agree on the exact table-name format; this remains an open point.
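
Whatever naming is chosen, the URN assembly itself is mechanical. The sketch below builds a dataset URN in the three-part form shown in section 2 (platform, name, environment) and a chart URN in the form listed above. The helper names are hypothetical, and the number and order of name segments is exactly the open point, so the caller passes them in.

```rust
/// Joins name segments with dots, e.g. ["db", "schema", "table"].
/// Which segments belong here is the open question; this helper takes
/// them as given and does not validate or reorder them.
pub fn dataset_name(segments: &[&str]) -> String {
    segments.join(".")
}

/// Builds `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<environment>)`.
pub fn dataset_urn(platform: &str, name: &str, environment: &str) -> String {
    format!(
        "urn:li:dataset:(urn:li:dataPlatform:{},{},{})",
        platform, name, environment
    )
}

/// Builds `urn:li:chart:(<tool>,<chart-id>)`, matching the Superset example.
pub fn chart_urn(tool: &str, chart_id: &str) -> String {
    format!("urn:li:chart:({},{})", tool, chart_id)
}
```

Keeping name construction separate from URN formatting means the table-name decision only affects `dataset_name` and its callers.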

References:
- <https://docs.datahub.com/docs/generated/ingestion/sources/trino>
- <https://docs.datahub.com/docs/generated/metamodel/entities/dataset/#identity>

### 4. Field coverage

The customer's current DataHub script uses a single GraphQL call against `dataset(urn: …)` and pulls:

- Dataset-level tags (names only, URN prefix stripped)
- Owners — emails of `CorpUser` owners only (**`CorpGroup` owners silently dropped** — is that intentional?)
- Columns: `fieldPath` + `type` for every column in `schemaMetadata.fields`
- Per-column tags from `editableSchemaMetadata` (user-added), **NOT** from `schemaMetadata.fields[].globalTags` (ingestion-time tags missed)
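
If both tag sources should be covered, the per-column handling could look roughly like the sketch below: strip the tag URN prefix, then merge user-added and ingestion-time tags into one deduplicated list. The function names are hypothetical, the inputs are assumed to be tag URNs already extracted from the GraphQL response, and the `urn:li:tag:` prefix is assumed to be the one the current script strips.

```rust
use std::collections::BTreeSet;

/// Strips the DataHub tag URN prefix; input without the prefix is
/// returned unchanged.
pub fn tag_name(tag_urn: &str) -> &str {
    tag_urn.strip_prefix("urn:li:tag:").unwrap_or(tag_urn)
}

/// Merges user-added tags (from `editableSchemaMetadata`) with
/// ingestion-time tags (from `schemaMetadata.fields[].globalTags`) into a
/// sorted list of tag names without duplicates.
pub fn merge_column_tags(editable: &[&str], ingested: &[&str]) -> Vec<String> {
    let names: BTreeSet<String> = editable
        .iter()
        .chain(ingested.iter())
        .map(|urn| tag_name(urn).to_owned())
        .collect();
    names.into_iter().collect()
}
```

Sorting through a `BTreeSet` gives a stable order, which keeps responses deterministic for policy tests; whether ingestion-time tags should be included at all is still part of the open question above.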

TODO: Go through DataHub's dataset entity docs and decide which additional fields belong in the initial `ResourceInfo` schema.

### 5. Separate binary vs. merge with UIF

Working assumption: **separate binary**. 
Rationale: independent lifecycle, different auth surface to different backends, easier to keep the UIF simple. 

Decision to be confirmed once RIF's dependencies are clearer.

### 6. Metadata standards worth considering

<https://egeria-project.org/> came up as a possible open metadata standard to design against. 
We should evaluate whether it is worth leaning on this "product" (this also depends on whether we can find a useful abstraction over multiple backends).
